Creating a Simple Bloom Filter (Posted on February 2nd, 2013)

Bloom filters are super efficient data structures that allow us to tell if an object is most likely in a data set or not by checking a few bits. Bloom filters return some false positives but no false negatives. Luckily we can control the amount of false positives we receive with a trade off of time and memory.

You may have never heard of a bloom filter before but you've probably interacted with one at some point. For instance if you use Chrome, Chrome has a bloom filter of malicious URLs. When you visit a website it checks if that domain is in the filter. This prevents you from having to ping Google's servers every time you visit a website to check if it's malicious or not. Large databases such as Cassandra and Hadoop use bloom filters to see if it should do a large query or not.

From Cassandra's Architecture Overview:

Cassandra uses bloom filters to save IO when performing a key lookup: each [Sorted String Table] has a bloom filter associated with it that Cassandra checks before doing any disk seeks, making queries for keys that don't exist almost free.

What You'll Need

  • A bit array (as the name suggests it's just an array of bits)
  • A quick non-cryptographic hash function like murmurhash3 or cityhash

For Python you can do:

sudo pip install bitarray
sudo pip install mmh3

How They Work

A bloom filter is essentially a huge bit array. Bloom filters work by hashing an object several times using either multiple hash functions or the same hash function with a different seed. This insures that when we hash an object we're unlikely to get the same result. For each time an object is hashed the corresponding hash value in the bit array is then marked as 1. This is why we use a bit array. Rather than needing say 4 bytes to store a 1 or a 0 we can simply do it in a bit.

Here's an example to help make it easier to understand. Note we mod by the size of the bit array to prevent index out of bounds:

from bitarray import bitarray
import mmh3

bit_array = bitarray(10)
bit_array.setall(0)

b1 = mmh3.hash("hello", 41) % 10 #Equals 0
bit_array[b1] = 1
b2 = mmh3.hash("hello", 42) % 10 #Equals 4
bit_array[b2] = 1

At this point our bit array looks like this: 1000100000

If we want to check if "hello" is in our data set we do the same opreation back.

b1 = mmh3.hash("hello", 41) % 10 #Equals 0
b2 = mmh3.hash("hello", 42) % 10 #Equals 4
if bit_array[b1] == 1 and bit_array[b2] == 1:
    print "Probably in set"
else:
    print "Definitely not in set"

The reason it is only probably in the set is because a combination of items added to the data set could end up setting the same bits to 1. For instance say we have an empty bloom filter and maybe hashing "world" with seed 41 equalled 0 and hashing "foo" with seed 42 equalled 4. Then when you attempt to search for "hello" you'd get that it is probably in the set even though it was never added. You can prevent this from happening by using a larger bit array and more hash functions. There is some math to choosing these numbers that I'll get in to towards the end of the post.

The Class

Alright let's do this! Bloom filters take in 2 variables: size and number of times to run our hash function(s).

from bitarray import bitarray
import mmh3

class BloomFilter:
    
    def __init__(self, size, hash_count):
        self.size = size
        self.hash_count = hash_count
        self.bit_array = bitarray(size)
        self.bit_array.setall(0)
        
    def add(self, string):
        #Awesome code goes here
        
    def lookup(self, string):
        #Awesome code goes here

Using our knowledge from earlier we can create a function to add items to our filter in just 3 lines of code.

def add(self, string):
    for seed in xrange(self.hash_count):
        result = mmh3.hash(string, seed) % self.size
        self.bit_array[result] = 1

We hash the string with our hash function modded by the size of the bit array and then set that value in our bit array to 1. For each iteration our seed increases by 1 so that we can get different numbers from each hash.

The code for the lookup function is almost the same as the add function. For each hash we're just checking if the resulting value is set to 1 or 0 in our bit array. If we find a 0 then we respond back with "Nope". Otherwise it's probably there so we respond back with "Probably".

def lookup(self, string):
    for seed in xrange(self.hash_count):
        result = mmh3.hash(string, seed) % self.size
        if self.bit_array[result] == 0:
            return "Nope"
    return "Probably"

That's it. That's a bloom filter. Pretty simple right?

Finding The Right Values

The big thing we want to control when creating our bloom filter is our false positive rate. To do this we'll need to have an idea of the size of our data set. Wikipedia has the math and fancy graphs behind choosing your values. Here's what you need to know though:

The formula to calculate the size of your bit array (m = size of bit array, n = expected number of items, and p = probability percentage represented as a decimal):

m equals negative n times the natural log p all over the natural log of 2 sqaured

The formula to calculate the number of hashes to use (k = number of hashes to use):

k equals negative m over n times the natural log of 2

You can also play around with values using this formula which will give you your false positive probability:

false probability equals ( 1 minus [ 1 minus 1 over m] raised to the power k times n ) raised to the power of k

Testing Our Code

The built-in american english dictionary on Ubuntu has about 100,000 words in it so let's use that with a bloom filter. According to WolframAlpha a bloom filter with 1.1 million bits and 7 hash functions has about a 0.5% chance of a false positive.

from bitarray import bitarray
import mmh3

class BloomFilter:
    
    def __init__(self, size, hash_count):
        self.size = size
        self.hash_count = hash_count
        self.bit_array = bitarray(size)
        self.bit_array.setall(0)
        
    def add(self, string):
        for seed in xrange(self.hash_count):
            result = mmh3.hash(string, seed) % self.size
            self.bit_array[result] = 1
            
    def lookup(self, string):
        for seed in xrange(self.hash_count):
            result = mmh3.hash(string, seed) % self.size
            if self.bit_array[result] == 0:
                return "Nope"
        return "Probably"

bf = BloomFilter(500000, 7)

lines = open("/usr/share/dict/american-english").read().splitlines()
for line in lines:
    bf.add(line)

print bf.lookup("google")
>>> Nope
print bf.lookup("Max")
>>> Probably
print bf.lookup("mice")
>>> Probably
print bf.lookup("3")
>>> Nope

Here's some code we can use to do a time comparison. We'll create a huge list of all our words then iterate through them one by one til we find our target string.

bf = BloomFilter(500000, 7)
huge = []

lines = open("/usr/share/dict/american-english").read().splitlines()
for line in lines:
    bf.add(line)
    huge.append(line)

import datetime

start = datetime.datetime.now()
bf.lookup("google")
finish = datetime.datetime.now()
print (finish-start).microseconds
>>> 57


start = datetime.datetime.now()
for word in huge:
    if word == "google":
        break
finish = datetime.datetime.now()
print (finish-start).microseconds
>>> 9891


start = datetime.datetime.now()
bf.lookup("apple")
finish = datetime.datetime.now()
print (finish-start).microseconds
>>> 10


start = datetime.datetime.now()
for word in huge:
    if word == "apple":
        break
finish = datetime.datetime.now()
print (finish-start).microseconds
>>> 1935

Those are some pretty big time differences. We saw an over 10,000% performance increase which can really make a difference when doing expensive lookups. Not to mention the memory footprint is much smaller than storing each item in a list. 1.1 million bits is only 0.13 megabytes. Not bad at all if you ask me.

You can grab the source from this post on GitHub.

As always if you have any feedback or questions feel free to drop them in the comments below or contact me privately on my contact page. Thanks for reading!

Tags: Python, Data Structures

Comments:

  • Edgeworth - 3 years ago

    The second example is kind of silly. Why not use a set?

    reply

  • José Ricardo - 3 years ago

    I think that using a set would use more memory.

    reply

  • instagram followers - 7 months ago

    I recently came across your blog and have been reading along. I thought I would leave my first comment. I don't know what to say except that I have enjoyed reading. Nice blog. I will keep visiting this blog very often.

    reply

  • comments - 7 months ago

    This blog is so nice to me. I will keep on coming here again and again. Visit my link as well..

    reply

  • Life Insurance Calgary - 4 months, 3 weeks ago

    I loved as much as you will receive carried out right here. The sketch is attractive, your authored subject matter stylish. nonetheless, you command get bought an edginess over that you wish be delivering the following. unwell unquestionably come further formerly again as exactly the same nearly very often inside case you shield this hike.

    reply

  • Lars - 3 years ago

    The benchmark is flawed. Python's lists are not for item lookup; they're for keeping items in a certain order. I just did some profiling using IPython's %timeit on a larger dictionary, and I found 49ms for your linear search, 16.2ms for Python's built-in linear search (the "in" operator), 109ns for Python's set and 1.96μs for the Bloom filter. I.e., set is >20× faster than your BF. (Of course, that's not entirely fair either because of interpreter overhead in your code.) Also, the (false) positives take more time than the negatives, so you're only measuring the easy case here.

    reply

  • Max Burstein - 3 years ago

    Definitely a valid point. As José pointed out, the main win here would be the memory size. I just did a simple test of adding 100,000 integers to a set and the size was just over 4mb which isn't too bad. You'll definitely run out of memory much quicker by storing large groups of data in a set. Bloom Filters really strike a good balance between memory size and lookup speed.

    reply

  • how to get real instagram followers - 7 months, 2 weeks ago

    I found your this post while searching for information about blog-related research ... It's a good post .. keep posting and updating information.

    reply

  • buy instagram comments fast - 7 months, 2 weeks ago

    it was a wonderful chance to visit this kind of site and I am happy to know. thank you so much for giving us a chance to have this opportunity..

    reply

  • followers twitter - 6 months, 2 weeks ago

    Thank you for helping people get the information they need. Great stuff as usual. Keep up the great work!!!

    reply

  • get likes - 6 months, 2 weeks ago

    Thanks for this great post, i find it very interesting and very well thought out and put together. I look forward to reading your work in the future.

    reply

  • Geoff - 3 years ago

    On a side note, you should be using the "if "apple" in huge" syntax to check if an element is in your list for a more fair comparison. You'll notice a timing improvement, though it won't be anywhere close to your bloom filter. On my machine, I go from ~1500µs to ~700µs.

    reply

  • Heard about Seattle Landscape Architect website - 1 month ago

    A filter for a computer it can improve the performance of the computer to make it faster.

    reply

  • cialis - 10 months, 2 weeks ago

    Hello!

    reply

  • entertainment news - 10 months ago

    Thanks for the blog loaded with so many information. Stopping by your blog helped me to get what I was looking for.

    reply

  • hay day cheats - 9 months, 4 weeks ago

    Superbly written article, if only all bloggers offered the same content as you, the internet would be a far better place..

    reply

  • 8 ball pool hack - 9 months, 4 weeks ago

    Admiring the time and effort you put into your blog and detailed information you offer!..

    reply

  • soundcloud plays buy - 9 months, 1 week ago

    This is just the information I am finding everywhere. Thanks for your blog, I just subscribe your blog. This is a nice blog..

    reply

  • buying soundcloud followers - 9 months, 1 week ago

    This type of message always inspiring and I prefer to read quality content, so happy to find good place to many here in the post, the writing is just great, thanks for the post.

    reply

  • Assignment Help UK - 8 months, 4 weeks ago

    This is a great article thanks for sharing this informative information.

    reply

  • custom essay uk - 8 months, 4 weeks ago

    Your article is really insightful! Looking at your work has enlightened me. Discovered a lot from it. I will take a note of your site and will pursue to go through your forthcoming blog posts. Wonderful! Thanks a lot

    reply

  • buy dissertations - 7 months, 3 weeks ago

    Thank you for this useful tips.

    reply

  • girlsdoporn - 7 months, 2 weeks ago

    I have joined the oracle Trained in Training in Bangalore. Oracle Trainer will be teaching in practically Manner. So I can Easily able to understand the any Concept. This is very helpful for the interviews Purpose . I would refer my frnds to joined this institute.....

    reply

  • Schlüsseldienst Berlin - 7 months, 2 weeks ago

    Stunning! This could be a standout amongst the most helpful web journals we have ever go over on thesubject. Really brilliant information! I'm additionally a specialist in this subject so I can comprehend your exertion.

    reply

  • dissertation online help - 7 months, 2 weeks ago

    Just discovered your online journal and was immediately flabbergasted with all the helpful data that is on it. Extraordinary post

    reply

  • Software Testing Training in Chennai - 7 months, 2 weeks ago

    Very Informative article on Creating a Simple Bloom Filter. Thanks

    reply

  • how to get a lot of followers on instagram - 7 months, 1 week ago

    Thank you very much for this useful article. I like it.

    reply

  • likes on instagram - 7 months, 1 week ago

    Your website is really cool and this is a great inspiring article.

    reply

  • Writing Economics Research Paper - 7 months, 1 week ago

    Hi there, I would like to know What does this line of code means : maxint = (2**(0.size * 8 -2) -1) How did you even know what are the values for maxint?

    reply

  • green smart living - 6 months, 3 weeks ago

    Such intelligent work on the subject and ideal way of writing here. I am really impressed! This post is a helpful overview of the particular topic and very actionable. Interesting approach!

    reply

  • how to get more subscribers on youtube - 6 months, 3 weeks ago

    This is definitely the knowledge We are acquiring all over the place. Cheers for ones web site, I merely join your blog. This is the wonderful web site. .

    reply

  • youtube comment bot - 6 months, 3 weeks ago

    This is actually the content Now i'm searching for anywhere. Regards for use on your web page, I simply subscribe your blog. They can be a excellent web page. .

    reply

  • learn about debt negotiation - 6 months, 2 weeks ago

    Truly! Whatever an eye opener this unique put up happens to be in my circumstances. Substantially relished, saved, I just can’t look for further!

    reply

  • Anonymous - 6 months, 2 weeks ago

    This was a really great contest and hopefully I can attend the next one. It was alot of fun and I really enjoyed myself.. <a href="http://gamehack.org/Boom-Beach-Hack/">boom beach hack</a>|<a href="http://clashofclanstriche.net/">clash of clans</a>

    reply

  • web hosting murah - 6 months, 1 week ago

    WebiiHost telah memberikan layanan web hosting dan registrasi domain di indonesia sejak tahun 2004 dengan harga yang murah. Untuk memenuhi kebutuhan client , kami juga menyediakan layanan reseller hosting, VPS dan Dedicated Server Indonesia.

    reply

  • buyessayanyday.com - 6 months, 1 week ago

    We have every reason to believe that you are right. Thanks for this post.

    reply

  • Anonymous - 6 months, 1 week ago

    Here's a better Wolfram Alpha link:

    http://www.wolframalpha.com/input/?i=%281-%281-1%2Fm%29^%28k*n%29%29^k++for+k%3D7+%2C+n%3D100000%2C+m%3D1100000

    reply

  • this site - 6 months, 1 week ago

    This article gives the light in which we can observe the reality. This is very nice one and gives indepth information. Thanks for this nice article.

    reply

  • IB Maths tutor - 6 months ago

    I would like to thank you for the efforts you have made in writing this article. I am hoping the same best work from you in the future as well.

    reply

  • youtube subscribers - 5 months, 3 weeks ago

    I felt very happy while reading this site. This was really very informative site for me. I really liked it. This was really a cordial post. Thanks a lot!.

    reply

  • free youtube views - 5 months, 3 weeks ago

    I really like your take on the issue. I now have a clear idea on what this matter is all about..

    reply

  • Puja Kumari - 5 months, 3 weeks ago

    Thanks for post this helpful post - Please visit for More information about: http://www.top9th.in/packers-and-movers-pune/ http://www.top9th.in/packers-and-movers-noida/ http://www.top9th.in/packers-and-movers-chennai/ http://www.top9th.in/packers-and-movers-kolkata/ http://www.top9th.in/packers-and-movers-ahmedabad/ http://www.top9th.in/packers-and-movers-surat/

    reply

  • Puja Kumari - 5 months, 3 weeks ago

    Thanks for Nice and Informative Post. This article is really contains lot more information about This Topic http://www.top9th.in/packers-and-movers-bangalore/ http://www.top9th.in/packers-and-movers-hyderabad/ http://www.top9th.in/packers-and-movers-mumbai/ http://www.top9th.in/packers-and-movers-delhi/ http://www.top9th.in/packers-and-movers-gurgaon/ http://www.top9th.in/

    reply

  • buy stilnoct no prescription needed - 5 months, 2 weeks ago

    I am so much pleased to get this planning. I am looking ahead to read more about it. This is a very lovely post.

    reply

  • Spermaspender gesucht - 4 months, 4 weeks ago

    Many thanks for the exciting blog posting! Simply put your blog post to my favorite blog list and will look forward for additional updates.

    reply

  • Packers Movers - 4 months, 1 week ago

    Thanks for post this helpful post - Please visit for More information about:

    http://www.expert9th.in/packers-and-movers-delhi/ http://www.expert9th.in/packers-and-movers-noida/ http://www.expert9th.in/packers-and-movers-gurgaon/ http://www.expert9th.in/packers-and-movers-kolkata/ http://www.expert9th.in/packers-and-movers-ghaziabad/ http://www.expert9th.in/packers-and-movers-amritsar/


    Thanks for post this helpful post - Please visit for More information about:

    http://www.expert9th.in/packers-and-movers-bangalore/ http://www.expert9th.in/packers-and-movers-hyderabad/ http://www.expert9th.in/packers-and-movers-chennai/ http://www.expert9th.in/packers-and-movers-pune/ http://www.expert9th.in/packers-and-movers-faridabad/ http://www.expert9th.in/packers-and-movers-kochi/


    Thanks for post this helpful post - Please visit for More information about:

    http://www.expert9th.in/packers-and-movers-mumbai/ http://www.expert9th.in/packers-and-movers-surat/ http://www.expert9th.in/packers-and-movers-navimumbai/ http://www.expert9th.in/packers-and-movers-ahmedabad/ http://www.expert9th.in/packers-and-movers-chandigarh/ http://www.expert9th.in/packers-and-movers-jamshedpur/

    reply

  • Packers Movers - 4 months, 1 week ago

    Thanks for Nice and Informative Post. This article is really contains lot more information about This Topic http://www.top9th.in/packers-and-movers-bangalore/ http://www.top9th.in/packers-and-movers-hyderabad/ http://www.top9th.in/packers-and-movers-mumbai/ http://www.top9th.in/packers-and-movers-delhi/ http://www.top9th.in/packers-and-movers-gurgaon/ http://www.top9th.in/


    Thanks for post this helpful post - Please visit for More information about: http://www.top9th.in/packers-and-movers-pune/ http://www.top9th.in/packers-and-movers-noida/ http://www.top9th.in/packers-and-movers-chennai/ http://www.top9th.in/packers-and-movers-kolkata/ http://www.top9th.in/packers-and-movers-ahmedabad/ http://www.top9th.in/packers-and-movers-surat/

    reply

  •  likesplanet php script - 4 months ago

    I would like to thank you for the efforts you have made in writing this article. I am hoping the same best work from you in the future as well. In fact your creative writing abilities has inspired me to start my own BlogEngine blog now. Really the blogging is spreading its wings rapidly. Your write up is a fine example of it.

    reply

  • startup investment platform - 1 month ago

    I really love the quality writing as offered on this post, cheers to the writer.

    reply

  • logico - 1 month, 3 weeks ago

    nice to see that.. really helpfull.. this site is always very helpful for me.surely going to refer to my friends.. you can also find related data on

    reply

  • Sewa Mobil Surabaya - 1 month, 1 week ago

    HAPPY NEW YEAR 2016 HAPPY NEW YEAR 2016 HAPPY NEW YEAR 2016 HAPPY NEW YEAR 2016

    reply

  • happy wheels - 1 month, 1 week ago

    I gathered useful information on this point . Thank you posting relative information and its now becoming easier to complete this assignment

    reply

  • geometry dash - 1 month, 1 week ago

    Useful information.I am actual blessed to read this article.thanks for giving us this advantageous information.I acknowledge this post.and I would like bookmark this post.Thanks

    reply

  • happy wheels - 1 month, 1 week ago

    Wonderful blog! This is very informative site. I am totally pleased by your excellent work. Many thanks for sharing.

    reply

  • happy wheels 2 - 1 month, 1 week ago

    This was among the best posts and episode from your team it let me learn many new things.

    reply

  • hulk - 1 month, 1 week ago

    Really impressive post. I read it whole and going to share it with my social circules. I enjoyed your article and planning to rewrite it on my own blog.

    reply

  • sniper games - 1 month, 1 week ago

    In your blog I was happy to see your article, better than last time, and have made great progress, I am very pleased. I am looking forward to your article will become better and better.

    reply

  • my little pony games - 1 month, 1 week ago

    Thanks for the best blog. it was very useful for me.keep sharing such ideas in the future as well. Thanks for giving me the useful information. I think I need it!

    reply

  • mickey mouse games - 1 month, 1 week ago

    I read your post and I found it amazing! thank!

    reply

  • agario - 1 month, 1 week ago

    I will make sure to bookmark it and come back to learn extra of your helpful info. Thank you for the post. I will definitely comeback.

    reply

  • Heard about Seattle Health Records Scanning - 1 month ago

    Not all filters could work well on your computer, because a wide range of specifications.

    reply

  • hack gems clash of clans . summoners war hack andr - 3 weeks, 5 days ago

    As always if you have any feedback or questions feel free to drop them in the comments below or contact me privately on my contact page. Thanks for reading!

    reply

  • clash of clans hacks - 2 weeks, 5 days ago

    An SEO organization can perform together with a small company to give one more perspective, when it comes to understanding and developing promotion strategies for different sectors and various types of economic sites.

    reply

  • Clash of Clans Free Gems Elixir Gold 2016 - 2 weeks, 2 days ago

    Thanks for this tool 99999 gems

    reply

  • Game of thrones season 6 episode 1 - 6 days, 13 hours ago

    Thanks for this post man.

    reply

  • The Walking Dead Season 6 Episode 10 - 6 days, 13 hours ago

    translation service instantly translates text and web pages. This translator supports: English, Afrikaans, Albanian, Arabic

    reply

  • PSN Code Generator 2016 - 6 days, 13 hours ago

    professional human and machine translations between 75 languages. Translators can also edit paid jobs via our online portal.

    reply

  • Winrar Password Remover 2016 - 6 days, 13 hours ago

    Popular Phrases. On the phone. Money. Shopping. Clothing and ... Translate by Dictionary.com. provides free, online translations for more than 42

    reply

  • How To Get Free Instagram Followers Hack - 6 days, 13 hours ago

    Thansk. kunia panas

    reply

  • Jual Printer - 6 days ago

    Superb blog post. Any place strikes numerous pressing obstacles of your modern culture. People can not be involved that will those obstacles. The place delivers guidelines together with thoughts. Rather interesting together with handy.

    reply

  • You won't believe this Mesothelioma Lawyer Houston - 2 days, 21 hours ago

    This is not my ability in using this program language. But the logic can be convert to another language as we need. http://www.krwlawyers.com/Mesothelioma/Asbestos-Mesothelioma-Lawyer-Houston-TX.php

    reply

  • You won't believe this Seattle SEO Information - 2 days, 2 hours ago

    Any simple mark can make the program error. So it's so important to check every alphabet and marks carefully.

    reply

  • You won't believe this Immigration Lawyers Seattle - 2 days ago

    I like to searching for source code. And I can find many useful functions from here.

    reply