Zunayed Ali Morsalin home about travel resume

week 5 - Bloom filters and you

Jul 19, 2017

Use cases

Bloom filters are a probabilistic data structure that allows you to check for set membership. Figuring out if a item is in set is a 0(1) operation. You might be thinking big deal I can do that with a hashtable but with a hash table you’re space complexity is 0(n) and will grow quite massive for large lists. In a bloom filter you are storing hashes of your values as bits in a bitarray. You don’t actually store the data itself only information that will tell you if the item is in your list. Some use cases are a spell checker or a banned ip list. In fact chrome at one pointed shipped a bloom filter that would check if the website you’re trying to visit is a malicious site. A bloom filter storing info on a 100,000 sites with a .001 probability of false positives will only take up 1,437,759 bits (175.51 KB). Such a small footprint means you can ship the entire bitarray to the client to do client side checks and avoid server processing.

Python example

Here’s my attempt at recreating a bloom filter in python

import math
import hashlib

import bitarray
import mmh3


def calc_optimal_hash_func(lenght_of_entries):
    m = (-lenght_of_entries * math.log(0.01)) / ((math.log(2)) ** 2)
    k = (m / lenght_of_entries) * math.log(2)

    return int(m), int(math.ceil(k))


def lookup(string, bit_array, seeds):
    for seed in range(seeds):
        result = mmh3.hash(string, seed) % len(bit_array)
        # print "seed", seed, "pos", result, "->", bit_array[result]
        if bit_array[result] == False:
            return string, "Def not in dictionary"

    return string, "Probably in here"


def load_words():
    data        = []
    word_loc    = '/usr/share/dict/words'

    with open(word_loc, 'r') as f:
        for word in f.readlines():
            data.append(word.strip())

    return data


def get_bit_array():
    words               = load_words()
    w_length            = len(words)
    array_length, seeds  = calc_optimal_hash_func(w_length)
    bit_array           = array_length * bitarray.bitarray('0')

    for word in words:
        try:
            for seed in range(seeds):
                # print "word", word, "seed", seed, "pos", result, "->", bit_array[result]
                pos = mmh3.hash(word, seed) % array_length
                bit_array[pos] = True
        except:
            pass

    return seeds, bit_array

if __name__ == '__main__':
    seeds, ba = get_bit_array()
    print(lookup('badwordforsure', ba, seeds))
    print(lookup('cat', ba, seeds))
    print(lookup('hello', ba, seeds))
    print(lookup('jsalj', ba, seeds))