Bloom Filter (System Design)

Suppose you are building the signup page for GMAIL like service and you wanted to make sure that the username is unique, for that you must be checking if the username does not exist already in the database (Username is not taken already) then only you will assign that username to the user. But seeing Billion Daily Active Users for Gmail, your solution to this problem should be scalable enough to handle search in this very large dataset of user accounts.

Let's go through all possible solutions one by one -

Linear Search: A brute force solution will be firing a SQL query to check if that username already exists or not, this will result in O(n) time complexity.
Binary Search: We can store the username in alphabetical order and do a binary search. Although we have improved the search operation to O(log(n)) but now maintaining sorted order of usernames is expensive.
Using in-memory SET: We can store each username in a set and check in O(1) time if the username is in the set. But storing these many usernames in memory is challenging and not cost-efficient.

Bloom Filter

Bloom Filter is a space-efficient probabilistic data structure that is used to search an element within a very large set of elements in constant time that is O(k) where K is the number of hash functions being used in Bloom Filter.

Bloom Filter uses a bit array and hashing techniques to store the existence of a string. Suppose we have an array of length 10^15 and K=3 Hash functions

No alt text provided for this image

BIT Array and K=3 Hash Functions

Insertion in Bloom Filter -

Take user input
Pass input through each hash function
Take mod of output from BIT Array Length, you will have 3 array index
Turn on the BIT to 1

No alt text provided for this image

Insertion in Bloom Filter

Time Complexity - Constant Time O(k), where k is the number of hash functions

Key Lookup in Bloom Filter -

Take user input
Pass input through each hash function
Take mod of output from BIT Array Length, you will have 3 array index
Check if all 3 index in the BIT array is 1
If any BIT is 0 then the Key do not exist

No alt text provided for this image

Key Lookup in Bloom Filter

Time Complexity - Constant Time O(k), where k is the number of hash functions

Since it is a probabilistic data structure, Key Lookup output falls into two categories -

False Positive: It is possible to get a false positive which means even if the key does not exist, it might return that key is already there in the set.
False Negative: It will never return a false negative which means it returns accurate results if the key do not exist.

Time complexity

Operation	Time complexity
add item	O(k) or constant
membership query	O(k) or constant

where k is the number of hash functions.

The time complexity of the bloom filter is independent of the number of items already in the bloom filter. The k lookups in the bloom filter are independent and can be parallelized.

Space complexity

Regardless of the number of items in the bloom filter, the bloom filter requires a constant number of bits by reserving a few bits per item. The bloom filter does not store the data items yielding a constant space complexity of O(1).

Bloom filter calculator

The hur.st bloom filter calculator can be used to choose an optimal size for the bloom filter and explore how the different parameters interact. The accuracy of the bloom filter depends on the following:

number of hash functions (k)
quality of hash functions
length of the bit array (n)
number of items stored in the bloom filter

The properties of an optimal hash function for the bloom filter are the following:

fast
independent
uniformly distributed

Advantages of bloom filter

The advantages of the bloom filter are the following:

constant time complexity
constant space complexity
operations are parallelizable
no false negatives
enables privacy by not storing actual items

Bloom filter disadvantages

The limitations of the bloom filter are the following:

bloom filter doesn’t support the delete operation
false positives rate can’t be reduced to zero
bloom filter on disk requires random access due to random indices generated by hash functions

Removing an item from the bloom filter is not supported because there is no possibility to identify the bits that should be cleared. There might be other items that map onto the same bits and clearing any of the bits would introduce the possibility of false negatives.