Bloom Filters & Probabilistic Data Structures

Imagine you run a URL shortener like bit.ly with 10 billion shortened URLs. When a user creates a new short URL, you need to check if it already exists. Querying a database of 10 billion records for every single creation request is prohibitively slow. A Bloom Filter can answer "Does this URL exist?" in constant time, using a tiny fraction of the memory a full database index would require.

1. What is a Bloom Filter?

A Bloom Filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. It can tell you with certainty:

"Definitely NOT in the set" (no false negatives).
"Probably in the set" (possible false positives).

It NEVER produces false negatives. If the Bloom Filter says an element is not in the set, you can trust it 100%.

2. How It Works

Initialize: Create a bit array of size $m$, all set to 0. Choose $k$ independent hash functions.
Insert (element $x$): Compute $k$ hash values: $h_1(x), h_2(x), ..., h_k(x)$. Set the bits at positions $h_1(x) \mod m$, $h_2(x) \mod m$, ..., $h_k(x) \mod m$ to 1.
Query (element $y$): Compute $k$ hash values for $y$. Check the bits at all $k$ positions.
- If ANY bit is 0: The element is definitely NOT in the set.
- If ALL bits are 1: The element is probably in the set. (Other elements might have set those same bits.)

3. Trade-offs

False Positive Rate: As more elements are added, more bits are set to 1, increasing the probability that a query for a non-existent element finds all 1s by coincidence. The false positive rate can be tuned by adjusting $m$ (bit array size) and $k$ (number of hash functions).
No Deletion: Standard Bloom Filters do not support deletion (setting a bit to 0 could affect other elements). Counting Bloom Filters replace each bit with a counter to support deletion.

4. Real-World Use Cases

Google Chrome: Uses a Bloom Filter to check if a URL is in its list of known malicious websites. If the Bloom Filter says "probably yes", Chrome then queries a remote server for confirmation.
Apache Cassandra: Uses Bloom Filters to avoid unnecessary disk reads. Before reading a file (SSTable) from disk, it checks the Bloom Filter. If the key is "definitely not" in that file, the disk read is skipped entirely.
Medium: Uses Bloom Filters to avoid recommending articles a user has already read.
Akamai CDN: Uses Bloom Filters to avoid caching one-hit-wonder content. An object is only cached after the second request.

Bloom Filters & Probabilistic Data Structures

1. What is a Bloom Filter?

A Bloom Filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. It can tell you with certainty:

"Definitely NOT in the set" (no false negatives).

"Probably in the set" (possible false positives).

It NEVER produces false negatives. If the Bloom Filter says an element is not in the set, you can trust it 100%.

2. How It Works

Initialize: Create a bit array of size $m$, all set to 0. Choose $k$ independent hash functions.

Insert (element $x$): Compute $k$ hash values: $h_1(x), h_2(x), ..., h_k(x)$. Set the bits at positions $h_1(x) \mod m$, $h_2(x) \mod m$, ..., $h_k(x) \mod m$ to 1.

Query (element $y$): Compute $k$ hash values for $y$. Check the bits at all $k$ positions.

If ANY bit is 0: The element is definitely NOT in the set.
If ALL bits are 1: The element is probably in the set. (Other elements might have set those same bits.)

3. Trade-offs

False Positive Rate: As more elements are added, more bits are set to 1, increasing the probability that a query for a non-existent element finds all 1s by coincidence. The false positive rate can be tuned by adjusting $m$ (bit array size) and $k$ (number of hash functions).

No Deletion: Standard Bloom Filters do not support deletion (setting a bit to 0 could affect other elements). Counting Bloom Filters replace each bit with a counter to support deletion.

4. Real-World Use Cases

Google Chrome: Uses a Bloom Filter to check if a URL is in its list of known malicious websites. If the Bloom Filter says "probably yes", Chrome then queries a remote server for confirmation.

Apache Cassandra: Uses Bloom Filters to avoid unnecessary disk reads. Before reading a file (SSTable) from disk, it checks the Bloom Filter. If the key is "definitely not" in that file, the disk read is skipped entirely.

Medium: Uses Bloom Filters to avoid recommending articles a user has already read.

Akamai CDN: Uses Bloom Filters to avoid caching one-hit-wonder content. An object is only cached after the second request.