Introduction and Implementation of Bloom Filter
It often happens with many of us that when we create an account on some applications like Github, it shows us that the username already exists. You can add some unique things to your username; only then will it be registered. It happens to all of us.
Now it is time to discuss the logic behind this thing. It is logical that if there is more than one person with the same user name, it will be a very problematic situation for the software to manage things. So, we have to use different usernames, but how does the software find out so fast whether the same username exists or not? Maybe you can think of searching algorithms. If you are familiar with the searching algorithm, then you know that it will be very problematic if we use linear search in this case. We can think of binary search but just think of some applications which have millions of users. It will be very bad for them to take 10 or 20 minutes to warn one user about his user name. So here comes the concept of the bloom filter. Hope you are able to understand the background scenario. Now we will discuss in detail about the bloom filter in the rest of this article. Keep reading to learn more about this topic.
What is a Bloom Filter?
To understand bloom filters, you must first understand hashing. A hash function takes input and returns a unique identifier of a fixed length that can be used to identify the input. An element's membership in a set can be determined using a space-effective probabilistic data structure called a Bloom filter. Efficiency comes at a price because it is probabilistic in nature, which implies that some False Positive outcomes could occur. False positive refers to the possibility of showing that a specific username is already taken when it is not.
Properties of Bloom Filter
- Unlike a traditional hash table, a Bloom filter with a fixed size can represent a set with an arbitrary high number of items.
- Bloom filters never give erroneous negative results, such as claiming a username doesn't exist when it actually does.
- Adding a component is always a smart move. However, the false positive rate increases progressively when more elements are added, and when all of the filter's bits are set to 1, all queries get positive results.
- Since deleting a single element by clearing bits at indices created by k hash functions may result in the deletion of a few additional items, it is not possible to delete elements from the filter. For instance, if we remove "world" by erasing bits at positions 1, 4, and 7, we might also remove "hero" because the bloom filter asserts that "hero" is not present when bit at position 4 is set to 0.
Implementation of Bloom Filter
#include <bits/stdc++.h>
using namespace std;
int hashfun1(string s, int arrSize)
{
long long int hashvar = 0;
for (int i = 0; i < s.size(); i++)
{
hashvar = (hashvar + ((int)s[i]));
hashvar = hashvar % arrSize;
}
return hashvar;
}
int hashfun2(string s, int arrSize)
{
long long int hashvar = 1;
for (int i = 0; i < s.size(); i++)
{
hashvar = hashvar + pow(19, i) * s[i];
hashvar = hashvar % arrSize;
}
return hashvar % arrSize;
}
int hashfun3(string s, int arrSize)
{
long long int hashvar = 7;
for (int i = 0; i < s.size(); i++)
{
hashvar = (hashvar * 31 + s[i]) % arrSize;
}
return hashvar % arrSize;
}
int hashfun4(string s, int arrSize)
{
long long int hashvar = 3;
int p = 7;
for (int i = 0; i < s.size(); i++) {
hashvar += hashvar * 7 + s[0] * pow(p, i);
hashvar = hashvar % arrSize;
}
return hashvar;
}
bool lookup(bool* arraywithbit, int arrSize, string s)
{
int a = hashfun1(s, arrSize);
int b = hashfun2(s, arrSize);
int c = hashfun3(s, arrSize);
int d = hashfun4(s, arrSize);
if (arraywithbit[a] && arraywithbit[b] && arraywithbit[c]
&& arraywithbit[d])
return true;
else
return false;
}
void insert(bool* arraywithbit, int arrSize, string s)
{
if (lookup(arraywithbit, arrSize, s))
cout << s << " is Probably already present" << endl;
else
{
int a = hashfun1(s, arrSize);
int b = hashfun2(s, arrSize);
int c = hashfun3(s, arrSize);
int d = hashfun4(s, arrSize);
arraywithbit[a] = true;
arraywithbit[b] = true;
arraywithbit[c] = true;
arraywithbit[d] = true;
cout << s << " inserted" << endl;
}
}
int main()
{
bool arraywithbit[100] = { false };
int arrSize = 20;
string usname[8]
= { "eat","tea","tan","ate","nat","tea","tan","bat"
};
for (int i = 0; i < 8; i++) {
insert(arraywithbit, arrSize, usname[i]);
}
return 0;
}
Output:
eat inserted
tea inserted
tan inserted
ate inserted
nat inserted
tea is Probably already present
tan is Probably already present
bat inserted