The following introduces several hash applications that you may not encounter often: simHash, GeoHash, and the Bloom filter.
5.1 simHash
simHash is a method Google uses to deduplicate massive volumes of text. It is a locality-sensitive hash. What is locality sensitivity? Suppose two strings have a certain similarity; if that similarity is still preserved after hashing, the hash is called locality-sensitive. Ordinary hashes do not have this property.
The idea of the simHash algorithm is roughly as follows:
Perform keyword extraction on the document (including word segmentation and weight calculation) to obtain n pairs of (keyword, weight), i.e., the multiple (feature, weight) pairs in the figure. Denote them as feature_weight_pairs = [fw1, fw2 … fwn], where fwi = (feature_i, weight_i).
Hash each feature in feature_weight_pairs, then accumulate the resulting hash_weight_pairs vertically, bit by bit: if a bit is 1, add the weight; if it is 0, subtract the weight. This produces bits_count numbers; each position whose sum is greater than 0 is marked 1, and each position whose sum is less than 0 is marked 0.
Finally, this is packed into a 64-bit fingerprint. To judge duplication, it is only necessary to check whether the Hamming distance between two fingerprints is < n (empirically, n is generally 3); if so, the two documents can be judged similar.
SimHash calculation process
As shown in the figure below, when only one word differs between two texts, an ordinary hash produces two drastically different results, while the locality sensitivity of SimHash means only part of the fingerprint changes.
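The steps above can be sketched in Python. This is a minimal illustration, not Google's production implementation; taking the 64-bit feature hash from the first 8 bytes of an MD5 digest is an arbitrary choice made here for determinism.

```python
import hashlib


def _hash64(feature: str) -> int:
    # 64-bit hash of a feature (first 8 bytes of MD5; any stable 64-bit hash works)
    return int.from_bytes(hashlib.md5(feature.encode()).digest()[:8], "big")


def simhash(feature_weight_pairs):
    """feature_weight_pairs: iterable of (feature, weight) tuples."""
    v = [0] * 64
    for feature, weight in feature_weight_pairs:
        h = _hash64(feature)
        for i in range(64):
            # vertical accumulation: +weight if the bit is 1, -weight if it is 0
            v[i] += weight if (h >> i) & 1 else -weight
    fingerprint = 0
    for i in range(64):
        if v[i] > 0:  # positive sum -> bit 1, otherwise bit 0
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    # number of differing bits between two fingerprints
    return bin(a ^ b).count("1")
```

Two documents would then be judged similar when `hamming_distance` of their fingerprints is below the chosen threshold (3 in the text above).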
5.2 GeoHash
GeoHash treats the Earth's surface as a two-dimensional plane and recursively decomposes it into sub-blocks; within a certain latitude/longitude range, every point in a sub-block shares the same code. Take the following figure as an example. All points (latitude and longitude coordinates) in this rectangular area share the same GeoHash string, which not only protects privacy (it represents only the approximate region rather than a specific point) but also makes caching easier.
Let’s work through an example of this algorithm by encoding the latitude 39.3817 via successive bisection:
The latitude interval of the Earth is [-90, 90]. Divide it into a left interval [-90, 0) and a right interval [0, 90]. 39.3817 falls in the right interval, so it is marked 1.
Continue dividing the right interval [0, 90] into a left interval [0, 45) and a right interval [45, 90]. 39.3817 falls in the left interval, so it is marked 0.
Recursing on the above process, the interval [a, b] approaches 39.3817 more closely with each iteration. The number of recursions determines the length of the resulting bit sequence.
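The bisection steps above can be sketched as a short function; the bit values match the two steps just described (right half marks 1, left half marks 0):

```python
def bisect_bits(value, lo, hi, n_bits):
    """Recursively halve [lo, hi]; emit 1 when value falls in the right half."""
    bits = []
    for _ in range(n_bits):
        mid = (lo + hi) / 2
        if value >= mid:
            bits.append(1)  # right interval
            lo = mid
        else:
            bits.append(0)  # left interval
            hi = mid
    return bits


# First five latitude bits of 39.3817
print(bisect_bits(39.3817, -90.0, 90.0, 5))  # → [1, 0, 1, 1, 1]
```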
Do the same for longitude. Then interleave the two bit strings into one, with the even positions holding longitude bits and the odd positions holding latitude bits. The combined bit string is grouped and converted via base32 encoding into a hash value such as WX4ER.
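Putting the pieces together (bisection, interleaving, base32 grouping) gives a small encoder. This is a sketch of the standard GeoHash scheme, checked against the widely cited reference point (42.605, -5.603) → "ezs42"; the alphabet below is the standard GeoHash base32 alphabet.

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard GeoHash alphabet (no a, i, l, o)


def geohash(lat, lon, length=5):
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    bits = []
    use_lon = True  # even bit positions hold longitude, odd hold latitude
    while len(bits) < length * 5:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits.append(1); lon_lo = mid
            else:
                bits.append(0); lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits.append(1); lat_lo = mid
            else:
                bits.append(0); lat_hi = mid
        use_lon = not use_lon
    # group every 5 bits into one base32 character
    return "".join(_BASE32[int("".join(map(str, bits[i:i + 5])), 2)]
                   for i in range(0, len(bits), 5))


print(geohash(42.605, -5.603))  # → ezs42
```

A longer `length` yields a longer code and a smaller, more precise rectangle; truncating a code widens the region, which is what enables the privacy and caching properties mentioned above.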
5.3 Bloom filter
Bloom filters are widely used in blacklist filtering, spam filtering, crawler deduplication systems, and for mitigating cache penetration. For small data volumes and sufficient memory, a HashMap or HashSet can be used directly. However, if the amount of data is very large (for example, a 5 TB hard disk full of user participation data from which we need the number of distinct participating users), an algorithm is needed to deduplicate the data within limited memory. In this case, the Bloom filter is a better solution.
The Bloom filter is an application of the bitmap, proposed by Burton Bloom in 1970. It consists of a long binary vector and a series of random mapping functions, and is used to test whether an element is in a set. Its advantages are space efficiency and query time far better than general algorithms; its disadvantages are a certain false positive rate and the difficulty of deleting elements. It is mainly used for big-data deduplication, spam filtering, and recording crawled URLs. The core idea is to let a single bit be shared by multiple elements, thereby reducing memory consumption: multiple hash functions compute multiple values for each datum, and the corresponding positions in the bitmap are set.
The principle of the Bloom filter
In the example, after the data a, b, and c are each hashed three times, the corresponding bits are all set to 1. When d is mapped, one of the results is 0, which shows that d has definitely not appeared. A Bloom filter has a false positive problem (an element judged to exist may in fact not exist) but no false negative problem (an element judged not to exist definitely does not exist). That is, for the data e, all three mapped bits are 1, yet e may never have appeared.
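A minimal sketch of the idea in Python, using k salted SHA-256 hashes over an m-bit bitmap (both choices are illustrative; real implementations typically use faster hashes such as MurmurHash and a dedicated bit array):

```python
import hashlib


class BloomFilter:
    def __init__(self, m_bits=65536, k=3):
        self.m = m_bits
        self.k = k
        self.bits = 0  # a Python int used as an m-bit bitmap

    def _positions(self, item: str):
        # k salted hashes -> k bit positions in [0, m)
        for salt in range(self.k):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item: str) -> bool:
        # False means definitely absent; True means possibly present
        return all((self.bits >> pos) & 1 for pos in self._positions(item))
```

Note that `might_contain` can only answer "definitely not" or "possibly yes", which is exactly the false-positive/no-false-negative behavior described above.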
The required space and error rate are related (approximately) by m = -n * ln(p) / (ln 2)^2, where p is the false positive rate, n is the number of elements to accommodate, and m is the required storage space in bits. It can be seen from the formula that the length of the Bloom filter directly affects the false positive rate: the longer the Bloom filter, the smaller the false positive rate. The number of hash functions also needs to be weighed. The more hash functions, the faster the Bloom filter's bits are set to 1 and the lower its efficiency; but too few hash functions cause the false positive rate to rise. The optimal count is approximately k = (m / n) * ln 2.
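The trade-off can be made concrete with these standard sizing formulas (a textbook approximation, not an exact bound):

```python
import math


def bloom_parameters(n, p):
    """Approximate optimal bit count m and hash count k
    for n elements at target false positive rate p."""
    m = -n * math.log(p) / (math.log(2) ** 2)
    k = (m / n) * math.log(2)
    return math.ceil(m), math.ceil(k)


# e.g. 100 million elements at a 1% false positive rate
m, k = bloom_parameters(100_000_000, 0.01)
print(m, k)  # roughly 9.6e8 bits (about 120 MB) and 7 hash functions
```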