What are “target error of similarity estimate” and “minimum similarity of interest”? When n_minhash_bits = 0 the calculation is easy but in my case, I will be doing intersections and need to control the maximum error so I need n_minhash_bits > 0. For example, If I need to be 99% confident that a 1% overlap will be reported with a maximum error of 0.1%, what values should e and t have?
Thank you for your reply @kporter! e would be the confidence or the max error? I’m asking because I’m testing with a simple case, I’m using 16 index_bits and 31 minhash_bits.
Load 7092567 unique strings into HLL A
Load 35629265 unique strings into HLL B
Intersect HLL A with HLL B, the intersection should be 825928, and Aerospike returns 839050, that’s a 1.59% error I was expecting less with the number of bits I’m using or I’m misunderstanding something?
On some other intersections, I get errors above 3% and the more HLLs intersect, the bigger the errors get (that’s expected), is there any way to design for a max error?
Do you know relative to what is the “relative error” measured? I can’t quite understand the paper, it mentions:
First we note that trivially, HyperLogLog union cardinalities
can be used to compute intersection cardinalities and thus
Jaccard index using the inclusion-exclusion principle. Unfortunately, the relative error is then in the size of the union> (as opposed to the size of the Jaccard index for MinHash)
So for HyperMinHash the relative error is not relative to the union of both sets, is it relative to the intersection size? I’m not sure what would be the size of the Jaccard Index
Here are a few slides on setting these bits. Perhaps this helps. You have to have some idea of how similar your two HLLs are, to select the the desired minHash bits, that give desired error estimate in intersection set cardinality. The more similar the sets are, the lower the minHash bits you can get away with for same estimated error in intersection count estimate.
I tried to replicate your test. I find that there is no specific trend with increasing minHash bits beyond the good estimated value. Also, the error changes based on what your unique strings are in the original data.
Here are my test results - closely approximating your test parameters:
@pgupta I tested with 15 index_bits and 49 minhash_bits, this way the HLL is smaller and I get a better error when doing multiple intersections.
The input to the HLLs are UUID strings, on some intersections of two HLLs I get below 1-2% error while on others I’m at around 10%, the same when doing 4 or 5 intersections, it obviously depends on the intersections size and the size of each HLL being intersected.
Testing with another HLL+MinHash (not HyperMinHash) implementation (with a size of 900KB) I get all intersection errors below 5%, obviously, it’s much larger. I don’t know if the limits on minhash_bits and index_bits+minhash_bits are Aerospike or algorithm specific but it would be nice to be able to use more minhash_bits for my use case.
Yes, the hashing algorithm also has a high impact. When using the HLL+MinHash implementation I was talking about, changing between Murmur3, XXH3, and MD5 drastically changes the error at least in my case, MD5 is the best one for me.
I’m sure the answer is not but, is there any way to add more minhash_bits over 51 and over 64 bits of minhash_bits+index_bits?