What are “target error of similarity estimate” and “minimum similarity of interest”? When n_minhash_bits = 0 the calculation is easy, but in my case I will be doing intersections and need to control the maximum error, so I need n_minhash_bits > 0. For example, if I need to be 99% confident that a 1% overlap will be reported with a maximum error of 0.1%, what values should e and t have?
If you need 99% confidence, then let the error (e) be 0.01. Since you are interested in intersections with 1% overlap, let the minimum similarity (t) be 0.01.
Thank you for your reply @kporter! Is e the confidence or the max error? I’m asking because I’m testing with a simple case, using 16 index_bits and 31 minhash_bits:
Load 7,092,567 unique strings into HLL A.
Load 35,629,265 unique strings into HLL B.
Intersect HLL A with HLL B: the true intersection is 825,928, but Aerospike returns 839,050. That’s a 1.59% error. I was expecting less with the number of bits I’m using, or am I misunderstanding something?
On some other intersections I get errors above 3%, and the more HLLs I intersect, the bigger the errors get (that’s expected). Is there any way to design for a max error?
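For reference, here is roughly how I’m running the test. This is a minimal sketch using the Go client’s HLL operations; the namespace, set, bin names, and UUID values are placeholders, and the exact signatures may differ across client versions:

```go
package main

import (
	"fmt"

	as "github.com/aerospike/aerospike-client-go/v6"
)

func main() {
	client, err := as.NewClient("127.0.0.1", 3000)
	if err != nil {
		panic(err)
	}
	defer client.Close()

	const indexBits, minhashBits = 16, 31

	keyA, _ := as.NewKey("test", "hlls", "A")
	keyB, _ := as.NewKey("test", "hlls", "B")

	// Load the unique strings into each HLL (batched in my real test).
	valuesA := []as.Value{as.NewValue("uuid-1"), as.NewValue("uuid-2")} // ... 7,092,567 strings
	valuesB := []as.Value{as.NewValue("uuid-2"), as.NewValue("uuid-3")} // ... 35,629,265 strings

	policy := as.DefaultHLLPolicy()
	if _, err := client.Operate(nil, keyA,
		as.HLLAddOp(policy, "hll", valuesA, indexBits, minhashBits)); err != nil {
		panic(err)
	}
	if _, err := client.Operate(nil, keyB,
		as.HLLAddOp(policy, "hll", valuesB, indexBits, minhashBits)); err != nil {
		panic(err)
	}

	// Read HLL B, then ask A for the intersect count server-side.
	recB, err := client.Operate(nil, keyB, as.GetBinOp("hll"))
	if err != nil {
		panic(err)
	}
	hllB, ok := recB.Bins["hll"].(as.HLLValue)
	if !ok { // some client versions return a plain []byte
		hllB = as.HLLValue(recB.Bins["hll"].([]byte))
	}

	recA, err := client.Operate(nil, keyA,
		as.HLLGetIntersectCountOp("hll", []as.HLLValue{hllB}))
	if err != nil {
		panic(err)
	}
	fmt.Println("estimated intersection:", recA.Bins["hll"])
}
```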
Do you know what the “relative error” is measured relative to? I can’t quite understand the paper; it mentions:
> First we note that trivially, HyperLogLog union cardinalities can be used to compute intersection cardinalities and thus Jaccard index using the inclusion-exclusion principle. Unfortunately, the relative error is then in the size of the union (as opposed to the size of the Jaccard index for MinHash).
So for HyperMinHash the relative error is not relative to the union of both sets; is it relative to the intersection size? I’m not sure what the “size of the Jaccard index” would be.
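For context on why this matters: with inclusion-exclusion, |A ∩ B| = |A| + |B| − |A ∪ B|, so an error relative to the union lands, in absolute terms, on the intersection. A quick back-of-the-envelope with my numbers above (assuming, purely for illustration, a 1% union error):

```go
package main

import "fmt"

func main() {
	// True cardinalities from my test above.
	a, b, inter := 7_092_567.0, 35_629_265.0, 825_928.0
	union := a + b - inter // inclusion-exclusion: |A ∪ B| = |A| + |B| - |A ∩ B|

	// Suppose the union estimate is off by 1% (illustrative, not measured).
	unionErr := 0.01 * union

	// Via |A ∩ B| = |A| + |B| - |A ∪ B|, that error lands on the intersection.
	fmt.Printf("union: %.0f\n", union)                   // ~41.9M
	fmt.Printf("absolute union error: %.0f\n", unionErr) // ~419k
	fmt.Printf("relative intersection error: %.0f%%\n",
		100*unionErr/inter) // ~51% of the 826k intersection
}
```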
Here are a few slides on setting these bits; perhaps this helps. You have to have some idea of how similar your two HLLs are to select the desired minHash bits that give the desired error in the intersection-count estimate. The more similar the sets are, the fewer minHash bits you can get away with for the same estimated error in the intersection count.
I tried to replicate your test. I find that there is no specific trend when increasing minHash bits beyond a good estimated value. Also, the error changes based on what the unique strings in the original data are.
Here are my test results, closely approximating your test parameters:
@pgupta I tested with 15 index_bits and 49 minhash_bits; this way the HLL is smaller and I get a better error when doing multiple intersections.
The inputs to the HLLs are UUID strings. On some intersections of two HLLs I get below 1-2% error, while on others I’m at around 10%; the same goes when doing 4 or 5 intersections. It obviously depends on the intersection size and the size of each HLL being intersected.
Testing with another HLL+MinHash (not HyperMinHash) implementation (with a size of 900 KB), I get all intersection errors below 5%; obviously, it’s much larger. I don’t know whether the limits on minhash_bits and on index_bits + minhash_bits are Aerospike-specific or algorithm-specific, but it would be nice to be able to use more minhash_bits for my use case.
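For comparison, here is my rough sizing for the two configurations, assuming each of the 2^index_bits registers stores 6 HLL bits plus the minhash bits; I haven’t verified Aerospike’s exact layout or per-bin overhead, so treat this as an estimate:

```go
package main

import "fmt"

// Rough HLL bin size: 2^indexBits registers of (6 + minhashBits) bits each.
// This ignores any header/overhead and may not match Aerospike's exact layout.
func hllBytes(indexBits, minhashBits int) int {
	return (1 << indexBits) * (6 + minhashBits) / 8
}

func main() {
	fmt.Println(hllBytes(16, 31)) // 303,104 bytes (~296 KiB)
	fmt.Println(hllBytes(15, 49)) // 225,280 bytes (~220 KiB)
}
```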
I repeated the shorter-unique-string version of the test with your settings (15/49) … it gets worse for me. So it looks like it is highly dependent on the original data too.
Yes, the hashing algorithm also has a high impact. When using the HLL+MinHash implementation I was talking about, switching between Murmur3, XXH3, and MD5 drastically changes the error, at least in my case; MD5 is the best one for me.
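In case it’s useful, this is roughly how I swap hash functions in that implementation. It’s a simplified sketch: in my real code Murmur3 and XXH3 come from third-party packages, so I’m only showing the stdlib MD5 variant here:

```go
package main

import (
	"crypto/md5"
	"encoding/binary"
	"fmt"
)

// hash64 is the pluggable hook: any function mapping an item to a
// 64-bit value can be dropped in (Murmur3, XXH3, MD5, ...).
type hash64 func(s string) uint64

// md5Hash64 folds an MD5 digest down to 64 bits. MD5 is slow but, in my
// tests, gave the best-behaved estimates on UUID inputs.
func md5Hash64(s string) uint64 {
	sum := md5.Sum([]byte(s))
	return binary.BigEndian.Uint64(sum[:8])
}

func main() {
	var h hash64 = md5Hash64 // swap in murmur3/xxh3 here to compare error
	fmt.Printf("%x\n", h("123e4567-e89b-12d3-a456-426614174000"))
}
```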
I’m sure the answer is no, but is there any way to get more than 51 minhash_bits, or more than 64 bits of minhash_bits + index_bits?
I’ll take being able to “increase the number of minhash bits beyond 51” as a feature request. How many bits do you think you would need? Would hll-bits + minhash-bits <= 128 be enough?
@pgupta is correct, we use murmur3_x64_128 as the hashing function. The first 64 bits hold the index bits and minhash bits; the second 64 bits are used to calculate the value for the HLL bucket.
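Schematically, something like this. This is a simplified sketch to illustrate the split, not the server’s actual code; the real bit layout is internal and may differ:

```go
package main

import (
	"fmt"

	"github.com/spaolacci/murmur3" // murmur3_x64_128
)

// Illustrative split of the 128-bit murmur3 digest, per the description
// above: index and minhash bits are taken from the first 64 bits, and the
// HLL register value (leading-zero count) from the second 64 bits.
func sketch(element []byte, indexBits, minhashBits uint) {
	h1, h2 := murmur3.Sum128(element)

	index := h1 >> (64 - indexBits) // HLL register index
	minhash := (h1 >> (64 - indexBits - minhashBits)) &
		((1 << minhashBits) - 1) // minhash value
	rho := uint(1) // HLL register value: position of the first set bit
	for ; h2&(1<<63) == 0 && rho < 64; rho++ {
		h2 <<= 1
	}

	fmt.Println(index, minhash, rho)
}

func main() {
	sketch([]byte("example"), 16, 31)
}
```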
I’d be curious how wyhash would rank. We’ve started using it internally (not for HLL, of course); it is significantly faster than the previous hashing algorithms we were using, and it seems to be a decent hash.
Yes, IMHO that would help in cases where there are lots of intersections. As a side note, I think adding the ThetaSketch framework could be even more useful: it allows getting a resultant sketch not only for unions, but also for intersections and AnotB operations. It’s extremely powerful. I’ve successfully ported it from Java to Go and it works great.
I will make some time to test it if you are interested!