How to calculate HLL n_minhash_bits and n_index_bits

dragondgold · March 4, 2023, 1:21am

Hi, I would like to know how exactly I should calculate how many bits I need for HyperMinHash for some given error rate. The documentation says:

What are “target error of similarity estimate” and “minimum similarity of interest”? When n_minhash_bits = 0 the calculation is easy but in my case, I will be doing intersections and need to control the maximum error so I need n_minhash_bits > 0. For example, If I need to be 99% confident that a 1% overlap will be reported with a maximum error of 0.1%, what values should e and t have?

kporter · March 7, 2023, 12:24am

If you need 99% confidence then let the error (e) be 0.01. Since you are interested in intersections with 1% overlap then let the minimum similarity (t) be 0.01.

ceil(log2(6/(0.01*0.01))) = 16 minhash_bits
ceil(log2(0.01^-2)) = 14 index_bits

The size of this HMH can be computed as

ceil(((6 + n_minhash_bits) * (2 ^ n_index_bits)) / 8)
Or 45056 bytes.

dragondgold · March 7, 2023, 12:52am

Thank you for your reply @kporter! e would be the confidence or the max error? I’m asking because I’m testing with a simple case, I’m using 16 index_bits and 31 minhash_bits.

Load 7092567 unique strings into HLL A
Load 35629265 unique strings into HLL B
Intersect HLL A with HLL B, the intersection should be 825928, and Aerospike returns 839050, that’s a 1.59% error I was expecting less with the number of bits I’m using or I’m misunderstanding something?

On some other intersections, I get errors above 3% and the more HLLs intersect, the bigger the errors get (that’s expected), is there any way to design for a max error?

Thank you!

kporter · March 7, 2023, 1:23am

From the paper, the above is referred to as the ‘relative error’.

dragondgold · March 7, 2023, 2:30am

Do you know relative to what is the “relative error” measured? I can’t quite understand the paper, it mentions:

First we note that trivially, HyperLogLog union cardinalities can be used to compute intersection cardinalities and thus Jaccard index using the inclusion-exclusion principle. Unfortunately, the relative error is then in the size of the union > (as opposed to the size of the Jaccard index for MinHash)

So for HyperMinHash the relative error is not relative to the union of both sets, is it relative to the intersection size? I’m not sure what would be the size of the Jaccard Index

pgupta · March 7, 2023, 3:58am

Here are a few slides on setting these bits. Perhaps this helps. You have to have some idea of how similar your two HLLs are, to select the the desired minHash bits, that give desired error estimate in intersection set cardinality. The more similar the sets are, the lower the minHash bits you can get away with for same estimated error in intersection count estimate.

dragondgold · March 8, 2023, 1:45am

@pgupta that explanation was great!! This should be on the HLL data type page (HyperLogLog Data Type and Probabilistic Data | Aerospike Documentation), thank you!

pgupta · March 8, 2023, 5:57pm

I tried to replicate your test. I find that there is no specific trend with increasing minHash bits beyond the good estimated value. Also, the error changes based on what your unique strings are in the original data.

Here are my test results - closely approximating your test parameters:

hll_a: 7096320 unique (string) values
hll_b: 35635200 unique (string) values
hll_a INT hll_b = 829440  (actual)

Unique strings were like this: america1british1 america1british2 … etc. generated in loop.

indexbits	minhashbits	intersection cardinality est	error
16	14	815280	1.71%
16	16	822495	0.84%
16	18	828294	0.14%
16	20	831501	-0.25%
16	24	835978	-0.79%
16	30	816166	1.60%
16	32	830866	-0.17%
16	34	816166	1.60%
16	36	809774	2.37%

However, if I change my “unique” string data, make them shorter… unique string: amer1brit1, amer1brit2 … and so on. I get:

16 18 843530 -1.70% (which was earlier 0.14%) …

dragondgold · March 8, 2023, 6:12pm

@pgupta I tested with 15 index_bits and 49 minhash_bits, this way the HLL is smaller and I get a better error when doing multiple intersections.

The input to the HLLs are UUID strings, on some intersections of two HLLs I get below 1-2% error while on others I’m at around 10%, the same when doing 4 or 5 intersections, it obviously depends on the intersections size and the size of each HLL being intersected.

Testing with another HLL+MinHash (not HyperMinHash) implementation (with a size of 900KB) I get all intersection errors below 5%, obviously, it’s much larger. I don’t know if the limits on minhash_bits and index_bits+minhash_bits are Aerospike or algorithm specific but it would be nice to be able to use more minhash_bits for my use case.

pgupta · March 8, 2023, 6:17pm

I repeated the shorter unique string version of the test with your settings: 15 / 49 … gets worse for me. So looks like highly original data dependent too.

16	18	843530	-1.70%	(Shorter unique string values)
15	49	877554	-5.80%	(Shorter unique string values)

and just to complete the argument …

15 49 800612 3.48% (Original -longer- unique string values)

BTW, my error sign is reversed … I did: (actual - est)/actual.

dragondgold · March 8, 2023, 6:24pm

Yes, the hashing algorithm also has a high impact. When using the HLL+MinHash implementation I was talking about, changing between Murmur3, XXH3, and MD5 drastically changes the error at least in my case, MD5 is the best one for me.

I’m sure the answer is not but, is there any way to add more minhash_bits over 51 and over 64 bits of minhash_bits+index_bits?

pgupta · March 8, 2023, 6:25pm

I believe we use murmur3 and 51 is max with 64 sum max… @kporter am i right?

kporter · April 22, 2023, 3:58pm

I’ll take being able to “increase the number of minhash bits beyond 51” as a feature request. How many bits do you think you would need? Would hll-bits + minhash-bits <= 128 be enough?

@pgupta is correct, we use murmur3_x64_128 as the hashing function. The first 64 bits hold the index bits and minhash bits, the second 64 bits used the calculate the value for the hll bucket.

I’d be curious how wyhash would rank - we’ve started using it internally (not for HLL of course), it is significantly faster than the previous hashing algorithms that we were using and it seems to be a decent hash.

dragondgold · April 29, 2023, 3:33pm

Yes, IMHO I think that would help in cases where there are lots of intersections. As a side note, I think adding the Thetasketch Framework could be even more useful, this allows not only getting the resultant HLL for unions, but also for intersections and ANotB operations, it’s extremely powerful. I’ve successfully ported it from Java to Go and it works great.

I will make some time to test it if you are interested!

system · April 28, 2024, 3:34pm

This topic was automatically closed 365 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
HyperLogLog intersection accuracy Feature Discussion	2	1486	May 2, 2020
Get Aerospike hyperLogLog(HLL) intersection count of multiple HLL unions Python Client python	21	1070	February 10, 2023
Apply HLL Union with externally provided HLL go , java , counter	4	541	January 14, 2024
HyperLogLog hash function	2	832	March 19, 2021
Find minimum of secondary index bin	0	573	May 31, 2019

How to calculate HLL n_minhash_bits and n_index_bits

Related topics