Get Aerospike hyperLogLog(HLL) intersection count of multiple HLL unions

Also posted to StackOverflow

I have 2 or more HLLs that are unioned, I want to get the intersection count of that unions. I have used the example from here hll-python example Following is my code

ops = [hll_ops.hll_get_union(HLL_BIN, records)]
_, _, result1 = client.operate(getKey(value), ops)

ops = [hll_ops.hll_get_union(HLL_BIN, records2)]
_, _, result2 = client.operate(getKey(value2), ops)

ops = [hll_ops.hll_get_intersect_count(HLL_BIN, [result1[HLL_BIN]] + [result2[HLL_BIN]])]
_, _, resultVal = client.operate(getKey(value), ops)
print(f'intersectAll={resultVal}')
_, _, resultVal2 = client.operate(getKey(value2), ops)
print(f'intersectAll={resultVal2}')

I get 2 different results when I use different keys for the intersection using hll_get_intersect_count, i.e resultVal and resultVal2 are not same. This does not happen in the case of union count using function hll_get_union_count. Ideally the value of intersection should be the same.
Can any one tell me why is this happening and what is the right way to do it?

I’m not spotting a reason in the server code for why rearranging 3 HLLs in an intersect call should produce different results.

Other than re-arranging the HLLs in the intersect call, did you make any other modifications to the example?

I’ve attempted to reproduce this issue without success. The order in which HLLs appear in a intersection doesn’t seem to matter in my tests.

Well, I had noticed but overlooked the significance of using an individual key in your intersections.

You are basically intersecting something like

INTERSECT(
    HLL0,
    UNION(HLL0, HLL1, HLL2),
    UNION(HLL3, HLL4, HLL5

Which can be at most COUNT(HLL0)

and

INTERSECT(
    HLL1,
    UNION(HLL0, HLL1, HLL2),
    UNION(HLL3, HLL4, HLL5))

Which can be at most COUNT(HLL1)

These two operations are different.

I assume what you actually want to do is `INTERSECT( UNION0, UNION1 ) - I suggest using an operation-expression (specifically an expression_read() in python) for this task.

Either a pilot error or the python client seems to have an issue with passing an hll as an binexpr - so the expression is a bit more complicated than it needed to be.

Working example can be found here, below is modified for this particular application.

    ops = [
        expops.expression_read(
            "intersect",
            # Why does the python client fail here?
            # exp.hll.HLLGetIntersectCount(
            #     [value], value2
            # ).compile()
            exp.hll.HLLGetIntersectCount(
                [value],
                exp.HLLGetUnion(
                    [value2]],
                    exp.HLLInit(
                        None, HLL_INDEX_BITS, 0, HLL_BIN
                    )
                )
            ).compile()
        )
    ]

    val = client.operate(getkey(tags[0], month), ops)[2]["intersect"]
    print(f"val:{val}")

Not sure if it matters for repeatability, but the link to python example uses 0 minHash bits - not the best choice for intersection count.

True, guess I should have also suggested that - the description implies 0 minHashBits and I was more interested in reproducing the described issue.

Was able to figure out the solutions (with the help of Aerospike support) Intersection of HLLs is not supported in Aerospike. However, If I am to get intersection of multiple HLLs I will have to save one union into aerospike and then get intersection count of one vs the rest of the union. The key we provide in client.operate function for hll_get_intersect_count is used to get the intersection with the union Following is the code I came up with

ops = [hll_ops.hll_get_union(HLL_BIN, records)]
_, _, result1 = client.operate(getKey(value), ops)

ops = [hll_ops.hll_get_union(HLL_BIN, records2)]
_, _, result2 = client.operate(getKey(value2), ops)

# init HLL bucket
ops = [hll_ops.hll_init(HLL_BIN, NUM_INDEX_BITS, NUM_MH_BITS)]
_, _, _ = client.operate(getKey('dummy'), ops)
# get set union and insert to inited hll bucket and save it for 5 mins(300 sec)
# use hll_set_union to save the HLL into aeropike temporarily
ops = [hll_ops.hll_set_union(HLL_BIN, records2)]
_, _, _ = client.operate(getKey('dummy'), ops, meta={"ttl": 300})

ops = [hll_ops.hll_get_intersect_count(HLL_BIN, [result1[HLL_BIN]])]
_, _, resultVal = client.operate(getKey('dummy'), ops)
print(f'intersectAll={resultVal}')

For more reference, you can look here for hll_set_union reference

To summarize, I want to understand your issue. You have 3 HLLs - A, B and C and you want to find their intersection count. count of ( A int B int C)

Aerospike lets you make a new HLL: D = A U B. (Operation getUnion()) Aerospike also lets you find intersection count of any two HLLs. getIntersectCount(), (updated -->) up to 3 HLLs if minhash bits are zero, more than 3 if minHash bits are set. (4 min, 51 max if specified, 0 default)

But (A intersection B intersection C), is not same as: (( A U B) intersection C.) … not sure I understand how you solved your problem.

IMG_7636

HLL intersections are definitely supported for up to 3 HLLs. There is quite a bit of error in HLL intersection estimates. If you require more accuracy then you can trade additional space for that accuracy by using the minhash bits which makes a HyperMinHash (HMH). Additionally, unlike HLL, HMH intersections are not limited to 3 HMHs.

The example was created before read expressions existed. Today, instead of creating/deleting a bin, I would use a read expression, as I demonstrated above. Unfortunately, due to an issue with the python client, currently it is more complicated than it should be.

Hi @pgupta,
I have around 11 groups (represented by A, B, C, etc) each group is an union of multiple HLLs(represented by A1, B1, C1, etc)

A = A1 ∪ A2 ∪ A3 ∪ ...
B = B1 ∪ B2 ∪ B3 ∪ ...
C = C1 ∪ C2 ∪ C3 ∪ ...
.
.
.

what I need is R = A ∩ B ∩ C ∩ ....
Hence what I did was store A in aerospike as ‘dummy’ and get union of B ∪ C ∪ D ∪ ... this is the result1 value in the code above

Finally, I get the answer by getting an intersection count of dummy vs rest, also I needed a ballpark figure not an exact estimate hence this seems pretty good estimate to me

@kporter I found expressions pretty complicated to understand hence, I decided to not use them

I thought you should do: getIntersectCount( A, List[B, C, D ...]) – where A is put in an HLL bin, and B, C, D are HLL values in a List. (Assuming B, C, D are not built with minHashBits = 0). The B u C u D … even if it works, I wonder if its the correct answer if you do getIntersectCount( A, List[ (B u C u D) ] ) …i.e. list with one item.

Yeah I can try that, but how is one different than the other

If minHashBits are 0, then List[B, C] … can have max two items. How is it different … see my sketch above. C int (A U B) … gives the larger red shaded area vs C int A int B gives smaller red area.

I have minHash 5

Great … then you can have as many items in the List as you want. (There will be some message limits in MiBs… theoretically … that you will never hit.)

Just so that I understand you better I should do the following?

ops = [hll_ops.hll_get_intersect_count(HLL_BIN, [B1,B2,B3,...C1,C2,C3,...])]
_, _, resultVal = client.operate(getKey('A'), ops)

wouldn’t HLL Union do the same instead of just specifying them in the list?

That depends on what you want. If you really want A int A1 int A2 int B int B1 int B2 …then yes, each item should be in the list.

May be this figure for A int B int C … will help. This is basically what the server computes underneath. IMG_7637

When you do (A U B) int C ==> you get 10, 5, 6, 9 – count = 4. When you do A int B int C ==> you get 5, 6 – count = 2.

What I really want is A int B int C…(i.e. intersection of groups rather than candidate HLLs)

Well if thats what the problem needs, then treat each group union, as a list item. So then yeah, (A, List [B, C, D, …]) where A = A1 u A2 u A3 … and B = B1 u B2 u B3 … and so on. :+1: