Apply HLL Union with externally provided HLL

Hi! I have a use case where I’m receiving a lot of ids at a very high rate and updating an HLL (with MinHash) key in Aerospike, the problem is it’s causing hot key errors due to the incoming speed of the events.

I’m now looking to generate an in-memory HLL that will then be unioned with the HLL in Aerospike. Apparently, Aerospike uses the HyperMinHash algorithm described here and I’ve found some implementations in go, Java and Python (GitHub - axiomhq/hyperminhash: HyperMinHash: Bringing intersections to HyperLogLog, GitHub - LiveRamp/HyperMinHash-java: Union, intersection, and set cardinality in loglog space, GitHub - mbrg/py-hyperminhash: HyperLogLog with intersection).

Is it possible to know how the HLL (with MinHash) is stored in the byte array Aerospike returns so I can make HLL operations outside Aerospike? Do you provide any example implementation for this?

Thank you!

Since it sounds like you are planning to do some form of coalescing of these writes, have you already tried/considered coalescing by sending many keys to the HLL’s add API?

The implementation of HMH can be found here: https://github.com/aerospike/aerospike-server/blob/b550445b56a0f8d6877d331ee0c0ebd4abbec3b3/as/src/base/particle_hll.c#L1475. Above this line is mostly protocol handling IIRC. The struct is defined here: https://github.com/aerospike/aerospike-server/blob/b550445b56a0f8d6877d331ee0c0ebd4abbec3b3/as/src/base/particle_hll.c#L112. The flags byte is currently unused - it is there in case a feature in introduced into this structure in the future that would be incompatible with the current layout - nothing is currently planned, but I’d recommend that you also verify that the flags byte is 0 to prevent the possibility of operating on an incompatible structure in the future.

You can also write to many different HLLs stored on different records/keys. From there, you have a couple options to get the aggregate count:

  1. You could read all the HLL objects and supply them to the get_union_count API.
  2. You could periodically roll combine and store these HLLs to one of the keys using the set_union. Likely in such a use case, it would also be appropriate to use the refresh_count API to speed up subsequent calls for count to constant time.

Hi @kporter thank you so much for your reply! dawdw

Since it sounds like you are planning to do some form of coalescing of these writes, have you already tried/considered coalescing by sending many keys to the HLL’s add API?

Yes, the issue is it’s so much data that I’m having trouble storing it all in RAM to send all the items to Aerospike at once

The implementation of HMH can be found here: https://github.com/aerospike/aerospike-server/blob/b550445b56a0f8d6877d331ee0c0ebd4abbec3b3/as/src/base/particle_hll.c#L1475 . Above this line is mostly protocol handling IIRC. The struct is defined here: https://github.com/aerospike/aerospike-server/blob/b550445b56a0f8d6877d331ee0c0ebd4abbec3b3/as/src/base/particle_hll.c#L112 . The flags byte is currently unused - it is there in case a feature in introduced into this structure in the future that would be incompatible with the current layout - nothing is currently planned, but I’d recommend that you also verify that the flags byte is 0 to prevent the possibility of operating on an incompatible structure in the future.

This really valuable information, thank you!

You can also write to many different HLLs stored on different records/keys. From there, you have a couple options to get the aggregate count:

This is a great option, I don’t know why I didn’t think of that before. I will try this before going the in-memory HLL route, thank you!