Union of many HLLs without retrieving them

I’m currently computing unions of many HLL structures (around 200, each about 200KB) using the get_union method, where I have to specify a key and the list of HLL structures I want to union.

The main issue is that, because I have to pass a list of HLLs to get_union, I have to retrieve a lot of big HLLs from Aerospike first. This saturates the server's 10Gbps network connection and makes the process slow. Is there any way to compute HLL unions across multiple records without retrieving them from Aerospike? I couldn’t find anything in the docs.

Thank you!

Like you want to tell the Aerospike cluster … hey, here is a list of records and their HLL bins … make a union of all these and put it in this other record?

Exactly that! So I don’t have to retrieve the HLL structures from Aerospike

Not off the shelf but we do have the framework to accomplish it with some scripting. I don’t believe anyone has tried it, so take it in that spirit.

You could use the Aggregation API, which invokes a stream UDF written in Lua, but there are three caveats:

  1. You will have to write a stream UDF, in Lua, whose aggregate function implements the HLL union algorithm in Lua. (Not currently written by anyone, AFAIK.) If you do write it, please share it back.

  2. The Aggregation API does not take a list of records. It operates either on all the records in the namespace or on a subset selected by a secondary index. In the stream UDF, however, you can add further filtering logic to whittle the stream down to the records you are interested in, provided your data model allows that. For example: all records where bin1=3 (secondary index query) ==> stream of records, then in udf:filter → only use records where bin2>4 && bin5 == “CA” … and so on.

  3. Finally, a stream UDF returns the HLL union value to the client; you will then have to store it yourself in whichever record you want. Stream UDFs are read-only and cannot modify any record.
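
To make the shape of caveats 1 and 2 concrete, here is a minimal sketch of the logic such a UDF would need: filter the stream, then fold the surviving sketches together by taking the per-bucket maximum, which is how an HLL union is normally defined. It is written in Python purely for illustration (the real UDF would be Lua), and it makes assumptions that are not from this thread: the sketch is modelled as a plain list of register values rather than Aerospike's actual HLL bin encoding, and the record fields `state` and `hll` and both function names are hypothetical.

```python
from functools import reduce
from typing import Dict, Iterable, List

# Hypothetical in-memory form of a sketch: one max-rank value per bucket.
# Aerospike's real HLL bin is an opaque blob and is NOT modelled here.
Registers = List[int]


def hll_union(a: Registers, b: Registers) -> Registers:
    """Union two sketches with the same bucket count: per-bucket maximum."""
    if len(a) != len(b):
        raise ValueError("sketches must have the same number of registers")
    return [max(x, y) for x, y in zip(a, b)]


def union_matching(records: Iterable[Dict], wanted_state: str) -> Registers:
    """Filter a stream of records, then fold their sketches into one."""
    # Filter stage: keep only the records we care about (hypothetical field).
    selected = [r["hll"] for r in records if r["state"] == wanted_state]
    if not selected:
        raise ValueError("no records matched the filter")
    # Aggregate stage: pairwise union over everything that survived.
    return reduce(hll_union, selected)


# Toy stream standing in for the result of a secondary index query.
records = [
    {"state": "CA", "hll": [0, 3, 1, 0]},
    {"state": "NY", "hll": [9, 9, 9, 9]},
    {"state": "CA", "hll": [2, 1, 1, 5]},
]
result = union_matching(records, "CA")
print(result)  # [2, 3, 1, 5]
```

Because the per-bucket maximum is associative and commutative, partial unions computed on different nodes can be merged in any order, which is what makes the approach a plausible fit for a distributed aggregation. Only the single combined sketch would travel back to the client, instead of roughly 200 sketches of 200KB each.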

It would be great to be able to retrieve the union of HLL values from the Aerospike server without sending the “raw” data to the client.