Stream UDF over list of primary keys?


#1

Hello, I have an unusual application where at a certain point I get a list of primary keys. I need to perform some computations using the data stored in a bin of each of those records, together with some data that I provide, in a decentralized way, aggregating the results at the client.

Is there a way, similar to a stream UDF, to do this? My problem is that I cannot set up a stream because I have a list of primary keys rather than the output of a secondary index.
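For reference, the fallback when no stream can be set up is to do the aggregation entirely client-side: batch-read the records by key and fold their bin values together with the provided data. Below is a minimal pure-Python sketch of that pattern; the `dict` standing in for the namespace and the per-key lookup are placeholders for a real batch read (e.g. `get_many` in the Aerospike Python client):

```python
# Sketch of client-side aggregation over a list of primary keys.
# 'store' is a stand-in for the namespace; a real client would issue
# one batch read for all keys instead of per-key lookups.

def aggregate_by_keys(store, keys, extra, combine):
    """Fetch each record by primary key and fold its bin value into an accumulator."""
    total = 0
    for pk in keys:
        record = store.get(pk)
        if record is None:
            continue  # keys with no matching record are skipped
        total = combine(total, record["value"], extra)
    return total

store = {1: {"value": 10}, 2: {"value": 20}, 4: {"value": 40}}
result = aggregate_by_keys(store, [1, 2, 3, 4], 5,
                           lambda acc, v, extra: acc + v * extra)
# keys 1, 2 and 4 exist: 10*5 + 20*5 + 40*5 = 350
```

This moves all record data over the network to the client, which is exactly what a stream UDF avoids, so it only makes sense for modest key lists.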

Thanks


#2

Seems like what you need is a query object constructor which takes the namespace and an array of digests from that namespace as its arguments, or alternatively an array of Key objects from the same namespace? Interesting… I don’t see anything in the API that can help you do that.

I am thinking of one approach you can take. Add a ‘tag’ bin with a default value of 0 to all records; then, for the array of keys you have, set it to a value (say 1). Build a numeric SI on tag and run a query with the predicate tag == 1, then do your stream UDF (a stream UDF cannot change the record). After the UDF is done with its aggregation, set the tag bin back to 0 so you don’t pollute your record set for the future.
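The tag-then-query-then-reset flow above can be sketched in miniature. This is a pure-Python simulation over an in-memory record set: the `tag` field plays the role of the SI-indexed bin, and the aggregation callback stands in for the (read-only, server-side Lua) stream UDF:

```python
# Simulation of the tag-bin pattern:
# 1) tag the records of interest, 2) query on tag == 1 and aggregate
# without modifying records, 3) reset the tags.

def tag_query_aggregate(records, target_pks, agg):
    for pk in target_pks:                     # step 1: tag
        if pk in records:
            records[pk]["tag"] = 1
    # step 2: "SI query" selects tag == 1; agg is the read-only aggregation
    result = agg(rec for rec in records.values() if rec.get("tag") == 1)
    for pk in target_pks:                     # step 3: revert the tags
        if pk in records:
            records[pk]["tag"] = 0
    return result

records = {
    "a": {"value": 3, "tag": 0},
    "b": {"value": 7, "tag": 0},
    "c": {"value": 11, "tag": 0},
}
total = tag_query_aggregate(records, ["a", "c"],
                            lambda recs: sum(r["value"] for r in recs))
# only "a" and "c" are tagged: 3 + 11 = 14, and every tag is 0 again afterwards
```

Note that steps 1–3 are three separate operations against the cluster, which is where the concurrency concern raised below comes from.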

Just thinking out loud…


#3

Thank you for your reply.

Your idea was also my first thought, but I think it creates some concurrency issues. The problem is that my list of primary keys is returned as the result of a different query on another namespace (specifically one sitting in RAM, while the one for the second query is on SSD). That would mean that after the first query runs, the tags have to be set and then the second query run using the SI on that tag. However, if there is a concurrent first query, it would also try to set its own tags…

It seems to me that in that case I would need to somehow define “set tags, run the SI query, revert tags” as an atomic operation. Is that possible at all?


#4

Aerospike has record-level locks only. Stream UDFs are read-only on records and do not lock all the records in the stream during aggregation. They aggregate one record at a time, although in parallel across all nodes, with a final reduce on the client.

You are also going across namespaces for aggregation; that is another level of complexity.

Another wild idea: try iterating over your list of keys (I assume you are storing the key in both namespaces using ‘sendKey’) and running a record UDF on each, passing another record (e.g. key=myaggregation) as an additional argument, where you compute and aggregate the data record by record. This will be much slower but may do the trick. I don’t know for sure whether it will work, because you are going from one record lock to another record lock within the UDF.
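The record-by-record idea can be sketched as follows. This is a pure-Python stand-in, not a working record UDF: in Aerospike the per-record step would be a Lua record UDF invoked once per key (each call taking its own record lock), and the accumulator key `myaggregation` is the hypothetical key from the post:

```python
# Simulation of the record-UDF aggregation: visit one keyed record per call
# and fold its bin value into a separate accumulator record.

def apply_record_udf(store, pk, acc_key):
    """Per-record step: read one record's bin and add it to the accumulator record."""
    rec = store.get(pk)
    if rec is None:
        return  # no record for this key; nothing to aggregate
    acc = store.setdefault(acc_key, {"total": 0, "count": 0})
    acc["total"] += rec["value"]
    acc["count"] += 1

store = {"k1": {"value": 2}, "k2": {"value": 5}}
for pk in ["k1", "k2", "k3"]:   # iterate the key list: one UDF call per key
    apply_record_udf(store, pk, "myaggregation")
# store["myaggregation"] ends up as {"total": 7, "count": 2}
```

Because each call is an independent write to the accumulator record, concurrent runs would need distinct accumulator keys (or a transaction-like scheme) to avoid mixing their partial results.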