Stream UDF over list of primary keys?


#1

Hello, I have an unusual application where at a certain point I get a list of primary keys. I need to perform some computations using the data stored in a bin of each of those records, together with some data that I provide, in a decentralized way, aggregating the results at the client.

Is there a way, similar to a stream UDF, to do this? My problem is that I cannot set up a stream because I have a list of primary keys rather than the output of a secondary index.
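For reference, the fallback when no stream can be set up is to do the aggregation entirely client-side: batch-read the records by key and fold their bin values together with the provided data. Below is a minimal pure-Python sketch of that pattern; the `dict` standing in for the namespace and the per-key lookup are placeholders for a real batch read (e.g. `get_many` in the Aerospike Python client):

```python
# Sketch of client-side aggregation over a list of primary keys.
# 'store' is a stand-in for the namespace; a real client would issue
# one batch read for all keys instead of per-key lookups.

def aggregate_by_keys(store, keys, extra, combine):
    """Fetch each record by primary key and fold its bin value into an accumulator."""
    total = 0
    for pk in keys:
        record = store.get(pk)
        if record is None:
            continue  # keys with no matching record are skipped
        total = combine(total, record["value"], extra)
    return total

store = {1: {"value": 10}, 2: {"value": 20}, 4: {"value": 40}}
result = aggregate_by_keys(store, [1, 2, 3, 4], 5,
                           lambda acc, v, extra: acc + v * extra)
# keys 1, 2 and 4 exist: 10*5 + 20*5 + 40*5 = 350
```

This moves all record data over the network to the client, which is exactly what a stream UDF avoids, so it only makes sense for modest key lists.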

Thanks


#2

Seems like what you need is a query object constructor which takes the namespace and an array of digests from that namespace as its arguments, or alternatively an array of Key objects from the same namespace? Interesting… I don’t see anything in the API that can help you do that.

I am thinking of one approach you can take. Add a ‘tag’ bin with a default value of 0 to all records; then, for the array of keys you have, set it to a value (say 1). Build a numeric SI on tag and run a query with the predicate tag == 1, then do your stream UDF (a stream UDF cannot change the record). After the UDF is done with its aggregation, set the tag bin back to 0 so you don’t pollute your record set for the future.
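The tag-then-query-then-reset flow above can be sketched in miniature. This is a pure-Python simulation over an in-memory record set: the `tag` field plays the role of the SI-indexed bin, and the aggregation callback stands in for the (read-only, server-side Lua) stream UDF:

```python
# Simulation of the tag-bin pattern:
# 1) tag the records of interest, 2) query on tag == 1 and aggregate
# without modifying records, 3) reset the tags.

def tag_query_aggregate(records, target_pks, agg):
    for pk in target_pks:                     # step 1: tag
        if pk in records:
            records[pk]["tag"] = 1
    # step 2: "SI query" selects tag == 1; agg is the read-only aggregation
    result = agg(rec for rec in records.values() if rec.get("tag") == 1)
    for pk in target_pks:                     # step 3: revert the tags
        if pk in records:
            records[pk]["tag"] = 0
    return result

records = {
    "a": {"value": 3, "tag": 0},
    "b": {"value": 7, "tag": 0},
    "c": {"value": 11, "tag": 0},
}
total = tag_query_aggregate(records, ["a", "c"],
                            lambda recs: sum(r["value"] for r in recs))
# only "a" and "c" are tagged: 3 + 11 = 14, and every tag is 0 again afterwards
```

Note that steps 1–3 are three separate operations against the cluster, which is where the concurrency concern raised below comes from.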

Just thinking out loud…


#3

Thank you for your reply.

Your idea was also my first thought, but I think it creates some concurrency issues. The problem is that my list of primary keys is returned as the result of a different query on another namespace (specifically one sitting in RAM, while the one for the second query is on SSD). That would mean that after the first query runs, the tags have to be set and then the second query run using the SI on that tag. However, if there is a concurrent first query, it would also try to set its own tags…

It seems to me that in that case I would need to somehow define “set tags, run the SI query, revert tags” as an atomic operation. Is that possible at all?


#4

Aerospike has record-level locks only. Stream UDFs are read-only on records and do not lock all the records in the stream during aggregation. They aggregate one record at a time, although in parallel across all nodes, with a final reduce on the client.

You are also going across namespaces for aggregation; that is another level of complexity.

Another wild idea: try iterating over your list of keys (I assume you are storing the key in both namespaces using ‘sendKey’) and running a record UDF on each, passing another record (e.g. key=myaggregation) as an additional argument, where you compute and aggregate the data record by record. This will be much slower but may do the trick. I don’t know for sure whether it will work, because you are going from one record lock to another record lock within the UDF.
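The record-by-record idea can be sketched as follows. This is a pure-Python stand-in, not a working record UDF: in Aerospike the per-record step would be a Lua record UDF invoked once per key (each call taking its own record lock), and the accumulator key `myaggregation` is the hypothetical key from the post:

```python
# Simulation of the record-UDF aggregation: visit one keyed record per call
# and fold its bin value into a separate accumulator record.

def apply_record_udf(store, pk, acc_key):
    """Per-record step: read one record's bin and add it to the accumulator record."""
    rec = store.get(pk)
    if rec is None:
        return  # no record for this key; nothing to aggregate
    acc = store.setdefault(acc_key, {"total": 0, "count": 0})
    acc["total"] += rec["value"]
    acc["count"] += 1

store = {"k1": {"value": 2}, "k2": {"value": 5}}
for pk in ["k1", "k2", "k3"]:   # iterate the key list: one UDF call per key
    apply_record_udf(store, pk, "myaggregation")
# store["myaggregation"] ends up as {"total": 7, "count": 2}
```

Because each call is an independent write to the accumulator record, concurrent runs would need distinct accumulator keys (or a transaction-like scheme) to avoid mixing their partial results.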