I’m trying to apply a stream UDF to ALL records in the database. However, if I don’t specify any filter for the statement passed to the queryAggregate method, I just see an error in the log file:
Sep 04 18:33:49 aerospike aerospike[27729]: Sep 04 2014 16:33:49 GMT: DEBUG (udf): (udf_rw.c:udf_call_init:365) UDF scan op received
Sep 04 18:33:49 aerospike aerospike[27729]: Sep 04 2014 16:33:49 GMT: INFO (scan): (thr_tscan.c::756) Only Background Scan UDF supported !!
Sep 04 18:33:49 aerospike aerospike[27729]: Sep 04 2014 16:33:49 GMT: INFO (tsvc): (thr_tsvc.c::388) Scan failed with error -5
Regarding UDFs that are applied to records, there are Record UDFs, which could be applied to every record via a scan (that is run in the background), and there are Stream UDFs which can only be applied to the result of an index query (not a scan).
So, if you wanted to modify every record in the database via a UDF, you would perform a background scan and call a Record UDF on each record to perform some action (such as modifying a bin). If you wanted to perform some sort of aggregation operation across all records in the database, then you would do that via Stream UDFs. You would first have to make sure that every record shares a common field, then create a secondary index on that common field, then run your query on that field and feed the result into a Stream UDF.
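To make the aggregation idea concrete, here is a minimal plain-Python sketch of the aggregate and reduce stages that a Stream UDF chains together. It does not use the Aerospike API (real Stream UDFs run on the server); the record layout, the `region` bin, the two-node split, and the function names are all made up for illustration.

```python
from collections import Counter
from functools import reduce

# Hypothetical records, standing in for the stream a query would produce.
records = [
    {"user": "a", "region": "eu"},
    {"user": "b", "region": "us"},
    {"user": "c", "region": "eu"},
]

def count_by_region(stream):
    """Aggregate step: fold a stream of records into one partial result."""
    counts = Counter()
    for rec in stream:
        counts[rec["region"]] += 1
    return counts

def merge(a, b):
    """Reduce step: combine two partial results (e.g. one per node)."""
    return a + b

# Pretend the records live on two nodes: each node aggregates locally,
# then the partial results are reduced into a single answer.
node_streams = [records[:2], records[2:]]
partials = [count_by_region(s) for s in node_streams]
total = reduce(merge, partials, Counter())
print(dict(total))
```

The point of the split is that only the small partial results travel over the network, not every record.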
@Toby this ties a bit into the email discussion about the bloom filters, which I wanted to populate from within Aerospike with a streaming UDF (to avoid the network IO for transferring every single record). I thought about the secondary index for a common field. However, this would practically halve the storage capacity of the whole cluster, right? Is there any technical reason why stream UDFs need a secondary index?
I guess I will just go down the simple but slightly inefficient scan road. Is there anything I have to be aware of, like issues with a scan if the nodes change?
It’s not by design that we support stream UDFs only from query results. It’s just where we are in the current implementation. In theory, the results of a scan and the results of a query should look and feel the same, but in reality, they actually use different mechanisms to get the record stream(s).
Here’s a bit of history and discussion.
When you think about how scans are done in a relational DB, an index scan and a table scan are just two fairly similar access methods – at least in terms of how they process the output. And, generally, a single table doesn’t hold terabytes of data. However, in distributed NoSQL, there’s somewhat of a different look and feel. In a setting where DBs are fairly large (e.g. in the multi-terabyte range), a distributed scan over an entire namespace is a pretty big deal. Hence, we currently require scans to run in the background. On the other hand, even though a secondary index does span all of the nodes, most secondary index queries are going to be quite a bit smaller. As a result, we run them in the foreground. In the long term, we plan to make this mechanism more general, although that doesn’t help you in the short term.
Now, as for scans over a cluster when nodes change: a scan takes a ScanPolicy parameter that says what to do when things change.
All (or almost all) of our examples set the “failOnClusterChange” parameter. If you continue during a cluster change, you might see inconsistent data.
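As a rough illustration of what a fail-on-cluster-change policy means, here is a small plain-Python simulation. It is not Aerospike code and makes no claim about how the server or client actually detects changes; `scan`, `get_nodes`, `ClusterChanged`, and the partition layout are hypothetical names chosen for the sketch.

```python
class ClusterChanged(Exception):
    """Raised when cluster membership changes while a scan is running."""

def scan(partitions, get_nodes, fail_on_cluster_change=True):
    """Yield records partition by partition.

    `partitions` is a list of record lists; `get_nodes` is a hypothetical
    callable returning the current set of node names.
    """
    nodes_at_start = frozenset(get_nodes())
    for part in partitions:
        # Assumed behaviour: abort rather than return possibly inconsistent data.
        if fail_on_cluster_change and frozenset(get_nodes()) != nodes_at_start:
            raise ClusterChanged("cluster changed during scan; retry later")
        yield from part

nodes = {"n1", "n2"}
seen = []
try:
    for rec in scan([[1, 2], [3, 4]], lambda: nodes):
        seen.append(rec)
        if rec == 2:
            nodes.discard("n2")  # simulate a node leaving mid-scan
except ClusterChanged:
    aborted = True
else:
    aborted = False
```

With the flag set, the caller gets a clean failure and can retry the whole scan; with it cleared, the loop would carry on and return whatever it finds.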