NodeJS Client Scan is not Asynchronous


#1

Hi!

I currently have one Node.js process that performs a full scan on a given set. It is a master process with 4 child processes, and they communicate with each other using process.send. It seems that while the master process is fetching records, every time a child calls “send” to pass a message to the parent, that message is queued until the master process finishes the scan. An example:

var statement = { concurrent: true }
var scan = client.query("namespace", "set", statement);
var stream = scan.execute();

Only after the stream's “end” event is the master able to receive the queued messages from the children.

Is there any way to make the stream asynchronous?

Thanks in advance!


#2

@fdnieves, you are raising a valid concern. The data is fetched from the Aerospike cluster nodes asynchronously and queued up in the client locally. On the main event loop the queue is being drained and the callbacks are being processed. That means, if the queue is being filled up faster than the records are being processed, no other events can get processed until the scan is completed.
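To illustrate the effect in isolation, here is a minimal sketch in plain Node.js. It is not Aerospike client code: `drainSync` and `drainInChunks` are hypothetical helpers, and the arrays stand in for the client's local record queue. It only shows why a synchronous drain holds back other events, and how yielding with `setImmediate` between chunks lets them through.

```javascript
// Hypothetical helpers; this is plain Node.js, not Aerospike client code.

// Drains the whole queue in one synchronous pass. No other event
// (timer, IPC message, socket data) is handled until it returns.
function drainSync (queue, onRecord) {
  while (queue.length > 0) onRecord(queue.shift())
}

// Drains at most `chunkSize` records per event loop turn, then yields
// with setImmediate so that already pending events can be handled.
function drainInChunks (queue, onRecord, chunkSize, done) {
  for (let i = 0; i < chunkSize && queue.length > 0; i++) {
    onRecord(queue.shift())
  }
  if (queue.length > 0) {
    setImmediate(drainInChunks, queue, onRecord, chunkSize, done)
  } else if (done) {
    done()
  }
}

// Demo: an event (standing in for a message from a child process) is
// scheduled before the drain starts.
const log = []
setImmediate(() => log.push('message from child'))
drainInChunks([1, 2, 3, 4], (r) => log.push('record ' + r), 2, () => {
  // The message is logged between the two chunks, not after all records.
  console.log(log)
})
```

With `drainSync` in place of `drainInChunks`, the message would only be handled after the last record.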

Work is underway to switch to an asynchronous processing model using the newly released C client version 4.0. But we are also discussing within the team whether there is some short term solution to this specific issue in the interim. I will keep you posted on the outcome.


#3

Thanks a lot for the information Jan and for keeping me updated with any news regarding this concern.


#4

Hi Jan, how are you? Just wanted to know if there is any news regarding this topic. Thanks!


#5

Your timing is excellent! I have just released version 2.0.0-alpha.2 of the Aerospike Node.js client. One of the main features in this 2nd pre-release is a new Scan API which uses asynchronous, i.e. non-blocking IO.

Here is a quick example of how you would use the new Scan API (it’s very similar to the existing Query API in the v1.x client):

 const Aerospike = require('aerospike')

 Aerospike.connect({ hosts: '192.168.33.10' }, (error, client) => {
   if (error) throw error

   var scan = client.scan('test', 'demo')
   scan.priority = Aerospike.scanPriority.LOW // scan with low priority
   scan.percent = 50 // scan only 50% of all records in the set
   scan.concurrent = true // scan all nodes in parallel

   var recordsSeen = 0
   var stream = scan.execute()
   stream.on('error', (error) => { throw error })
   stream.on('end', () => {
     console.log('finished!')
     client.close()
   })
   stream.on('data', (record) => {
     console.log(record)
     recordsSeen++
     if (recordsSeen > 100) stream.abort() // We've seen enough!
   })
 })

To get the pre-release version you have to specify the exact version number or include the pre-release tag if you specify a range:

{ "aerospike": ">2.0.0-alpha" }

Some further information:


#6

What a coincidence! Really glad to hear this news, I’m going to test it and let you know the results.

Thank you very much Jan !


#7

Hi Federico, I released v2.0 final last week. Was wondering whether you had a chance to try out the new Scan API introduced in alpha2 and whether it fixed the issues you were previously seeing with the v1 client?

Best Regards, Jan


#8

Hi Jan, thanks for the reminder.

I've been trying to make it work but couldn’t; it raises the following error for me:

/path/to/node_modules/aerospike/lib/query.js:355
var scanOptions = new Set(['UDF', 'concurrent', 'percentage', 'nobins', 'pri
                      ^
ReferenceError: Set is not defined
    at assertValidQueryOptions (/path/to/node_modules/aerospike/lib/query.js:355:25)
    at new Query (/path/to/node_modules/aerospike/lib/query.js:152:3)
    at Client.query (/path/to/node_modules/aerospike/lib/client.js:1217:10)
    at get_cookies (/path/to/retargetly/test.js:30:32)
    at main (/path/to/retargetly/test.js:10:2)
    at Object.<anonymous> (/path/to/retargetly/test.js:57:1)
    at Module._compile (module.js:456:26)
    at Object.Module._extensions..js (module.js:474:10)
    at Module.load (module.js:356:32)
    at Function.Module._load (module.js:312:12)

Do you have any idea what is happening? Here is the example code:

var asdRecord = 0
var statement = { 'filters': [] }
var scan = asd_client.query("relypt", "domain", statement);
// scan.concurrent = true
// scan.percent = 1
var stream = scan.foreach();

stream.on('data', function(record) {
  asdRecord++;
  console.log(asdRecord);
  record = null;
});

stream.on('error', function(error) { console.log("Error occuring while scanning "+error); });

stream.on('end', function(end) {
  console.log("TOTAL RECORDS SCANNED: %d", asdRecord);
  // end_of_process("NO BATCHES")
});

Thanks!!


#9

Hi Federico,

Are you using Node.js v0.10? The Set class is part of the ES6 spec and was added to Node.js in v0.12.

Unfortunately, with v2.0 of the Aerospike Node.js client we had to drop support for Node.js v0.10. That version of Node.js uses a very old, pre-1.x version of libuv which doesn’t work with the new async IO in the v2 client.

Node.js 0.10 reaches the end of its life in October 2016; 0.12 will follow soon after in December 2016. [1] Maybe you can consider upgrading to the v4 LTS release?

Cheers, Jan

[1] Node.js Long Term Support Release Schedule: https://raw.githubusercontent.com/nodejs/LTS/master/schedule.png
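If it helps, a script can check for this up front instead of failing deep inside a dependency. The following is only a sketch: `isAtLeastNode012` is a hypothetical helper, not something the Aerospike client provides, and the 0.12 threshold is taken from the `Set` requirement mentioned above.

```javascript
// Hypothetical helper, not part of the Aerospike client.
// Returns true if a Node.js version string such as '0.10.48' or '4.4.7'
// is at least 0.12, the first release line with a global `Set`.
// Written with `var` so that it also runs on very old Node.js versions.
function isAtLeastNode012 (version) {
  var parts = version.split('.').map(Number)
  var major = parts[0]
  var minor = parts[1]
  return major > 0 || minor >= 12
}

// Warn early if the running Node.js is too old or lacks `Set`.
if (!isAtLeastNode012(process.versions.node) || typeof Set !== 'function') {
  console.error('Node.js ' + process.versions.node + ' is too old; please upgrade.')
}
```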


#10

Hi Jan!

Sorry for the delay here, too much work to do.

Yes, I’ve now upgraded my Node.js version and tested the code, and it works really well! Congrats on that!

I will try to test it a little bit further with real cases and maybe implement it on some processes. Will let you know if anything comes up!!

Thanks for the replies and the hard work on this, really happy about this new feature :slight_smile:


#11

Great to hear that the new client is working well for you. Thanks for the feedback!


#12

Jan, one question: is there any way I can pause the scan, execute some other code, and then resume where it left off?

Thanks!


#13

Unfortunately, there isn’t at the moment. You can abort a scan/query at any time, but there is currently no way to resume it.

This feature has been requested before, e.g. see here and here. I am looking into ways to support this for some future release. But no promises as to the timeline at this point.


#14

I’ve made a script that uses two files: one scans a database set, and the other receives every record and runs calculations on it.

The connection is via sockets using https://nodejs.org/api/net.html .

The thing is that the script that does the calculations processes records more slowly than the scan receives them. So after every X records we put the scan process to sleep (10 seconds) so that the process queue doesn’t eat up all the memory.

Do you see any harm in doing this? I already checked, and all the records are read normally (there are no duplicates and no missing records).

Off topic: Another issue: it seems that “scan.percent = 1” is not working. The scan reads all records regardless of the value of percent.

Thanks!


#15

What is the purpose of splitting the processing over two scripts? Is it to better utilise multiple CPUs on your machine, i.e. are you running multiple copies of the script that does the heavy calculations? You might want to look at Node’s Cluster module in that case. Or are the two scripts running on separate machines as well?

If you can execute the scan and the processing in a single process, then the speed at which the records are read by the scan pretty much regulates itself automatically due to the way Node.js’s event handling works. There is a test case that shows this pretty well in the test/stress/scan.js test. The second test definition uses a busy loop to simulate heavy computation like in your scenario. Since this will keep the event loop busy, Node will have less time to process I/O events, incl. from network socket connections. If the client isn’t able to read data from the socket fast enough, the Aerospike server will automatically throttle the amount of data it sends. That way the client will not use up excessive amounts of memory while queuing the data. I went into this in quite some detail in a recent blog post.

Does this make sense?

Cheers, Jan


#16

Hi Jan!

Thanks for all the information and the detailed explanation. It is a great analysis that you’ve made.

For our use case, what we do is a full scan, process each record, and then, if the record is eligible for update, we put two different documents into different sets.

The thing is that the scan goes very fast, which is great, but the generated puts cannot all be executed at once because otherwise the Aerospike client hits a timeout on that operation. It seems that too many puts at the same time overload the system, so puts cannot be executed as rapidly as the scan gets records. So what we did is add a queue from which we extract one “put” operation every 4 milliseconds. And here comes the problem: the queue starts growing because the scan grabs records much faster than the puts write them. So the memory goes up and up until memory allocation fails and the process exits with an error.
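For reference, the queue described above works roughly like the following sketch. The names are made up for illustration (`createPutQueue`, `putFn`, `peakSize`), and `putFn` stands in for whatever actually performs the write; this is not our production code and not part of the Aerospike client.

```javascript
// Hypothetical sketch of a timer-drained put queue.
// `putFn` is a placeholder for the function that performs one write.
function createPutQueue (putFn) {
  const pending = []
  let timer = null
  let peak = 0

  // Adds one put operation to the backlog and tracks the largest backlog seen.
  function enqueue (op) {
    pending.push(op)
    if (pending.length > peak) peak = pending.length
  }

  // Executes the oldest queued operation; returns false if the queue is empty.
  function drainOne () {
    if (pending.length === 0) return false
    putFn(pending.shift())
    return true
  }

  // Starts draining one operation every `intervalMs` milliseconds.
  function start (intervalMs) {
    if (timer === null) timer = setInterval(drainOne, intervalMs)
  }

  // Stops the drain timer so the process can exit.
  function stop () {
    if (timer !== null) {
      clearInterval(timer)
      timer = null
    }
  }

  return {
    enqueue: enqueue,
    drainOne: drainOne,
    start: start,
    stop: stop,
    size: function () { return pending.length },
    peakSize: function () { return peak }
  }
}
```

In the scan's 'data' handler we call something like `enqueue`, and the timer is started with an interval of 4. Since nothing slows the scan down, `size()` keeps rising whenever records arrive faster than one per interval.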

Do you have any idea if there is anything else that we could do? The sleep option still is a working one, but we want to make this as good as possible.

Thanks again for everything Jan !