Experiencing pauses in data when running AerospikeClient.scanNode

I am seeing odd behavior when running scans using AerospikeClient.scanNode (Java API) to retrieve all records in a set. (Some background that may or may not be relevant: this code runs inside Hadoop MapReduce, with one mapper for each node.)

There is a long delay (1-3 minutes) before the first record is returned, then a period of 1-3 minutes where records are returned (at a rate of a few thousand per second), then a 1-3 minute delay, then 1-3 minutes of data, and so on.

There are no other jobs running. The cluster appears to be otherwise healthy, responding to a few thousand non-job related requests per second.

Does anyone have any suggestions for what may be causing this, or how to proceed investigating?

Thanks,
Marc

Hi Marc

You might have a heap or GC issue in your Java application.

Are you using scanAll(), or query() without a filter, to retrieve the records?

regards Peter

I am using scanNode().

I increased the heap (it was at 200MB) and the problem went away.

Unfortunately we will never know if this was the real problem, because between running with the smaller heap and the bigger heap, our system went through a rolling upgrade. This is our production system and I don’t have the ability to conduct experiments on it. Thanks for your help.

Hi Marc

If you use query() with NO filter, it is the same as a scan, but the advantage is that the records are returned through a RecordSet in a controlled manner. There is a blocking queue between your application and the records that the nodes are returning. When the queue fills, the nodes pause the scan jobs. When you read from the queue, the scan jobs resume.

It's a very controlled way of reading large RecordSets, and it makes it easier to control the heap space requirements.

I hope this helps

Peter

Interesting. We actually implemented a similar solution when using scan (using a blocking queue, which would block inside our implementation of the callback method) to cope with this exact problem. Perhaps we should have used query instead?
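For readers following along, here is a minimal sketch of the blocking-queue approach described above, written with only the Java standard library. It is not the code from this thread: RecordCallback, BoundedScanSketch, and the string records are hypothetical stand-ins for the client's scan callback and real records, and a plain thread plays the part of the scan. The point it illustrates is that a bounded queue caps how many records sit on the heap at once, because the callback blocks whenever the consumer falls behind.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class BoundedScanSketch {
    /** Hypothetical stand-in for a scan callback; NOT the Aerospike interface. */
    interface RecordCallback {
        void onRecord(String record) throws InterruptedException;
    }

    // Sentinel that tells the consumer the simulated scan has finished.
    private static final String END = "__END__";

    /**
     * Pushes `total` simulated records through a queue of the given capacity.
     * Returns {records consumed, largest queue depth the consumer observed}.
     */
    public static int[] run(int total, int capacity) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(capacity);

        // put() blocks while the queue is full, so the scanning thread
        // stalls inside the callback until the consumer catches up.
        RecordCallback callback = queue::put;

        // Simulated scan: in the real job the client library would invoke
        // the callback once per record returned by the node.
        Thread scanner = new Thread(() -> {
            try {
                for (int i = 0; i < total; i++) {
                    callback.onRecord("record-" + i);
                }
                queue.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        scanner.start();

        int consumed = 0;
        int maxDepth = 0;
        while (true) {
            maxDepth = Math.max(maxDepth, queue.size());
            String record = queue.take(); // blocks while the queue is empty
            if (END.equals(record)) {
                break;
            }
            consumed++; // real code would process the record here
        }
        scanner.join();
        return new int[] {consumed, maxDepth};
    }

    public static void main(String[] args) throws InterruptedException {
        int[] result = run(10_000, 100);
        System.out.println("consumed=" + result[0] + " maxDepth=" + result[1]);
    }
}
```

The observed queue depth never exceeds the capacity passed to run(), which is what keeps heap use predictable regardless of how fast records arrive. The capacity of 100 is an arbitrary illustrative value, not a recommendation.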

It’s up to you, but query controls the jobs running on each node via the protocol, so it is my favorite. Be sure to close the record set when you are finished with it.

Regards

Peter