How can I run a scan job in map-reduce mode?

scan
hadoop

#1

We run a scan across all nodes every day and extract some results from it. We’re now thinking of turning each full scan into a map-reduce job, but I see something that may make this attempt fail: I haven’t found any simple way to split the scan job into multiple sub-tasks. For example, I have 8 nodes in one Aerospike cluster, and the only split I can think of for distributing the data set across mappers is to scan the 8 nodes using 8 mappers, one per node. We have 4 sets on each node, and of course I could scan each set on each node to use more mappers, but that would leave the data unevenly balanced across mappers.

So is there any way for me to scan one set on one node using multiple processes?


#2

I don’t quite understand what you want to do, but if you want to split a scan job up into chunks, that’s possible. I think you can use the digest modulo feature: http://www.aerospike.com/docs/guide/predicate.html. Using digest modulo, you should be able to spawn a program and have each thread scan, say, 10% of the data set.
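To illustrate the idea only, here is a minimal Python sketch of how a digest-modulo split partitions records among workers. It does not call the Aerospike client: `stand_in_digest`, `belongs_to_worker`, and `split_keys` are hypothetical helpers, and SHA-1 is used purely as a stand-in for the real record digest. In a real scan, the modulo test would be expressed as a predicate filter and evaluated on the server.

```python
import hashlib


def stand_in_digest(set_name: str, user_key: str) -> bytes:
    # Assumption: SHA-1 is only a stand-in here. It is NOT how the
    # database computes its record digest; any uniform hash shows the idea.
    return hashlib.sha1(f"{set_name}:{user_key}".encode()).digest()


def belongs_to_worker(digest: bytes, num_workers: int, worker_id: int) -> bool:
    # Digest modulo: interpret a few digest bytes as an integer and keep
    # the record only if the remainder matches this worker's id.
    return int.from_bytes(digest[-4:], "little") % num_workers == worker_id


def split_keys(set_name: str, keys: list, num_workers: int) -> list:
    # Simulate every worker applying its own filter to the same key stream.
    buckets = [[] for _ in range(num_workers)]
    for key in keys:
        digest = stand_in_digest(set_name, key)
        for worker_id in range(num_workers):
            if belongs_to_worker(digest, num_workers, worker_id):
                buckets[worker_id].append(key)
    return buckets


if __name__ == "__main__":
    keys = [f"user{i}" for i in range(1000)]
    for worker_id, bucket in enumerate(split_keys("demo", keys, 10)):
        print(worker_id, len(bucket))
```

Because the split depends only on the digest and the pair (number of workers, worker id), each record lands in exactly one bucket, so the workers can be independent and need no coordination with each other.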


#3

@Albot thank you very much. What you suggest is exactly what I’m looking for. Unfortunately, we’re using Aerospike 3.9.0.3. Is there any workaround to get this feature in 3.9.0.3? By the way, I think I can use digest modulo to divide a scan task across multiple processes, not just threads. Is that right?


#4

Just upgrade! :slight_smile: I don’t know of another way.