Very slow fetching of records with scanAll()

spring
index
secondary
scan

#1

Hi,

I started evaluating Aerospike on my notebook to see it in action, but after setting up a sample web application with help of Spring Framework - I was somewhat astonished to see that a scanAll() operation which returns 50 elements from a given set takes over 1000ms :open_mouth: . I know that my development machine is not a speed daemon and it has no SSD disk, but I expected much faster response times. When I replace the repository bean with one that connects to a Couchbase instance - then I get the same set of records in 90-100ms. I’ve tried setting higher priorities in clientPolicy.scanPolicyDefault.priority, but the result is the same.

Here is my namespace configuration (the rest of aerospike.conf is default):

namespace tpotest {
        memory-size 2G
        default-ttl 0 
        storage-engine memory
        storage-engine device {
                data-in-memory true
                filesize 8G
                file /opt/aerospike/data/tpotest.dat
        }
}

And here is AerospikClient configuration:

	@Bean
	public AerospikeClient getAerospikeClient(){
		ClientPolicy clientPolicy = new ClientPolicy();
		clientPolicy.readPolicyDefault.timeout = 500;
		clientPolicy.readPolicyDefault.maxRetries = 4;
		clientPolicy.readPolicyDefault.sleepBetweenRetries = 150;
		clientPolicy.writePolicyDefault.recordExistsAction = RecordExistsAction.UPDATE;
		clientPolicy.failIfNotConnected = true;
		clientPolicy.timeout = 500;
		clientPolicy.maxThreads = 100;
		clientPolicy.queryPolicyDefault.priority = Priority.HIGH;
		clientPolicy.queryPolicyDefault.recordQueueSize = 100;
		clientPolicy.scanPolicyDefault.concurrentNodes = false;
		clientPolicy.scanPolicyDefault.includeBinData = true;
		clientPolicy.scanPolicyDefault.priority = Priority.HIGH;
		clientPolicy.scanPolicyDefault.timeout = 500;

		return new AerospikeClient(clientPolicy, "127.0.0.1", 3000);
	}

What else can I try to make it work as it should (blazing fast)? :smile:

I’m using:

  • Aerospike Server Community Edition 3.4.0
  • Ubuntu 14.04
  • Aerospike Java Client 3.0.32
  • a notebook with 2 CPU cores and 8GB of RAM.

Regards, Tomasz


#2

Hi Tomasz

I can do a scan of 99,999 records in ~700 milli seconds. The client is Java running on my Mack Book Pro and the server is running in a Virtual Box VM. So all on the same machine.

The Java client takes a moment to “warm up”, you may being incurring the connection cost as part of your test. Try running your test twice and measuring the time of the 2nd test.

Regards Peter


#3

Hi Peter,

I wouldn’t bother anyone here just after running it once or twice :). I also have similar results when running a query from aql tool - it also fetches my 50 records in 0,9 seconds. Apparently there is something wrong in my setup (I’m using default out-of-the-box values), but initially I posted this issue to Java Client group as I had thought it was related… Anyway, maybe someone here can point me in the right direction?

Regards, Tomasz


#4

Hi Tomasz

To eliminate the Java client configuration, can you try your test without the client policy?

Also, are you running a vm on your laptop, if so what is the host OS?

Regards


#5

Hi Peter,

I’m testing Aerospike directly without any VM under Ubuntu 14.04 LTS. What I have tried so far:

  • using AerospikeClient.scanAll() - yields 10 records in around a second
  • using AerospikeClient.query() with a secondary index - similar time
  • using both above without any ClientPolicy applied
  • using aql tool:
aql> SELECT login FROM test.users
10 rows in set (0.880 secs)

which gives me such an entry in aerospike.log:

Jan 13 2015 07:10:28 GMT: INFO (scan): (thr_tscan.c::280) SCAN JOB DONE  [id =6: ns= test set=users scanned=10 expired=0 set_diff=0 elapsed=976 (ms)]

(debug logging shows that 4096 partitions were scanned…)

When I query for records by PK then I get records instantly (aql reports 0.000 to 0.001s for a single record).

Regards, Tomasz


#6

Hi Tomasz

You are getting some very unusual results. I’m guessing something is configured wrong. Can you post your full aerospike.conf file, and can you post a description of your record. Whit this I will simulate your issue.

BTW: Where are you located? - I’m teaching Aerospike for Developers in Vilnius next week and in Berlin the week after; if you can make it to either location, I can put you on the class for no cost.

Peter


#7

Hi Peter,

I also suspect a configuration issue, yet I’m testing with out-of-the-box configuration which I presume should be correct in 99% of cases (mine is likely falling into that missing 1%). Yesterday I even compiled it from sources, used aerospike_dev.conf settings, but still - no improvement. You can see my configuration and logs in this ZIP file where I added: /var/log/aerospike/aerospike.log, config files tested plus output from dmesg, free, lsmod, sysctl -a and my /proc/cpuinfo.

Thanks for the training offer, but I won’t make it on such a short notice (though Berlin is not that far away from Bielsko-Biała, Poland).

Regards, Tomasz


#8

Oh, forgot to add details about my records, but they are very simple:

INSERT INTO test.users (PK, login) VALUES ('login1', 'login1')

and then it’s repeated for 'login1''login10'. So I created a set users and put there 10 records with a single text bin login. After that SELECT * FROM test.users in aql gets me those 10 records in 0.8s :(.

Tomasz


#9

Yesterday, after having talked to one of your Solution Engineers, I tried to provide a “bare-metal” configuration and switched off all graphical interface processes and left the necessary processes running only. Yet, there was no significant improvement and full-scan timings for my little data set were similar. Mysterious, isn’t it? :smile:

Regards, Tomasz


#10

Hi Tomasz

I finally got a chance to write some code to duplicate issue. Here is what I have found:

Scenario - 100,000 records, each containing 1 string ~10 characters and 1 integer

aql running on the same node (eliminating network latency, but adding CPU usage) select * from test.demo 100000 rows in set (0.860 secs)

Java coded application to scanAll with the bins on the same node $ java -jar aerospike-top-10-1.0-full.jar -S Scan time for 100000 is 853 milliseconds

You can find the code at https://github.com/helipilot50/aerospike-top-10-aggregation.git, it is part of another example.

To duplicate follow the instructions in the README.md to build a runnable jar.

  • Run the jar (as shown) with the -l option to load data
  • Run the jar with the -S option to scanAll

Using this example, running on the same node as your server, you will eliminate the network latency and any code differences between the scan you have coded and aql.

hope this helps Peter


#11

Hi Peter,

Thanks for that sample code. With different set sizes, using your aerospike-top-10-aggregation -S I got these results:

Scan time for 5000 is 908 milliseconds
Scan time for 50000 is 1139 milliseconds
Scan time for 100000 is 1145 milliseconds
Scan time for 250000 is 1425 milliseconds
Scan time for 500000 is 1832 milliseconds

(I run each test 10x on given set, above are the fastest times).

It seems that on my test environment there is like 800-900ms of initialization/warm-up(?) phase each time and then the rest is actually depending on the amount of data. I will give it a try somewhere else to see if it isn’t high time to upgrade my development hardware :wink:

[edit] On a server with memory-only storage I get 100K records in 155ms and 1M records in a second, so I think I’ll give up here and buy a new development machine…

Regards, Tomasz


#12

Hi Tomasz, I looked at your scenario and did testing in my local environment on my laptop where I’ve VM machine(Vagrant) running with aerospike server(3.4.0).

-> I had created set called “login” and bin called “login” and inserted 10 records as below:- aql> select * from test.login 10 rows in set (0.281 secs)

-> Then ran scanAll() API call on above set ,so scanAll completed in around ~250 to 290 milliseconds range Scan time for 10 record is 290 milliseconds

-> Lastly ran AerospikeClient.query() call on secondary indexed bin “login” with filter setting as below

stmt.setFilters( Filter.equal(“login”, “login2”) );

So I got 1 record fetched in around ~8 to 10 milliseconds range all the time as below .

  • Record 1 is: {login=login2}
  • Query time for fetching 1 record is 9 milliseconds

So querying small data sets should take few milliseconds rather than seconds. Also all these depends on network latency,IO controller,amount of data … I noticed in your conf file that “proto-fd-max” value is not the default one i.e. 15000 .Try with default value Also , you can run asloglatency command to check how much time your operations are taking i.e. % of operation with respect to time. Here is the link for Log latency check - http://www.aerospike.com/docs/tools/asloglatency/


#13

Hi Tomasz,

There is a warm up period, especially with the Java client, because the class loader needs to load everything and JIT it as it is used. You see this in short test, but not so much in production.

Peter


#14

Hi Anand,

I tested both config files (aerospike_dev.conf bundled with sources of aerospike-server and aerospike.conf that comes with .deb package). I will see what this asloglatency script can tell and post any findings here.

In regard to fetching single records - this is very fast when querying by key or for an indexed bin, I’m only curious why it’s so slow on my machine when doing a scan for the whole set (which has 10 records :wink: ).

Regards, Tomasz


#15

Hi Peter,

By “initialization/warm-up phase” I meant some server-side operations that seem to eat those 800ms in each scan and then the rest of time seems to depend really on the amount of data. Of course you’re right with class loader and JIT compilation, but it all happens before you start timing getting back of results from your TopTen.scanAll() method and shouldn’t have any influence here.

Regards, Tomasz


#16

Hi Tomasz

You are right, this is a warmup time on the server too, and it is more obvious in a single server cluster.

Peter


#17

Hi,

@anandprakash, @helipilot50: I finally got some time to dig more into Aerospike Server’s code, run my tests with detailed logging and get to know (more or less) the idea behind “scan” operation. I only wish I had found this page from Operations Manual earlier as it seems I was using the wrong tool for my task. Now when I use a secondary index on additional “type” bin that resembles my set’s name - I get my 10 test records in 30-40ms on my dev machine.

Regards, Tomasz


#18

That’s great, Tomasz. Scans are long-running, resource-intensive jobs, so they should be implemented properly to get the expected result since they do entire namespace scan which might slow down your operation. So, it’s good to have secondary indexed bin to query and get filtered data.