Very slow fetching of records with scanAll()

Hi,

I started evaluating Aerospike on my notebook to see it in action, but after setting up a sample web application with help of Spring Framework - I was somewhat astonished to see that a scanAll() operation which returns 50 elements from a given set takes over 1000ms :open_mouth: . I know that my development machine is not a speed daemon and it has no SSD disk, but I expected much faster response times. When I replace the repository bean with one that connects to a Couchbase instance - then I get the same set of records in 90-100ms. Iā€™ve tried setting higher priorities in clientPolicy.scanPolicyDefault.priority, but the result is the same.

Here is my namespace configuration (the rest of aerospike.conf is default):

namespace tpotest {
        memory-size 2G
        default-ttl 0 
        storage-engine memory
        storage-engine device {
                data-in-memory true
                filesize 8G
                file /opt/aerospike/data/tpotest.dat
        }
}

And here is AerospikClient configuration:

	@Bean
	public AerospikeClient getAerospikeClient(){
		ClientPolicy clientPolicy = new ClientPolicy();
		clientPolicy.readPolicyDefault.timeout = 500;
		clientPolicy.readPolicyDefault.maxRetries = 4;
		clientPolicy.readPolicyDefault.sleepBetweenRetries = 150;
		clientPolicy.writePolicyDefault.recordExistsAction = RecordExistsAction.UPDATE;
		clientPolicy.failIfNotConnected = true;
		clientPolicy.timeout = 500;
		clientPolicy.maxThreads = 100;
		clientPolicy.queryPolicyDefault.priority = Priority.HIGH;
		clientPolicy.queryPolicyDefault.recordQueueSize = 100;
		clientPolicy.scanPolicyDefault.concurrentNodes = false;
		clientPolicy.scanPolicyDefault.includeBinData = true;
		clientPolicy.scanPolicyDefault.priority = Priority.HIGH;
		clientPolicy.scanPolicyDefault.timeout = 500;

		return new AerospikeClient(clientPolicy, "127.0.0.1", 3000);
	}

What else can I try to make it work as it should (blazing fast)? :smile:

Iā€™m using:

  • Aerospike Server Community Edition 3.4.0
  • Ubuntu 14.04
  • Aerospike Java Client 3.0.32
  • a notebook with 2 CPU cores and 8GB of RAM.

Regards, Tomasz

Hi Tomasz

I can do a scan of 99,999 records in ~700 milli seconds. The client is Java running on my Mack Book Pro and the server is running in a Virtual Box VM. So all on the same machine.

The Java client takes a moment to ā€œwarm upā€, you may being incurring the connection cost as part of your test. Try running your test twice and measuring the time of the 2nd test.

Regards Peter

Hi Peter,

I wouldnā€™t bother anyone here just after running it once or twice :). I also have similar results when running a query from aql tool - it also fetches my 50 records in 0,9 seconds. Apparently there is something wrong in my setup (Iā€™m using default out-of-the-box values), but initially I posted this issue to Java Client group as I had thought it was relatedā€¦ Anyway, maybe someone here can point me in the right direction?

Regards, Tomasz

Hi Tomasz

To eliminate the Java client configuration, can you try your test without the client policy?

Also, are you running a vm on your laptop, if so what is the host OS?

Regards

Hi Peter,

Iā€™m testing Aerospike directly without any VM under Ubuntu 14.04 LTS. What I have tried so far:

  • using AerospikeClient.scanAll() - yields 10 records in around a second
  • using AerospikeClient.query() with a secondary index - similar time
  • using both above without any ClientPolicy applied
  • using aql tool:
aql> SELECT login FROM test.users
10 rows in set (0.880 secs)

which gives me such an entry in aerospike.log:

Jan 13 2015 07:10:28 GMT: INFO (scan): (thr_tscan.c::280) SCAN JOB DONE  [id =6: ns= test set=users scanned=10 expired=0 set_diff=0 elapsed=976 (ms)]

(debug logging shows that 4096 partitions were scannedā€¦)

When I query for records by PK then I get records instantly (aql reports 0.000 to 0.001s for a single record).

Regards, Tomasz

Hi Tomasz

You are getting some very unusual results. Iā€™m guessing something is configured wrong. Can you post your full aerospike.conf file, and can you post a description of your record. Whit this I will simulate your issue.

BTW: Where are you located? - Iā€™m teaching Aerospike for Developers in Vilnius next week and in Berlin the week after; if you can make it to either location, I can put you on the class for no cost.

Peter

Hi Peter,

I also suspect a configuration issue, yet Iā€™m testing with out-of-the-box configuration which I presume should be correct in 99% of cases (mine is likely falling into that missing 1%). Yesterday I even compiled it from sources, used aerospike_dev.conf settings, but still - no improvement. You can see my configuration and logs in this ZIP file where I added: /var/log/aerospike/aerospike.log, config files tested plus output from dmesg, free, lsmod, sysctl -a and my /proc/cpuinfo.

Thanks for the training offer, but I wonā€™t make it on such a short notice (though Berlin is not that far away from Bielsko-Biała, Poland).

Regards, Tomasz

Oh, forgot to add details about my records, but they are very simple:

INSERT INTO test.users (PK, login) VALUES ('login1', 'login1')

and then itā€™s repeated for 'login1' ā€¦ 'login10'. So I created a set users and put there 10 records with a single text bin login. After that SELECT * FROM test.users in aql gets me those 10 records in 0.8s :(.

Tomasz

Yesterday, after having talked to one of your Solution Engineers, I tried to provide a ā€œbare-metalā€ configuration and switched off all graphical interface processes and left the necessary processes running only. Yet, there was no significant improvement and full-scan timings for my little data set were similar. Mysterious, isnā€™t it? :smile:

Regards, Tomasz

Hi Tomasz

I finally got a chance to write some code to duplicate issue. Here is what I have found:

Scenario - 100,000 records, each containing 1 string ~10 characters and 1 integer

aql running on the same node (eliminating network latency, but adding CPU usage) select * from test.demo 100000 rows in set (0.860 secs)

Java coded application to scanAll with the bins on the same node $ java -jar aerospike-top-10-1.0-full.jar -S Scan time for 100000 is 853 milliseconds

You can find the code at GitHub - helipilot50/aerospike-top-10-aggregation: Finding the top 10 with an Aerospike aggregation, it is part of another example.

To duplicate follow the instructions in the README.md to build a runnable jar.

  • Run the jar (as shown) with the -l option to load data
  • Run the jar with the -S option to scanAll

Using this example, running on the same node as your server, you will eliminate the network latency and any code differences between the scan you have coded and aql.

hope this helps Peter

Hi Peter,

Thanks for that sample code. With different set sizes, using your aerospike-top-10-aggregation -S I got these results:

Scan time for 5000 is 908 milliseconds
Scan time for 50000 is 1139 milliseconds
Scan time for 100000 is 1145 milliseconds
Scan time for 250000 is 1425 milliseconds
Scan time for 500000 is 1832 milliseconds

(I run each test 10x on given set, above are the fastest times).

It seems that on my test environment there is like 800-900ms of initialization/warm-up(?) phase each time and then the rest is actually depending on the amount of data. I will give it a try somewhere else to see if it isnā€™t high time to upgrade my development hardware :wink:

[edit] On a server with memory-only storage I get 100K records in 155ms and 1M records in a second, so I think Iā€™ll give up here and buy a new development machineā€¦

Regards, Tomasz

Hi Tomasz, I looked at your scenario and did testing in my local environment on my laptop where Iā€™ve VM machine(Vagrant) running with aerospike server(3.4.0).

ā†’ I had created set called ā€œloginā€ and bin called ā€œloginā€ and inserted 10 records as below:- aql> select * from test.login 10 rows in set (0.281 secs)

ā†’ Then ran scanAll() API call on above set ,so scanAll completed in around ~250 to 290 milliseconds range Scan time for 10 record is 290 milliseconds

ā†’ Lastly ran AerospikeClient.query() call on secondary indexed bin ā€œloginā€ with filter setting as below

stmt.setFilters( Filter.equal(ā€œloginā€, ā€œlogin2ā€) );

So I got 1 record fetched in around ~8 to 10 milliseconds range all the time as below .

  • Record 1 is: {login=login2}
  • Query time for fetching 1 record is 9 milliseconds

So querying small data sets should take few milliseconds rather than seconds. Also all these depends on network latency,IO controller,amount of data ā€¦ I noticed in your conf file that ā€œproto-fd-maxā€ value is not the default one i.e. 15000 .Try with default value Also , you can run asloglatency command to check how much time your operations are taking i.e. % of operation with respect to time. Here is the link for Log latency check - http://www.aerospike.com/docs/tools/asloglatency/

Hi Tomasz,

There is a warm up period, especially with the Java client, because the class loader needs to load everything and JIT it as it is used. You see this in short test, but not so much in production.

Peter

Hi Anand,

I tested both config files (aerospike_dev.conf bundled with sources of aerospike-server and aerospike.conf that comes with .deb package). I will see what this asloglatency script can tell and post any findings here.

In regard to fetching single records - this is very fast when querying by key or for an indexed bin, Iā€™m only curious why itā€™s so slow on my machine when doing a scan for the whole set (which has 10 records :wink: ).

Regards, Tomasz

Hi Peter,

By ā€œinitialization/warm-up phaseā€ I meant some server-side operations that seem to eat those 800ms in each scan and then the rest of time seems to depend really on the amount of data. Of course youā€™re right with class loader and JIT compilation, but it all happens before you start timing getting back of results from your TopTen.scanAll() method and shouldnā€™t have any influence here.

Regards, Tomasz

Hi Tomasz

You are right, this is a warmup time on the server too, and it is more obvious in a single server cluster.

Peter

Hi,

@anandprakash, @helipilot50: I finally got some time to dig more into Aerospike Serverā€™s code, run my tests with detailed logging and get to know (more or less) the idea behind ā€œscanā€ operation. I only wish I had found this page from Operations Manual earlier as it seems I was using the wrong tool for my task. Now when I use a secondary index on additional ā€œtypeā€ bin that resembles my setā€™s name - I get my 10 test records in 30-40ms on my dev machine.

Regards, Tomasz

Thatā€™s great, Tomasz. Scans are long-running, resource-intensive jobs, so they should be implemented properly to get the expected result since they do entire namespace scan which might slow down your operation. So, itā€™s good to have secondary indexed bin to query and get filtered data.