We have been running performance/load tests on Aerospike geospatial queries, and we are not very satisfied with the results. For GeoContainingPoint queries we saw an average of about 100 TPS, and it varies a lot (20 to 2K) depending on the input point.
The test dataset is from OpenStreetMap and has roughly 173K polygons (~1.7G). We then perform GeoContainingPoint queries using the Go client library and capture the results from its benchmark tool (slightly modified):
The performance varies a great deal depending on the input geopoint. For densely populated areas (e.g. LA/SF), performance degrades to around 20 TPS, whereas for uninhabited areas we saw almost 2K TPS. Is this expected?
CPU utilization on the cluster is very low (<2%), but increasing client parallelism only increases latency, not TPS. Why?
Eric, I agree, this seems really low.
On bare-metal hardware, with the same OSM data set, we typically found 50k QPS; at that point the CPU on the servers would be saturated. 100 to 2k is far lower than we would expect.
Hopefully our eng team will have some time to investigate and give a cogent response, although I know we have some exciting features coming out, so they might not get to the analysis immediately.
Thanks for the quick response. We’re very keen on getting Aerospike into production, but the performance numbers are a blocker right now. Hopefully we can get some pointers from your engineering team soon.
We are still waiting for your help. While waiting, I ran a profiler, and it shows signs of lock contention at as_record_get_live() within query_io() in thr_query.c, which goes down to olock_vlock() in as_index_sprig_get_vlock() in index.c. Changing partition-tree-locks and partition-tree-sprigs did not help. Can anyone take a look?
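To illustrate why changing the number of locks might not help when the contention comes from a few hot records: in a striped-lock scheme each key hashes to exactly one lock, so repeated requests for the same record land on the same lock however many stripes exist. Below is a minimal Go sketch of that general idea only. It is not Aerospike's implementation; `stripeFor`, `distinctStripes`, the FNV hash, and the key names are all assumptions made up for illustration.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// stripeFor maps a key to one of n lock stripes (hypothetical helper,
// not an Aerospike function).
func stripeFor(key string, n int) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() % uint32(n))
}

// distinctStripes counts how many different stripes a workload touches.
func distinctStripes(keys []string, n int) int {
	seen := make(map[int]struct{})
	for _, k := range keys {
		seen[stripeFor(k, n)] = struct{}{}
	}
	return len(seen)
}

func main() {
	// A "hot" workload: every request resolves to the same record.
	hot := []string{"polygon-LA", "polygon-LA", "polygon-LA", "polygon-LA"}
	for _, n := range []int{8, 64, 256} {
		fmt.Printf("stripes=%d: hot workload touches %d stripe(s)\n",
			n, distinctStripes(hot, n))
	}
}
```

If the real contention looks like this, adding stripes would not spread the load, while a more varied set of query points could.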
When you got 50k geo QPS, what were the Aerospike configuration, geo data, and queries?
We ran the benchmark again with a more diverse query set, expecting to spread out the hot keys. This time we got better TPS, but still fewer than half of the vCPU cores are busy. If there is a way to fully utilize the available vCPUs and get better TPS, please let us know. Thanks.
I have a further question. The original question doesn't say whether the client is single-threaded. You want to make sure you’re running multiple client threads and, as Wchu says, that you’re not out of client horsepower or network. Both network and client horsepower scale by adding more nodes.
We were running on a 16-core machine using the benchmark tool shipped with the Go client library (slightly modified to issue geo queries). The results we got were from using 64 goroutines, and we found that increasing concurrency only increases latency, not overall TPS, which is a sign of overloading. But what you suggest makes sense; it could also be due to the network. We will test again tomorrow with more distributed clients and post the results here. Thanks!
We re-ran the test with 100 distributed clients, but the behavior is similar: the total QPS is still under 2k (aggregated across all clients), and server CPU utilization is less than 10%.