Slower than expected performance on aggregations


#1

Hi

I am testing an aerospike cluster with real data (25M recrods) and am getting unexpectedly slow performance for aggregations. The system is running on AWS as a cluster of 2 machines both r3XXLarge instance types in the same zone. I have 1 nic on each machine and have run the afterburner script as well as setting RPS for each machine. I have set the query-threads to 8 (as that’s the core count). I have a secondary index which I am performing the aggregation against as follows: function count(s) function mapper(rec) return 1 end local function reducer(v1, v2) return v1 + v2 end return s: map(mapper) : reduce(reducer) end

The results from aql are: aggregate analytics_getdatafordate.count() on testdata.pk where apikeyid = 7

±---------+ | count | ±---------+ | 12505912 | ±---------+ 1 row in set (25.152 secs)

The secondary key is a numeric. This is effectively the same as running a ‘count(*) from’ query on an SQL database. On a development machine with the SQL database this data comes from the query response time is 2 seconds. Below is the result of the get-config:

transaction-queues=8 transaction-threads-per-queue=3 transaction-duplicate-threads=0 transaction-pending-limit=20 migrate-threads=1 migrate-xmit-priority=40 migrate-xmit-sleep=500 migrate-read-priority=10 migrate-read-sleep=500 migrate-xmit-hwm=10 migrate-xmit-lwm=5 migrate-max-num-incoming=256 migrate-rx-lifetime-ms=60000 proto-fd-max=15000 proto-fd-idle-ms=60000 proto-slow-netio-sleep-ms=1 transaction-retry-ms=1000 transaction-max-ms=1000 transaction-repeatable-read=false dump-message-above-size=134217728 ticker-interval=10 microbenchmarks=false storage-benchmarks=false ldt-benchmarks=false scan-priority=200 scan-sleep=1 batch-threads=4 batch-max-requests=5000 batch-priority=200 nsup-delete-sleep=100 nsup-period=120 nsup-startup-evict=true paxos-retransmit-period=5 paxos-single-replica-limit=1 paxos-max-cluster-size=32 paxos-protocol=v3 paxos-recovery-policy=manual write-duplicate-resolution-disable=false respond-client-on-master-completion=false replication-fire-and-forget=false info-threads=16 allow-inline-transactions=true use-queue-per-device=false snub-nodes=false fb-health-msg-per-burst=0 fb-health-msg-timeout=200 fb-health-good-pct=50 fb-health-bad-pct=0 auto-dun=false auto-undun=false prole-extra-ttl=0 max-msgs-per-type=-1 service-threads=8 fabric-workers=16 pidfile=/var/run/aerospike/asd.pid memory-accounting=false udf-runtime-gmax-memory=18446744073709551615 udf-runtime-max-memory=18446744073709551615 sindex-populator-scan-priority=3 sindex-data-max-memory=18446744073709551615 query-threads=8 query-worker-threads=15 query-priority=10 query-in-transaction-thread=0 query-req-in-query-thread=0 query-req-max-inflight=100 query-bufpool-size=256 query-batch-size=100 query-sleep=1 query-job-tracking=false query-short-q-max-size=500 query-long-q-max-size=500 query-rec-count-bound=4294967295 query-threshold=10 query-untracked-time=1000000 service-address=0.0.0.0 service-port=3000 mesh-seed-address-port=10.0.101.126:3002 reuse-address=true fabric-port=3001 network-info-port=3003 enable-fastpath=true heartbeat-mode=mesh heartbeat-protocol=v2 heartbeat-address=10.0.101.216 heartbeat-port=3002 heartbeat-interval=150 heartbeat-timeout=10 enable-security=false privilege-refresh-period=300 report-authentication-sinks=0 report-data-op-sinks=0 report-sys-admin-sinks=0 report-user-admin-sinks=0 report-violation-sinks=0 syslog-local=-1 enable-xdr=false forward-xdr-writes=false xdr-delete-shipping-enabled=true xdr-nsup-deletes-enabled=false stop-writes-noxdr=false reads-hist-track-back=1800 reads-hist-track-slice=10 reads-hist-track-thresholds=1,8,64 writes_master-hist-track-back=1800 writes_master-hist-track-slice=10 writes_master-hist-track-thresholds=1,8,64 proxy-hist-track-back=1800 proxy-hist-track-slice=10 proxy-hist-track-thresholds=1,8,64 writes_reply-hist-track-back=1800 writes_reply-hist-track-slice=10 writes_reply-hist-track-thresholds=1,8,64 udf-hist-track-back=1800 udf-hist-track-slice=10 udf-hist-track-thresholds=1,8,64 query-hist-track-back=1800 query-hist-track-slice=10 query-hist-track-thresholds=1,8,64 query_rec_count-hist-track-back=1800 query_rec_count-hist-track-slice=10 query_rec_count-hist-track-thresholds=1,8,64

Thanks

Craig.


#2

Craig,

Did you mean r3.2xlarge ?

Looking at the count of around 12.5 million. The system has processed around 12.5 million records in 25 seconds on two of these boxes. That is around 250ktps perf node.

What is your expectation of processing time here ? What does TOP (all cores ) look like when you run this ?

– R


#3

Hi Raj,

Yes r3.2xlarge is what we have run up. I was expecting something similar to SQL Server performance. I’m trying to figure out if I just have the configuration wrong, or if it won’t work for our use case. I’ve gotten pretty good responses for small datasets (< 10K), so it seems to find records quickly, although still not as fast as SQL server, however, once the result sets get a bit larger the performance degrades quickly. I thought this may be a lua issue, but I am really unsure. I’ll find some time later this week and get the TOP for you.

Thanks

Craig.


#4

Craig,

This IMO is not apple to apple comparison. Count(*) is natively supported operation in SQL Server and other such RDBMS, for which response is returned using stats, it DOES NOT walk through entire data set.

The way you are performing count in Aerospike is really walking through entire data set (I know that is only way given count operation is not supported natively)…which is reasonably heavy weight op.

The valid comparison would be when you perform a little sophisticated aggregation function which will need SQL server as well to read the data …

– R


#5

Hi Raj

Ok, fair enough. I started by performing a group by in SQL and an aggregate function in Aerospike, the performance was unexpected. Grouping in SQL server 44 seconds the entire dataset 25M records grouped and ordered by dates and categories with the plan confirming lookups on the records for data not present in the keys with clean buffers. 75 seconds from Aerospike only grouped by category. I used the flights example as a base for the aggregate function. I also used an int as the map key to reduce lua allocs for strings and the secondary index was on the category (int). There are 16 categories. I performed no date calculations in the lua. That’s when I thought I’d move to a simple count function in Aerospike to see what I’d get there in case it was just a configuration issue. Would it help if I turned on microbenchmarks?

Here are my lua funcs:

local function mapper(itemMap, rec)
  local currentKey = rec.apikeyid
  local k = itemMap[currentKey]
        if k == null then
            k = map {calls = 0, apikey = rec.apikey}
			itemMap[currentKey] = k
          end
  k.calls = k.calls + 1
  return itemMap
end

local function merge_calls(a, b)
  a.calls = a.calls + b.calls
  return a
end

local function reducer(a, b)
  return map.merge(a, b, merge_calls)
end

function test_stream(stream)
  return stream : aggregate(map(), mapper) : reduce(reducer)
end

I guess I’m trying to validate if I am using the right techniques for retrieving data.

Thanks

Craig.


#6

In a ‘real’ implementation there would be plenty of pre-grouping at the insert stage etc, just running some cursory tests first before delving into a full POC.


#7

Craig,

What is your namespace configuration (data in memory / data on disk) ?

What is the time it takes to simply fetch the result (no aggregation) … just do select * from ns.set where sindex_bin between a and b ?

Output of

asinfo -v 'get-config:'

from both the nodes.

– R