Query by secondary index performance

#1

I’m benchmarking Aerospike on a production-like distributed database with 500 million records. I use the Java async client and hundreds of thousands of requests to load the cluster. Overall, the cluster sustains about 5 times fewer queries by secondary index than queries by PK.

I cannot deduce such a dramatic difference from the index documentation, and the benchmark results suggest that Aerospike’s secondary index functionality may not be usable in production for our case.

Is such behaviour expected for Aerospike?

Thanks


#2

My guess would be that your results come from the underlying network activity. A secondary index is held on every node, and each node indexes only its own records. A query against it therefore has to contact every node (correct me if I’m wrong). Querying by PK is cheaper, because the smart client knows which single node to contact for that record thanks to the underlying DHT concept.
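The routing difference described above can be sketched as a toy model. This is not Aerospike client code: the node names, the round-robin `owner_of` partition map, and the use of SHA-1 as a stand-in digest are all assumptions made for illustration. The sketch only shows that a PK read is directed to one node, while a secondary index query has to be sent to all of them.

```python
import hashlib

N_PARTITIONS = 4096  # Aerospike divides each namespace into 4096 partitions
NODES = ["node%d" % i for i in range(7)]  # hypothetical 7-node cluster


def partition_of(key):
    """Map a key to a partition id (SHA-1 is a stand-in for the real digest)."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:2], "little") % N_PARTITIONS


def owner_of(partition):
    """Toy partition map: spread partitions round-robin over the nodes."""
    return NODES[partition % len(NODES)]


def nodes_for_pk_read(key):
    """A PK read goes straight to the single node owning the key's partition."""
    return [owner_of(partition_of(key))]


def nodes_for_sindex_query():
    """A secondary index query is scatter-gather: every node is asked."""
    return list(NODES)


if __name__ == "__main__":
    print(len(nodes_for_pk_read("user:42")))  # 1
    print(len(nodes_for_sindex_query()))      # 7
```

In this model, the number of nodes touched per secondary index query grows with the cluster size, while a PK read always touches exactly one node.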

It’s just a guess, but I’ve seen nearly all high-performance NoSQL DBs end up network-bottlenecked. It would be interesting to know what exactly is to blame for your results. Have you tried increasing the number of client nodes? What’s the general setup (hardware- and software-wise)?

Another approach you could try is to create your own LDTs in an in-memory namespace, for example a list of all clicks related to a single advertiser or whatever you are modeling, and merge the results on the client side. (I dislike single, big indices when queries only ever target a certain subrange, such as a given customer ID.) I’m sure there is a way to model your business case with AS. I’ve seen that they offer a free consulting talk for this, so maybe you can talk to the people who implemented the secondary index feature. Just out of interest: what kind of ops/sec did you achieve with secondary indices?

Cheers, Manuel


#3

Folks,

A secondary index is collocated with the data on each node. A query uses a scatter-gather approach: it goes to all the nodes. Here are a few things to keep in mind when looking at secondary index performance.

  1. The request goes to each node, and the client has to wait for every node to come back with its result. Due to this queueing effect, the general latency of a secondary index query will be a little higher than that of a single get or put. Did you check on the server whether the system is fully utilized? If not, you may need more parallel clients.

  2. What does your configuration look like? The query subsystem supports all kinds of queries (low cardinality, high cardinality, long running, short running, slow client), so there are a number of tuning parameters that may need tweaking to achieve the best performance.

    asinfo -v 'get-config:' -l | grep query

  3. If you are using this as a unique secondary index, that is, the cardinality of the secondary index key is as high as that of the primary key and only 1 row is retrieved, then visiting the other n-1 nodes in the cluster is wasted work. You may want to pick the alternative model of storing both A -> B and B -> A, so that you can look up by either A or B. With this model, an insert, update, or query that would otherwise use the sindex takes two round trips. It could be (2 + delta)x slower, but it won’t be 5x as observed.

  4. Of course, LDT is an option if what you are really doing is looking up data by primary key and you just want to filter the data while doing so. An example would be click tracking: you may track clicks over a long period of time but need only those from the last day or so. This is still a PK lookup, though.
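The A -> B / B -> A model from point 3 can be sketched like this. `ToyKVStore` is a hypothetical in-memory stand-in for a store that offers only PK get and put; it is not the Aerospike API, and the set names `by_a` and `b_to_a` are made up for the example. It only illustrates that a lookup by B becomes two PK round trips instead of a fan-out to every node.

```python
class ToyKVStore:
    """In-memory stand-in for a key-value store with only PK get/put."""

    def __init__(self):
        self._sets = {}
        self.round_trips = 0  # counts simulated client-server calls

    def put(self, set_name, key, value):
        self.round_trips += 1
        self._sets.setdefault(set_name, {})[key] = value

    def get(self, set_name, key):
        self.round_trips += 1
        return self._sets.get(set_name, {}).get(key)


def insert_pair(store, a, b, record):
    """Write the record under A plus a reverse pointer B -> A (two round trips)."""
    store.put("by_a", a, dict(record, b=b))
    store.put("b_to_a", b, a)


def lookup_by_b(store, b):
    """Resolve B -> A, then fetch the record by its primary key A."""
    a = store.get("b_to_a", b)
    if a is None:
        return None
    return store.get("by_a", a)


if __name__ == "__main__":
    store = ToyKVStore()
    insert_pair(store, "user:1", "alice@example.com", {"name": "Alice"})
    store.round_trips = 0
    print(lookup_by_b(store, "alice@example.com"))
    print(store.round_trips)  # 2
```

Note that the two writes in `insert_pair` are separate operations in this sketch, so an application using this model has to decide how to handle a failure between them.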

HTH – R


#4

Hey Manuel, the achieved performance was about 40k reqs/second for secondary indexes.


#5
  1. I think the system was fully utilized. I got exceptions from AS while adding additional clients.

  2. The configuration is probably close to the defaults, with a replication factor of 2 and 7 machines in the cluster.

  3. This is the approach we are considering as an alternative.

  4. LDT will probably not work for us.
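As a rough sanity check, the numbers from this thread (about 40k secondary index reqs/second, roughly 5x fewer than PK, 7 machines) can be compared in terms of node visits per second. This is back-of-the-envelope arithmetic only: it assumes every secondary index query visits all 7 nodes and every PK request visits exactly one, and it ignores replication, result sizes, and per-request cost differences.

```python
def node_level_load(requests_per_sec, nodes_touched_per_request):
    """Node visits per second across the whole cluster."""
    return requests_per_sec * nodes_touched_per_request


SINDEX_QPS = 40_000       # reported in this thread
PK_TO_SINDEX_RATIO = 5    # "5 times less" from the original post
NODES = 7                 # cluster size reported in this thread

pk_qps = SINDEX_QPS * PK_TO_SINDEX_RATIO            # implied PK throughput
sindex_node_load = node_level_load(SINDEX_QPS, NODES)  # fan-out to all nodes
pk_node_load = node_level_load(pk_qps, 1)               # assumption: one node per PK request

print(pk_qps)            # 200000
print(sindex_node_load)  # 280000
print(pk_node_load)      # 200000
```

Under these assumptions the cluster serves a comparable number of node visits per second in both cases, which is consistent with the scatter-gather explanation but does not prove it.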

Thanks