Extremely slow query times, optimization tips appreciated


#1

Hello everyone, my company started using Aerospike with Go a month or two ago. Things were super fast, but we started running out of memory, so we had to expand into a cluster (currently 4 nodes).

Long story short: we were super fast and it was great; now things are painfully slow. It takes about 5 minutes to query about 1.5k records.

We currently have about 10 sets, with object counts ranging from 30k to 170 million depending on the set.

The query is just your basic

SELECT key, bin FROM ns.set WHERE bin = 0

And of course this is in code form

// Abstracted and ugly/just test code not what production looks like haha
func someFunc() {
	var err error

	total := 0
	set := "someset"
	bin := "meta"
	key := "id"

	t := time.Now()

	stmt := as.NewStatement("pexeso", set, bin, key)
	stmt.Addfilter(as.NewEqualFilter(bin, 0))

	rs, err := aeroClient.Query(nil, stmt)

	if err != nil {
		// rs is nil when the query fails, so stop here instead of ranging over it
		fmt.Println(err)
		return
	}

	for res := range rs.Results() {
		if res.Err != nil {
			fmt.Println(res.Err)
			continue
		}
		total++
	}

	rs.Close()

	fmt.Println("Minutes:", time.Since(t).Minutes())
	fmt.Printf("db loaded: %d\n", total)
}

I have also been investigating QueryPolicies; the only thing that helps is setting a Timeout of a few seconds in the BasePolicy. The issue with this, though, is that we end up loading 0-200 records when we really need 10k.

Any idea on how we can speed this up or any glaring flaws in the way I approached Aerospike?

Thank you all for reading; I look forward to your responses.


#2

Which version of AS are you running? Are all the nodes the same hardware? How many network hops are involved between servers in the cluster, and between the app and AS? Is this latency reflected in the histogram? Run show latency inside asadm while the code is running, and let us know the output, please.


#3

We are on the following versions: 3.11.1.1 [2], 3.12.0 [2].

All the nodes SHOULD be the same or similar hardware, our entire stack runs via Google Compute Engine.

There is only a single hop that takes just over 1ms. I don’t think there is any issue with the network side of things.

But here is the output of the show latency command.

Admin> show latency
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~query Latency~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                       Node                 Time   Ops/Sec    >1Ms    >8Ms   >64Ms
                          .                 Span         .       .       .       .
aerospike-001:3000   20:42:14->20:42:24       0.6   100.0   100.0   100.0
aerospike-002:3000   20:42:18->20:42:28       3.8   44.74   36.84   23.68
aerospike-003:3000   20:42:17->20:42:27       6.1    62.3    62.3    62.3
aerospike-004:3000   20:42:17->20:42:27       7.1   69.01   42.25   42.25
Number of rows: 4

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~read Latency~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                       Node                 Time   Ops/Sec   >1Ms   >8Ms   >64Ms
                          .                 Span         .      .      .       .
aerospike-001:3000   20:42:14->20:42:24       0.1    0.0    0.0     0.0
aerospike-002:3000   20:42:18->20:42:28       4.4    0.0    0.0     0.0
aerospike-003:3000   20:42:17->20:42:27       8.1    0.0    0.0     0.0
aerospike-004:3000   20:42:17->20:42:27       5.5   3.64   1.82    1.82
Number of rows: 4

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~write Latency~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                       Node                 Time   Ops/Sec   >1Ms    >8Ms   >64Ms
                          .                 Span         .      .       .       .
aerospike-001:3000   20:42:14->20:42:24    1002.7   8.99    0.47    0.39
aerospike-002:3000   20:42:18->20:42:28     728.8   0.48     0.4    0.33
aerospike-003:3000   20:42:17->20:42:27     719.8   1.17    0.53     0.5
aerospike-004:3000   20:42:17->20:42:27     872.9   38.1   36.61   35.14
Number of rows: 4

At this point we might redesign around Postgres, but I’m still curious why this wasn’t scaling like it should.


#4

OK, the query histogram is tracking the latency. This is good because it means we should be able to track it down! I’ve never seen latency like that, so I would first say that this is not normal and we should be able to fix it.

So just to confirm: the nodes inside the cluster are at most 1ms apart AND the application calling the cluster is only 1ms away?

Can you send a few snapshots of that latency while the query is running? What’s interesting right off the bat is that one node seems slower than the others…
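One way to make the per-node comparison concrete is to flag any node whose slow-operation percentage is far above the cluster median. The sketch below is only an illustration: the `nodeLatency` type, the `median` and `outliers` helpers, and the factor of 3 are assumptions, and the numbers are the write-latency >64Ms column copied from the show latency output in post #3.

```go
package main

import (
	"fmt"
	"sort"
)

// nodeLatency holds one histogram row: the percentage of operations
// on a node that took longer than 64 ms.
type nodeLatency struct {
	Node      string
	PctOver64 float64
}

// median returns the median of vals without modifying the input.
// It returns 0 for an empty slice.
func median(vals []float64) float64 {
	if len(vals) == 0 {
		return 0
	}
	s := append([]float64(nil), vals...)
	sort.Float64s(s)
	mid := len(s) / 2
	if len(s)%2 == 1 {
		return s[mid]
	}
	return (s[mid-1] + s[mid]) / 2
}

// outliers returns the nodes whose percentage exceeds factor times
// the cluster median. The factor is an arbitrary threshold.
func outliers(rows []nodeLatency, factor float64) []string {
	vals := make([]float64, len(rows))
	for i, r := range rows {
		vals[i] = r.PctOver64
	}
	limit := factor * median(vals)
	var out []string
	for _, r := range rows {
		if r.PctOver64 > limit {
			out = append(out, r.Node)
		}
	}
	return out
}

func main() {
	// Write latency, >64Ms column, from the output posted above.
	write := []nodeLatency{
		{"aerospike-001:3000", 0.39},
		{"aerospike-002:3000", 0.33},
		{"aerospike-003:3000", 0.5},
		{"aerospike-004:3000", 35.14},
	}
	fmt.Println(outliers(write, 3)) // prints [aerospike-004:3000]
}
```

Running the same check over several snapshots would show whether the same node stands out consistently or whether the slow node moves around.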

More details and tuning queries: http://www.aerospike.com/docs/operations/manage/queries

Monitoring and investigating latencies: http://www.aerospike.com/docs/operations/monitor/latency


#5

Please also check out Aerospike’s Cloud Qualification; a summary can be found here.

Though it doesn’t yet support GCP, the same principles will apply. Basically we have seen significant performance differences among instances of the same class with several cloud providers. The Cloud Qualification project allows the user to gain more control over this variability.