Hello everyone, my company started using Aerospike with Go a month or two ago. Things were super fast, but we started running out of memory, so we had to expand into a cluster (currently 4 nodes).
Long story short: we were super fast and it was great, but now things are painfully slow. It takes about 5 minutes to query about 1.5k records.
We currently have about 10 sets, with object counts ranging from 30k to 170 million depending on the set.
The query is just your basic:
SELECT key, bin FROM ns.set WHERE bin = 0
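And of course this is in code form:
// Abstracted and ugly/just test code, not what production looks like haha
// Assumes imports "fmt", "time" and as "github.com/aerospike/aerospike-client-go",
// plus an already connected package-level aeroClient.
func someFunc() {
	total := 0
	set := "someset"
	bin := "meta"
	key := "id"
	t := time.Now()

	// Namespace "pexeso", then the set, then the bins to return.
	stmt := as.NewStatement("pexeso", set, bin, key)
	// Only match records where the meta bin equals 0.
	stmt.Addfilter(as.NewEqualFilter(bin, 0))

	rs, err := aeroClient.Query(nil, stmt)
	if err != nil {
		fmt.Println(err)
		return // rs may be nil after a failed query, so do not range over it
	}
	defer rs.Close()

	// Count every record that comes back; log and skip per-record errors.
	for res := range rs.Results() {
		if res.Err != nil {
			fmt.Println(res.Err)
			continue
		}
		total++
	}
	fmt.Println("Minutes:", time.Since(t).Minutes())
	fmt.Printf("db loaded: %d\n", total)
}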
I have also been investigating QueryPolicies. The only thing that helps is setting a Timeout of a few seconds in the BasePolicy. The issue with this, though, is that we end up loading 0-200 records when we really need 10k.
Any ideas on how we can speed this up, or are there any glaring flaws in the way I approached Aerospike?
Thank you all for reading, and I look forward to your responses.
Which version of AS are you running?
Are all the nodes the same hardware?
How many network hops are involved between the servers in the cluster, and between the app and AS?
Is this latency reflected in the histogram?
Execute show latency inside asadm while the code is running, and let us know the output, please.
OK, the query histogram is tracking the latency. This is good, because it means we should be able to track it down! I’ve never seen latency like that, so I would first say that this is not normal and we should be able to fix it.
So just to confirm: the nodes inside the cluster are at most 1ms apart AND the application calling the cluster is only 1ms away?
Can you send a few snapshots of that latency while the query is running? What’s interesting right off the bat is that 1 node seems to be slower than the others…
Though the Cloud Qualification project doesn’t yet support GCP, the same principles apply. Basically, we have seen significant performance differences among instances of the same class with several cloud providers. The project gives the user more control over this variability.