High number of execution timed out errors


#1

Hello, I have been testing the golang client and am experiencing the following when I put the application under a stress test:

2016/06/24 08:12:17 command execution timed out: Exceeded number of retries. See Policy.MaxRetries

Aerospike conf:

service {
	user {U}
	group {U}
	paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
	pidfile /var/run/aerospike/asd.pid
	service-threads 6
	transaction-queues 6
	transaction-threads-per-queue 8
	proto-fd-max 15000
	query-in-transaction-thread true
	
}

logging {
	# Log file must be an absolute path.
	file /var/log/aerospike/aerospike.log {
		context any info
	}
}

network {
	service {
		address any
		port 3000
	}

	heartbeat {
		mode multicast
		address 239.1.99.222
		port 9918

		# To use unicast-mesh heartbeats, remove the 3 lines above, and see
		# aerospike_mesh.conf for alternative.

		interval 150
		timeout 10
	}

	fabric {
		port 3001
	}

	info {
		port 3003
	}
}

namespace {NS} {
	replication-factor 2
	memory-size 4G
	default-ttl 30d # 30 days, use 0 to never expire/evict.

	storage-engine memory
}

I am seeing no disk activity (essentially) while running this benchmark, everything runs in-memory. There is little to no networking overhead as I am running it on my local machine (Ubuntu).

I am also doing a single read command, to read a key which does not exist yet.

Client connection:

policy := aero.NewClientPolicy() policy.ConnectionQueueSize = 512 c, err := aero.NewClientWithPolicy(policy, "127.0.0.1", 3000)

Using wrk to send requests:

wrk -t2 -c1000 -d5m localhost:8080

any insight on improving throughput withoutn erros would be appreciated


#2

Hi,

Throughput for a key that doesn’t exist? :slight_smile:

The error you encountered means that you are encountering errors on parsing stage.

Could you provide a small code snippet for me try?

Thanks, K.


#3

I generally wouldn’t be testing a key that didn’t exist - however in this particular application there may be a lot of cache-misses and that was what I was trying to simulate.

I have added some code below, but as its part of a larger code base its kind of chopped together.

package main

import (
	"log"
	"github.com/valyala/fasthttp"
	"github.com/aerospike/aerospike-client-go"
	"errors"
	"runtime"
)

var (
	Client *aerospike.Client
)

func Init(poolSize int) (*aerospike.Client){
	policy := aerospike.NewClientPolicy()
	policy.ConnectionQueueSize = poolSize

	c, err := aerospike.NewClientWithPolicy(policy, "127.0.0.1", 3000)
	if err != nil {
		log.Panic(err.Error())
	}

	return c
}



func get(key string) (string, error) {
	k, _ := aerospike.NewKey("ns", "set", key)
	reply, err := Client.Get(nil, k)
	if err != nil {
		log.Println(err)
	}
	if reply == nil {
		return "", errors.New("Key not found")
	} else {
		return reply.String(), nil
	}
}

func main () {

	runtime.GOMAXPROCS(4)
	Client = Init(256)
	fasthttp.ListenAndServe(":8080", loaderHandler)
}

func loaderHandler(ctx *fasthttp.RequestCtx) {
	get("somekey")
}

#4

The app server and the client are competing for sockets. Since the db client can’t acquire a connection to the server, it times out, an after 3 retries it reports an error.

For example, if you prewarm the server with:

wrk -t2 -c100 -d5s localhost:8080

and then go to -c1000 you will fewer errors, since the db connections are already allocated.

You can also increase the number of file descriptors on your machine.


#5

I had FDs up to 65k and put them up to 650k just to test and it seemingly makes no difference, would you know a way I could better debug this?


#6

Did you try the trick I mentioned? Run wrk with small concurrency value first to warm up the server without pushing the number of FDs (so that the client pools a few connections to server), and then run with large concurrency values.

Seeing this, I may have to consider adding an option to preallocate all connections on initialization of a Client object.


#7

I did try with low and then high, however seemed to experience around the same number of errors (seemingly increasing over time).


#8

I have been trying to tweak various settings / etc but nothing seems to get rid of the error entirely. I will keep investigating solutions from my end.


#9

Sorry for the late reply. Being almost the other side of the globe, I’ll work on it tomorrow and will let you know about the results.

Thanks.


#10

Just in case, could you run the following two commands before your test? (works both in Mac and Linux)

  1. ulimit -S -n 2048
  2. go get -u github.com/aerospike/aerospike-client-go

#11

I tried this, still seems to be coming up at the same rate.

Edit:

The message obviously becomes more abundant with the more calls being made, I haven’t really found anything to quite mitigate or alleviate this as of yet aside from removing the calls entirely.


#12

I released a major update to the client just a few minutes ago. I can’t quite reproduce your issue on my mac, nor on the linux VM I have, so I’ll document what I do, and then we can compare notes:

On both Mac and Linux:

  1. I open a terminal, and I run ulimit -S -n 2048; I then run go run tools/debug/bench_max_retries.go

  2. I then open a second terminal and I run: wrk -t2 -c1000 -d10s http://localhost:8080

This does not produce any errors with Init(256) on any of the systems I test on. I get around >16K op/sec. Keep in mind that you need to use a valid namespace name when making your key.

Let me know if the new update changes anything, and whether you have a different environment.

Make sure you use go get -u ./....

Thanks,


#13

I have updated to your new release, and am getting your new error message - however I will post my information below:

  • OS: Ubuntu 14.04
  • Aerospike Version 3.8.2.3
  • Ulimit -s = unlimited
  • While the application is getting that error it is worth noting I only get about 270 timeouts in ~11s at a rate of 22.2k requests/second

New message is the following:

2016/06/28 14:04:43 command execution timed out: Exceeded number of retries. SeePolicy.MaxRetries. (last error: No available connections to the node. Connection Pool was empty, and limited to certain number of connections.)

I will be trying this (soon) from a couple bare metal servers to see if I get any differences - It would obviously be optimal to be able to get a decent amount of throughput out of a go implementation of aerospike.

I appreciate your help with this thus far!


#14

@Khosrow_Afroozeh what kind of design philosophy should I be taking with these applications? The client does not seem to be able to keep up with the amount of resources it needs to deal with high request volume (at least in parallel)


#15

You shouldn’t do anything, it’s the client’s concern to take care of such issues for you.

I’ve updated the client in the master with a better pool management logic. Please update the client and give it a try and let me know if the issue is resolved.

Thanks.


#16

That has seemingly resolved the errors, now I just need to raise my overall throughput and ill be golden!


#17

Let us know how it goes and if you need any help regarding performance or features.

Cheers,


#18

The struggles I am having is that aerospike (being an external service) is the slowest part of my application, it took what could be 30-50k req/s to less than 10k req/s due to the latency involved with calling the client library… However I will be looking into my implementation to see if I am doing something incorrectly.