I’m currently getting quite a few timeout exceptions, even though the timeout is set to 5 seconds and we’re simply fetching a single key. Admittedly we have around 20k connections, but I would assume Aerospike could handle this without a problem? Here’s a snippet of the errors:
Thu, 16 Feb 2017 16:30:20 +0000 FATAL Error while getting data from Aerospike server: Key [{"ns":"adspruce","set":"test_sites","key":"3239-1"}] : Error - [9]: Timeout: timeout=5000 iterations=1 failedNodes=0 failedConns=0
Thu, 16 Feb 2017 16:30:23 +0000 FATAL Error while getting data from Aerospike server: Key [{"ns":"adspruce","set":"test_sites","key":"3239-1"}] : Error - [9]: Timeout: timeout=5000 iterations=1 failedNodes=0 failedConns=0
Thu, 16 Feb 2017 16:31:30 +0000 FATAL Error while getting data from Aerospike server: Key [{"ns":"adspruce","set":"test_sites","key":"3086-2"}] : Error - [9]: Timeout: timeout=5000 iterations=1 failedNodes=0 failedConns=0
Thu, 16 Feb 2017 16:32:21 +0000 FATAL Error while getting data from Aerospike server: Key [{"ns":"adspruce","set":"test_sites","key":"3239-1"}] : Error - [9]: Timeout: timeout=5000 iterations=1 failedNodes=0 failedConns=0
Thu, 16 Feb 2017 16:33:05 +0000 FATAL Error while getting data from Aerospike server: Key [{"ns":"adspruce","set":"test_sites","key":"3239-1"}] : Error - [9]: Timeout: timeout=5000 iterations=1 failedNodes=0 failedConns=0
Thu, 16 Feb 2017 16:33:34 +0000 FATAL Error while getting data from Aerospike server: Key [{"ns":"adspruce","set":"test_sites","key":"3239-1"}] : Error - [9]: Timeout: timeout=5000 iterations=1 failedNodes=0 failedConns=0
Thu, 16 Feb 2017 16:35:37 +0000 FATAL Error while getting data from Aerospike server: Key [{"ns":"adspruce","set":"test_sites","key":"3088-2"}] : Error - [9]: Timeout: timeout=5000 iterations=1 failedNodes=0 failedConns=0
Thu, 16 Feb 2017 16:35:37 +0000 FATAL Error while getting data from Aerospike server: Key [{"ns":"temp01","set":"device_data","key":"7e803f93ee144b46f70c54bb9a00664a35604476"}] : Error - [9]: Timeout: timeout=5000 iterations=1 failedNodes=0 failedConns=0
Thu, 16 Feb 2017 16:35:58 +0000 FATAL Error while getting data from Aerospike server: Key [{"ns":"adspruce","set":"test_sites","key":"3086-2"}] : Error - [9]: Timeout: timeout=5000 iterations=1 failedNodes=0 failedConns=0
Thu, 16 Feb 2017 16:36:31 +0000 FATAL Error while getting data from Aerospike server: Key [{"ns":"adspruce","set":"test_sites","key":"3088-2"}] : Error - [9]: Timeout: timeout=5000 iterations=1 failedNodes=0 failedConns=0
Thu, 16 Feb 2017 16:39:01 +0000 FATAL Error while getting data from Aerospike server: Key [{"ns":"adspruce","set":"test_sites","key":"2915-1"}] : Error - [9]: Timeout: timeout=5000 iterations=1 failedNodes=0 failedConns=0
Any idea what could be causing these timeouts? Is there anything I can change in the config, or any benchmarks I can run, to tell whether it’s client-related, or should I just increase the timeout? Five seconds already seems like long enough.
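For context, the read path is essentially just a single get with a 5000 ms timeout. A minimal sketch (the host address and the error-handling branch are illustrative, not our exact code):

```php
<?php
// Minimal sketch of the read path (host and error handling are illustrative).
$config = ["hosts" => [["addr" => "10.0.0.1", "port" => 3000]], "shm" => []];

// Persistent connection under FPM; 5000 ms read timeout, matching the errors above.
$client = new Aerospike($config, true, [Aerospike::OPT_READ_TIMEOUT => 5000]);

$key = $client->initKey("adspruce", "test_sites", "3239-1");
$status = $client->get($key, $record);

if ($status === Aerospike::OK) {
    // use $record["bins"]
} elseif ($status === Aerospike::ERR_TIMEOUT) {
    // this is the branch producing the FATAL log lines above
    error_log("Error while getting data from Aerospike server: " . $client->error());
}
```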
What’s going on at the server at the same time? Try to correlate the server log for the same time period (and/or post it here). It’s rarely a limitation of Aerospike itself; it usually comes down to config or the actual hardware. For example, are you seeing ‘queue too deep’ warnings on the server side?
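Something along these lines, assuming the default log location:

```sh
# look for transaction queue pressure around the timestamps of the client timeouts
grep -i 'queue too deep' /var/log/aerospike/aerospike.log

# server-side read latency histogram over the same window (ships with the tools package)
asloglatency -h reads -l /var/log/aerospike/aerospike.log
```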
What is your current connection count vs proto-fd-max?
Also, you say you’re just trying to get one key; are you able to connect at all (telnet xyz 3000)?
As robert said, an aerospike.log covering the same timeframes would be helpful here… (and asadm statistics/config output wouldn’t hurt either)
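For example, from any cluster node (standard tools that ship with the server; exact stat names can vary slightly by version):

```sh
# current client connection count vs the configured file-descriptor limit
asinfo -v 'statistics' -l | grep client_connections
asinfo -v 'get-config:context=service' -l | grep proto-fd-max

# fuller dumps to post here
asadm -e "show statistics"
asadm -e "show config"
```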
Sorry @rbotzer - only just getting back to this now after being on leave…
I had to fix log rotation, as it wasn’t working (CentOS 7), but so far I can’t see anything in the logs like what you’ve mentioned. There was an issue with shm not working in the FPM .ini config, but I’ve resolved that now; however, we still see quite a few timeouts, as shown above.
We have 2 x R430 servers running the cluster, with 220 GB of RAM; not sure on core counts etc. at the minute.
To answer @Albot’s question: we were running with proto-fd-max at 15k, but we hit and surpassed that, so it’s now at 50k (config snippet below). Our FPM config runs 500 children handling 10k connections each, and we have 10 front-end app servers. When we hit the connection limit I was unable to telnet in, but when we see the timeout issues I can telnet/ssh in fine and the servers are running fine.
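For reference, the only server-side change was raising proto-fd-max in the service context of aerospike.conf (a sketch; our other service settings are omitted):

```
service {
    # was 15000; raised after we hit the connection ceiling
    proto-fd-max 50000
}
```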
One other issue we’re seeing: according to checkMK, Aerospike seems to be holding on to TCP connections, but in the ESTABLISHED state, not a WAIT state like others have seen. I’m not sure whether this is normal behaviour or a symptom of something else.
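For what it’s worth, this is roughly how I’m counting them on a node (assuming the default service port 3000):

```sh
# count TCP connections on the Aerospike service port, broken down by state
ss -tan 'sport = :3000' | awk 'NR > 1 {print $1}' | sort | uniq -c
```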
If you need more data / config / asadm outputs, let me know.
Yes, we are doing a lot of secondary index queries; we basically look up a number of records based on an indexed boolean field being true. But we’re not talking millions of records, mainly hundreds or a few thousand; nothing I would have thought was very taxing on the system. The queries look roughly like the sketch below.
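A minimal sketch of the shape of those queries (the bin name and callback body are illustrative; the “boolean” is stored as a 0/1 integer, since that’s what the numeric secondary index is built on):

```php
<?php
// Sketch of a typical secondary index query (bin name "active" is illustrative).
$where = Aerospike::predicateEquals("active", 1);

$status = $client->query("adspruce", "test_sites", $where, function ($record) {
    // each matching record streams through here; usually hundreds, not millions
});

if ($status !== Aerospike::OK) {
    error_log("Query failed [{$client->errorno()}]: " . $client->error());
}
```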
At the time of writing, we’re doing somewhere between 1k and 1.5k TPS in the query field according to AMC. Also, checking the connections, it seems as if they are opening and closing, but not all of them: there is a definite, slow increase in the ‘baseline’, which makes me think something is not closing them properly, possibly an issue with the client.
I have just restarted the entire cluster, and the connection counts on both nodes are now right back down.
This is a very quiet period for us, so that is to be expected. I will keep an eye on it now and see how much it increases and how quickly; it simply seems as if some connections are not being released, and these are multiplying over time.
Okay, just checked now and connections are up to around 1k on both nodes, with throughput and traffic still the same as they were 12 hours ago. The count just seems to be slowly and steadily increasing.
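I’m watching it with a simple loop on each node (a sketch using asinfo):

```sh
# append a timestamped client_connections reading every minute
while true; do
    echo "$(date -u '+%F %T') $(asinfo -v 'statistics' -l | grep '^client_connections')" >> conn_growth.log
    sleep 60
done
```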
I’m wondering if some of the fixes in 3.4.14 have not worked. From that release’s changelog:

- Fix persistent connection issue under php-fpm which caused number of threads to increase
- Fix shared memory persistence under php-fpm
There was a recent bug reported for secondary index pipes not being closed correctly; it shows similar patterns to what I’m seeing. I think they are releasing a fix in the next version.
Basically, I restarted at around 8pm on the 2nd of March, and the connections have been steadily increasing ever since, over and above any peak periods.
This would make sense in terms of your bug report, as around 50% of the queries we are running are secondary index queries. Any ETA on the release of the fix?