Firstly, I’m not sure where the best place to post this is, so I’ve put it here because it’s connected to the PHP client library (in that it involves something running PHP), but at this stage I don’t know whether that is the cause or just where the effect shows up. I’m tagging @rbotzer and @kporter because I think this may need everyone’s skills, ha ha.
Okay, so the back story: we’ve been running Aerospike CE successfully for a couple of years and everything has been hunky dory. About 3 weeks ago, however, we updated our configs to make the memory larger and the disk space smaller, as our use case had changed slightly. About 10 minutes after we restarted Aerospike, we ran into a problem with a large number of connections and apps struggling with timeouts. Thinking this was related to the updates, we reverted the configs and restarted, but the same thing happened, and we noticed strange “sawtooth” patterns in New Relic for the apps accessing Aerospike. Connections would steady out at around 800 per node, then every 10 minutes increase to over 5k before slowly coming back down to ~800 again. We brought the cluster down, updated the configs for the desired end result (more RAM, less disk usage), zeroed out the drives (as we had the data) and restarted the cluster. The same issue still occurred.
As an aside, we have a CRON job which takes data from MySQL, formats it and syncs it into Aerospike. It runs every 10 minutes, so we naturally thought there was a connection. We stopped the CRON job, but this only re-created the issue with the rising connections, so instead we reduced its interval to 5 minutes, and this seemed to alleviate the connection problem. We still see the “sawtooth” pattern in New Relic in the response time from Aerospike, but at least the service has remained stable.
This brings me on to the main issue we are having now. If we stop the CRON job that runs the sync to Aerospike, then at almost exactly the 10 minute mark the number of connections increases massively (up to ~5k) and the whole system grinds to a halt: the apps time out, connections can’t be made, and the whole platform eventually dies. If we start the CRON job again and run a sync, the connections slowly drop and everything carries on as normal, which is why this is the strangest thing I’ve ever seen. I’m basically asking for your help in debugging how turning OFF a CRON job that accesses Aerospike actually breaks Aerospike, in almost exact 10 minute cycles. I understand this is very much an “I have no idea without looking” kind of scenario, but it’s the “looking” bit we need help with.
I’ve upped the log level and checked the logs, and I can’t see anything obvious: no errors or anything. Does Aerospike do anything in 10 minute cycles, like GC, which could cause this (and which is reset by something accessing Aerospike)? We’ve now updated to the latest version of both the client (3.4.14) and the server (188.8.131.52) and still see the same behaviour. The CRON job accesses Aerospike via PHP-CLI and the app servers are running PHP-FPM.
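In case it helps with the “looking” part, here is a minimal sketch of how the per-node connection counts could be sampled and lined up against the CRON schedule. It is only a sketch under assumptions: it assumes `asinfo -h <host> -v statistics` returns semicolon-separated `key=value` pairs and that the connection count is reported under the stat name `client_connections`. The helper names (`parse_statistics`, `connection_count`, `sample_node`, `log_samples`) are made up for illustration, and the host names would need replacing with real node addresses.

```python
import subprocess
import time


def parse_statistics(raw: str) -> dict:
    """Split 'key=value;key=value' text into a dict of strings."""
    stats = {}
    for pair in raw.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            stats[key] = value
    return stats


def connection_count(raw: str, stat_name: str = "client_connections") -> int:
    """Return the connection stat as an int, or 0 if it is missing.

    The default stat name is an assumption about the server's output.
    """
    return int(parse_statistics(raw).get(stat_name, 0))


def sample_node(host: str) -> int:
    """Ask one node for its statistics via the asinfo CLI (assumed installed)."""
    result = subprocess.run(
        ["asinfo", "-h", host, "-v", "statistics"],
        capture_output=True,
        text=True,
        check=True,
    )
    return connection_count(result.stdout)


def log_samples(hosts, interval_seconds: int = 10, samples: int = 6) -> None:
    """Print a timestamped connection count per node at a fixed interval."""
    for _ in range(samples):
        stamp = time.strftime("%H:%M:%S")
        for host in hosts:
            print(stamp, host, sample_node(host))
        time.sleep(interval_seconds)
```

Running something like `log_samples(["node1", "node2"], interval_seconds=10, samples=120)` across a window with the CRON job stopped would give timestamps to compare against the 10 minute mark and against the server logs.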
I really hope you guys can help me track down possible errors or symptoms. I really appreciate the support.