I’m seeing a situation where, over a short period of time, the connection count ramps up dramatically and the connections seem to be stuck in CLOSE_WAIT.
Here’s a sample timeline from Graphite, as [connections, timestamp] pairs:
[871.0, 1423801310], [871.0, 1423801312], [871.0, 1423801314], [4094.0, 1423801316], [4094.0, 1423801318], [4094.0, 1423801320], [4094.0, 1423801322], [4094.0, 1423801324], [4094.0, 1423801326], [4094.0, 1423801328], [4094.0, 1423801330], [4094.0, 1423801332], [4094.0, 1423801334], [4094.0, 1423801336], [4094.0, 1423801338], [4094.0, 1423801340], [4094.0, 1423801342], [4094.0, 1423801344], [18460.0, 1423801346]
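To make the size of the jumps easier to see, here is a minimal Python sketch that scans those datapoints for large increases between consecutive samples. The function name `find_jumps` and the `min_increase` threshold are my own choices for illustration, not anything from Graphite or Aerospike.

```python
# Datapoints from the timeline above, as [connections, unix_timestamp],
# sampled every 2 seconds. Built with ranges to avoid retyping each pair.
samples = (
    [[871.0, t] for t in range(1423801310, 1423801316, 2)]
    + [[4094.0, t] for t in range(1423801316, 1423801346, 2)]
    + [[18460.0, 1423801346]]
)


def find_jumps(samples, min_increase=1000):
    """Return (start_ts, end_ts, increase, per_second) for each pair of
    consecutive samples whose connection count grew by min_increase or more.

    min_increase is an arbitrary illustrative threshold.
    """
    jumps = []
    for (prev_count, prev_ts), (count, ts) in zip(samples, samples[1:]):
        increase = count - prev_count
        if increase >= min_increase:
            jumps.append((prev_ts, ts, increase, increase / (ts - prev_ts)))
    return jumps


for start, end, increase, rate in find_jumps(samples):
    print(f"{start} -> {end}: +{increase:.0f} connections ({rate:.1f}/s)")
```

On this data it reports two jumps, each within a single 2-second sampling interval: +3223 connections, then +14366 connections.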
The vast majority of those 18,000 connections were stuck in CLOSE_WAIT on the server side:
asd 9962 root 206u IPv4 3552341 0t0 TCP ip-XXX.ec2.XXX:3000->ip-XX.ec2.XXX:57412 (CLOSE_WAIT)
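For anyone wanting to check the same thing, this is roughly how the per-state breakdown can be tallied from lsof output. It is a sketch that assumes each line ends with the TCP state in parentheses, as in the line above. The helper name `count_tcp_states` is hypothetical, and the second and third sample lines are invented for illustration.

```python
import re
from collections import Counter

# Matches a trailing "(STATE)" such as "(CLOSE_WAIT)" at the end of an lsof line.
STATE_RE = re.compile(r"\((\w+)\)\s*$")


def count_tcp_states(lsof_lines):
    """Count how many lsof lines end in each TCP state."""
    counts = Counter()
    for line in lsof_lines:
        match = STATE_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample_lines = [
    # Real line from the server (hostnames redacted):
    "asd 9962 root 206u IPv4 3552341 0t0 TCP ip-XXX.ec2.XXX:3000->ip-XX.ec2.XXX:57412 (CLOSE_WAIT)",
    # Hypothetical lines, made up for this example:
    "asd 9962 root 207u IPv4 3552342 0t0 TCP ip-XXX.ec2.XXX:3000->ip-XX.ec2.XXX:57413 (CLOSE_WAIT)",
    "asd 9962 root 208u IPv4 3552343 0t0 TCP ip-XXX.ec2.XXX:3000->ip-XX.ec2.XXX:57414 (ESTABLISHED)",
]

for state, count in count_tcp_states(sample_lines).most_common():
    print(state, count)
```

In practice the lines would come from lsof itself rather than a hard-coded list, for example by piping its output into the script and passing `sys.stdin` to the function.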
This has happened intermittently several times, and the only way I’ve been able to resolve it is by hup’ing asd.
For what it’s worth, this is with the PHP client library.
I see the following presentation recommending setting proto-fd-idle-ms to 10 seconds for PHP, but even that wouldn’t have helped in this situation, because the connection limit was hit within a couple of seconds.