Connections stuck in CLOSE_WAIT

I’m seeing a situation where, over a short period of time, the connection count ramps up dramatically and the connections seem to be stuck in CLOSE_WAIT.

Here’s a sample timeline from Graphite of [connections, timestamp]:

[871.0, 1423801310], [871.0, 1423801312], [871.0, 1423801314], [4094.0, 1423801316], [4094.0, 1423801318], [4094.0, 1423801320], [4094.0, 1423801322], [4094.0, 1423801324], [4094.0, 1423801326], [4094.0, 1423801328], [4094.0, 1423801330], [4094.0, 1423801332], [4094.0, 1423801334], [4094.0, 1423801336], [4094.0, 1423801338], [4094.0, 1423801340], [4094.0, 1423801342], [4094.0, 1423801344], [18460.0, 1423801346]
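For anyone who wants to pin down exactly when the count jumps, here is a minimal Python sketch. The samples are an abridged copy of the series above; the `find_jumps` helper and its `factor` threshold are my own illustration, not part of Graphite or Aerospike.

```python
# Samples copied (abridged) from the Graphite output above: [connections, timestamp].
samples = [
    [871.0, 1423801310], [871.0, 1423801312], [871.0, 1423801314],
    [4094.0, 1423801316], [4094.0, 1423801318],
    [4094.0, 1423801344], [18460.0, 1423801346],
]


def find_jumps(points, factor=2.0):
    """Return (timestamp, before, after) wherever the count grows by at least `factor`.

    `factor` is an arbitrary threshold chosen for illustration.
    """
    jumps = []
    for (prev, _), (cur, ts) in zip(points, points[1:]):
        if prev > 0 and cur / prev >= factor:
            jumps.append((ts, prev, cur))
    return jumps


for ts, before, after in find_jumps(samples):
    print(f"{ts}: {before:.0f} -> {after:.0f}")
```

On this data it flags two steps, each landing between consecutive 2-second samples: 871 to 4094 at 1423801316, and 4094 to 18460 at 1423801346.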

The vast majority of those 18,000 connections were stuck in CLOSE_WAIT on the server side:

asd 9962 root 206u IPv4 3552341 0t0 TCP ip-XXX.ec2.XXX:3000->ip-XX.ec2.XXX:57412 (CLOSE_WAIT)
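In case it helps others check their own nodes, this is roughly how the states can be tallied from saved lsof output. It is only a sketch: `count_tcp_states` is a hypothetical helper, and apart from the first sample line (taken from above) the input lines are made up for illustration.

```python
import re
from collections import Counter

# Matches the trailing "(STATE)" that lsof prints for TCP sockets.
STATE_RE = re.compile(r"\(([A-Z_0-9]+)\)\s*$")


def count_tcp_states(lsof_lines):
    """Count TCP states found at the end of each lsof line."""
    counts = Counter()
    for line in lsof_lines:
        match = STATE_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample = [
    "asd 9962 root 206u IPv4 3552341 0t0 TCP ip-XXX.ec2.XXX:3000->ip-XX.ec2.XXX:57412 (CLOSE_WAIT)",
    # The two lines below are invented examples, not taken from the real output.
    "asd 9962 root 207u IPv4 3552342 0t0 TCP ip-XXX.ec2.XXX:3000->ip-XX.ec2.XXX:57413 (ESTABLISHED)",
    "asd 9962 root 208u IPv4 3552343 0t0 TCP ip-XXX.ec2.XXX:3000->ip-XX.ec2.XXX:57414 (CLOSE_WAIT)",
]

for state, count in count_tcp_states(sample).most_common():
    print(state, count)
```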

This has intermittently happened several times, and I was only able to resolve it by hup’ing asd.

For what it’s worth, this is with the PHP client library.

I see the following presentation recommending setting proto-fd-idle-ms to 10 seconds for PHP, but even that wouldn’t have helped in this situation, because the connection limit was hit in the space of a couple of seconds.

Hey Cody,

Thanks for reporting the problem. I’d like to dig into this, and first I want to try and reproduce it consistently.

Can you add information about which release of the PHP client you are using, the OS and version on the client side, and the version of the server?

Are your scripts using persistent connections? Is this a webserver context (for example Nginx + PHP-FPM)? If so, could you share the webserver configuration?

Are you using shared memory for cluster tending? Could you specify what the aerospike.shm.* values are in your php.ini?

Thanks! Ronen

aerospike php client version 3.3.9
Amazon Linux AMI release 2014.09 3.14.27-25.47.amzn1.x86_64
aerospike-server-community-3.4.1-el6

It looks like Aerospike::__construct uses a persistent connection by default, which is what we're using.
Yes, Nginx + PHP-FPM
nginx-1.6.2-1.22.amzn1.x86_64

Differences from nginx base config:
worker_rlimit_nofile 200000;
tcp_nopush     on;
keepalive_timeout  0;
open_file_cache max=200000 inactive=20s;
open_file_cache_valid 30s;
open_file_cache_min_uses 2;
open_file_cache_errors on;
gzip on;
fastcgi_buffers 8 16k;
fastcgi_buffer_size 32k;


cat /etc/php.d/aerospike.ini:

extension=aerospike.so
aerospike.udf.lua_system_path=/opt/aerospike/client-php/sys-lua
aerospike.udf.lua_user_path=/opt/aerospike/client-php/usr-lua

Alrighty. I’ll look into it.

Can you please try explicitly passing true as the second parameter to the constructor? If that doesn’t change things, can you try making use of the shared-memory cluster tending?

You would add to your php.ini:

aerospike.shm.use=true

Ronen

One more thing: how many PHP processes do you have at the point where you see 18k CLOSE_WAITs? Can you quote your php-fpm.conf?

The issue seems to happen at low-load times of the day (i.e. the middle of the night), so approx 150 PHP processes across all frontend machines.

There was a very noticeable load difference switching to explicit false for persistent connections, so it seems fairly likely they were on. We’ll try shared memory cluster tending.

Relevant portions of the php-fpm config are:

pm = dynamic
pm.max_children = 150
pm.max_requests = 500
pm.max_spare_servers = 50
pm.min_spare_servers = 10
pm.start_servers = 30

Thanks for the details. We’re looking into it.

I have exactly the same issue with the Java client too. The Aerospike server is practically idle (waiting for production deployment), and the Aerospike client webapp is started but not getting any traffic. Server version 3.5.12, Java client 3.1.2.

Please let me know if you need more info or want me to make any configuration changes.

Two separate issues here. If the server side is idle, it may be reaping connections due to under-utilization, in which case you will see CLOSE_WAITs. I don’t think they’ll be in the thousands, though, and if the client isn’t doing much with the server then it’s not really a problem.

It’s a relatively expensive operation (in terms of CPU and time) to initialize the client. It needs to learn the cluster topology after it connects to the seed node, then open TCP connections to each of the newly discovered nodes. In most clients (Java, Go, C#, etc) we do this once, then hold onto the client and send all requests through it.

PHP, Python, and Ruby usually approach web applications in a different way. Since traditionally the problem was memory leaks in the interpreted code, the ‘solution’ was to severely limit the number of requests each process should handle, fork new ones, and kill the ones that maxed out their requests.

I would first suggest raising the max_requests value as much as possible, while monitoring the processes for their memory consumption. The less often the Aerospike object gets recycled (and with it the opening and closing of connections), the less this will occur.
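To put rough numbers on that, here is a back-of-the-envelope Python sketch. It assumes each FPM worker initializes the client once per lifetime and is recycled after max_requests requests. The total request count and the larger max_requests values are hypothetical; only 500 comes from the config quoted above.

```python
def client_initializations(total_requests, max_requests):
    """Estimate how many workers (and so client initializations) serve the load.

    Assumes one client initialization per worker lifetime, and that a worker
    is recycled after exactly `max_requests` requests.
    """
    # Ceiling division: a partially used worker still initialized the client once.
    return -(-total_requests // max_requests)


total = 1_000_000  # hypothetical number of requests, for illustration only

for max_requests in (500, 5_000, 50_000):
    print(max_requests, client_initializations(total, max_requests))
```

Under those assumptions, going from 500 to 50,000 requests per worker cuts the number of client initializations (and the connection churn that comes with them) by a factor of 100.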

We intend to investigate this further, however. I’d rather it worked better with a ‘standard’ FPM configuration.

@cody is that PHP non-ZTS? Check whether the extension_dir path is something like /usr/local/lib/php/extensions/no-debug-non-zts-20131226.

@rbotzer Yeah, it’s /usr/lib/php/extensions/no-debug-non-zts-20131226/

Please try the new client release 3.4.6 and read the Configuration in a Web Server Context section of the overview.