Connection errors from PHP client

We run Aerospike server 3.5.15-1 on Ubuntu 14.04 and periodically get server connection errors from PHP clients ([-1]Unable to connect to server). The PHP client version is 3.4.1.

This application creates up to 400 simultaneous connections to Aerospike. The server runs on an r3.xlarge EC2 instance and has plenty of available resources.

We followed the Aerospike tuning documentation and tried updating the proto-fd settings and the recommended OS parameters on the server, but it didn’t help:

   proto-fd-max 100000
   proto-fd-idle-ms 15000

This is how we initialize and use Aerospike:

$opts = array(Aerospike::OPT_CONNECT_TIMEOUT => 1250, Aerospike::OPT_WRITE_TIMEOUT => 5000);
$this->db = new Aerospike($config, false, $opts);

//set key
$aero_key = $this->db->initKey($this->keyspace, $this->table, $key);
$aero_value = array("value" => $value);
$status = $this->db->put($aero_key, $aero_value, $ttl, $options);

//get key
$aero_key = $this->db->initKey($this->keyspace, $this->table, $key);
$status = $this->db->get($aero_key, $result);

Aerospike stats before the crash:

Aug 27 2015 19:32:50 GMT: INFO (info): (thr_info.c::4828) trans_in_progress: wr 0 prox 0 wait 0 ::: q 0 ::: bq 0 ::: iq 0 ::: dq 0 : fds - proto (237, 16073516, 16073279) : hb (0, 0, 0) : fab (16, 16, 0)
Aug 27 2015 19:33:00 GMT: INFO (info): (thr_info.c::4828) trans_in_progress: wr 0 prox 0 wait 0 ::: q 0 ::: bq 0 ::: iq 0 ::: dq 0 : fds - proto (334, 16076516, 16076182) : hb (0, 0, 0) : fab (16, 16, 0)
Aug 27 2015 19:33:10 GMT: INFO (info): (thr_info.c::4828) trans_in_progress: wr 0 prox 0 wait 0 ::: q 0 ::: bq 0 ::: iq 1 ::: dq 0 : fds - proto (288, 16079478, 16079190) : hb (0, 0, 0) : fab (16, 16, 0)
Aug 27 2015 19:33:20 GMT: INFO (info): (thr_info.c::4828) trans_in_progress: wr 0 prox 0 wait 0 ::: q 0 ::: bq 0 ::: iq 0 ::: dq 0 : fds - proto (131, 16082477, 16082346) : hb (0, 0, 0) : fab (16, 16, 0)
Aug 27 2015 19:33:30 GMT: INFO (info): (thr_info.c::4828) trans_in_progress: wr 0 prox 0 wait 0 ::: q 0 ::: bq 0 ::: iq 0 ::: dq 0 : fds - proto (348, 16084665, 16084317) : hb (0, 0, 0)
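
To make the fd counters in these log lines easier to read, here is a minimal PHP sketch that pulls out the "fds - proto" triple and estimates how fast new client connections are being opened. It assumes the three numbers mean (currently open, total opened, total closed); that reading is inferred from the log itself, where the first number always equals the second minus the third, and is not taken from the server documentation. The function names parseProtoFds and openRate are hypothetical helpers, not part of any Aerospike tooling.

```php
<?php
// Extract the "fds - proto (a, b, c)" triple from one server log line.
// Assumption: a = currently open, b = total opened, c = total closed.
function parseProtoFds($line)
{
    if (!preg_match('/fds - proto \((\d+), (\d+), (\d+)\)/', $line, $m)) {
        return null; // line carries no proto fd counters
    }
    return array(
        'open'   => (int) $m[1],
        'opened' => (int) $m[2],
        'closed' => (int) $m[3],
    );
}

// New connections opened per second between two samples.
function openRate(array $prev, array $cur, $seconds)
{
    return ($cur['opened'] - $prev['opened']) / $seconds;
}

// The first two samples from the log above, logged 10 seconds apart.
$lines = array(
    'fds - proto (237, 16073516, 16073279) : hb (0, 0, 0) : fab (16, 16, 0)',
    'fds - proto (334, 16076516, 16076182) : hb (0, 0, 0) : fab (16, 16, 0)',
);

$prev = null;
foreach ($lines as $line) {
    $fds = parseProtoFds($line);
    if ($fds === null) {
        continue;
    }
    // Sanity check of the assumed field meanings.
    $consistent = ($fds['opened'] - $fds['closed'] === $fds['open']);
    printf("open=%d consistent=%s", $fds['open'], $consistent ? 'yes' : 'no');
    if ($prev !== null) {
        printf(" new_conn_per_sec=%.1f", openRate($prev, $fds, 10));
    }
    echo "\n";
    $prev = $fds;
}
```

Under that assumption, the first two samples differ by 3000 opened connections in 10 seconds, roughly 300 new connections per second against only a few hundred open at once. That level of churn would be consistent with connections being opened and closed per request rather than reused, though the log alone does not prove it.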

There are no corresponding errors in the server logs, and the server didn’t have to be restarted, so the problem seems to be on the client side. The connections are created from php-fpm.

Hi. First, I’d like to understand the problem. Are you seeing connection errors or a crash? And if it’s a crash, are you talking about the PHP process crashing or the server?

Next, how are you using PHP? Is it a CLI script (single, multi-process), is it a daemon, is it running in a webserver context (if so which server, and is it PHP-FPM or …)?

Last, it would be helpful to know which OS you’re on, the webserver type and version (if that’s the SAPI you’re using), and which PHP client release?

Hi Ronen,

There is no crash on either the server or the client side. We receive an Aerospike connection error on the PHP client, and then the client drops its connections and automatically reconnects.

We connect from an Nginx web server, version 1.9.1-1, on Ubuntu 12.04.5 LTS, kernel 3.2.0-36-virtual. This web server is in the AWS environment and runs on a c3.4xlarge instance. PHP and PHP-FPM version: 5.3.10. At this point I suspect that this issue is related to either the Aerospike PHP extension or some local networking issue.

Is the error I am getting ([-1]Unable to connect to server) a generic one that could indicate any network-related issue, including a local one?

Thanks, Eugene.

(Notes from the other thread:)

From the log segment, we can see that there are around 300 client connections open on the node at any one time, well under the 100000 limit in proto-fd-max.

If you are using multicast for heartbeats (and I think you are), the heartbeats of 0 are fine.

I expect that you have already looked at this, but are you able to check network connectivity between the client and server at the time of the failure? I know that under normal conditions, the client and the server happily coexist, but at the time of the failure, do you see any basic connectivity problems?

Do you happen to have other applications installed on the client machine? Do they have any similar failures, possibly at the time of the Aerospike client problems?

Do you have the client installed on more than one server? Do you maybe only see the connectivity errors on one of the servers?

I know you have already been looking at this, so I apologize if I am covering topics that you have already reviewed.

Thank you for your time,

-DM

One quick question - are the nodes of the cluster and the client instances all in the same Availability Zone within the same region? There is a common class of connection problems in EC2 that is covered in detail in this GitHub issue.

Basically, if your client cannot access all the nodes in the cluster you may hit this problem once you’re trying to reach a record on that node. On EC2 it is essential that your access-address contain the public IP address of the node in case your instances are not all on the same subnet. When the client connects it first establishes a connection to the seed node, learns about the other nodes, then it connects to those. Each of the other nodes is described with its access-address, and by default that will be the local IP address (because the public IP isn’t visible to the node). Please look at what I wrote in the issue I linked above.

Hi David,

We currently run Aerospike on a single node, so there are no multicast communications. I’ve done some network debugging, but haven’t found any issues so far. Other applications on the same machine don’t have such issues. My next step is to try running the Aerospike client from a different node with a newer Ubuntu and PHP version.

Thanks, Eugene.

I only use a single Aerospike client node, and this node is in the same availability zone as a PHP client. The client communicates with the AS server through the standard AWS ec2-xx address, which resolves to a private IP.

Can you share the php.ini used by PHP-FPM? Mainly I’d like to know what the max_requests is set to. For reference, see this discussion.

We don’t set max_requests, so it is disabled (0) by default:

listen.backlog = -1
pm = dynamic
pm.max_children = 400
pm.start_servers = 75
pm.min_spare_servers = 50
pm.max_spare_servers = 300
request_slowlog_timeout = 10s
request_terminate_timeout = 0

Okay, please give it a high, explicit max_requests (100k, for example), make sure you’re using persistent connections, and configure aerospike.shm.use=true. Let me know how it behaves after that.
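
As a rough illustration of the persistent-connection part of this advice: in the Aerospike PHP client the second constructor argument controls persistent connections, and the initialization quoted earlier in the thread passes false there. The sketch below passes true instead and checks the connection state. The host address, the port, and the helper name connectPersistent are placeholders, and the two settings in the comments are written out from the advice above as an assumption about how it would be applied (pm.max_requests in the php-fpm pool file, aerospike.shm.use in php.ini).

```php
<?php
// Sketch only; it assumes the Aerospike PHP extension is loaded.
// Related settings suggested above (not PHP code):
//   php-fpm pool file:  pm.max_requests = 100000
//   php.ini:            aerospike.shm.use = true

// Hypothetical helper: open a persistent connection or return null.
function connectPersistent(array $config, array $opts)
{
    // Second argument true = keep the connection for reuse across requests.
    $db = new Aerospike($config, true, $opts);
    if (!$db->isConnected()) {
        // Log the client-side error code and message for later diagnosis.
        error_log(sprintf('Aerospike connect failed [%d] %s', $db->errorno(), $db->error()));
        return null;
    }
    return $db;
}

if (class_exists('Aerospike')) {
    // Placeholder seed host; replace with the real server address.
    $config = array('hosts' => array(array('addr' => '127.0.0.1', 'port' => 3000)));
    $opts = array(
        Aerospike::OPT_CONNECT_TIMEOUT => 1250,
        Aerospike::OPT_WRITE_TIMEOUT   => 5000,
    );
    $db = connectPersistent($config, $opts);
}
```

With persistent connections, each php-fpm worker is expected to reuse its client connection instead of reconnecting on every request, which is why the worker lifetime (max_requests) matters here.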

Also, if you could use the new PHP client 3.4.2 release, that would help.