Network bottleneck on GCP

Hello,

I have an Aerospike cluster on GCP:

  • 21 nodes, each with 16 CPUs / 64 GB
  • In-memory only
  • Between 4 and 5 million transactions per second (MTps)
  • Latest version (3.14.1.2-1), kernel 4.4.0-93-generic

It seems that the network is now the bottleneck.

  • Each node uses 500 Mbit/s of bandwidth.
  • I sometimes see this error in kern.log:

net eth0: Unexpected TXQ (2) queue failure: -28

  • In aerospike.log, I see errors such as:

could not create heartbeat connection to node xxx

  • Tuning txqueuelen, net.core.rmem_max, net.ipv4.tcp_rmem, and net.ipv4.tcp_congestion_control does not change anything.
  • The only workaround I have found is to run ethtool -L eth0 combined 8, or 4 (the default value is 16). It seems to help a lot, but I’m not sure why.
  • Sometimes kicking a badly performing server out of the cluster seems to help. Maybe some GCP nodes have less available bandwidth.
  • Adding or removing nodes (19 or 23 instead of 21) does not change performance much: with more nodes the global load is lower, but the latency issues are still there.
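
To put the figures above into perspective, here is a rough back-of-the-envelope check written as a small Python sketch. It assumes the load is spread evenly over the 21 nodes and that the 500 Mbit/s figure covers all traffic for a node; both are simplifications, and the function names are purely illustrative. It also decodes the -28 from kern.log on the assumption that it is a negative Linux errno, which is the usual kernel convention.

```python
import errno
import os

NODES = 21
CLUSTER_TPS = (4_000_000, 5_000_000)  # "between 4 and 5 MTps"
NODE_BANDWIDTH_BITS = 500_000_000     # 500 Mbit/s per node


def per_node_tps(cluster_tps: int, nodes: int = NODES) -> float:
    """Transactions per second on one node, assuming an even spread."""
    return cluster_tps / nodes


def bytes_per_transaction(cluster_tps: int, nodes: int = NODES,
                          bandwidth_bits: int = NODE_BANDWIDTH_BITS) -> float:
    """Average bytes on the wire per transaction for one node."""
    return (bandwidth_bits / 8) / per_node_tps(cluster_tps, nodes)


def describe_kernel_error(code: int) -> str:
    """Map a negative kernel return code such as -28 to its errno name."""
    return errno.errorcode[abs(code)]


if __name__ == "__main__":
    for tps in CLUSTER_TPS:
        print(f"{tps:,} TPS cluster-wide: "
              f"{per_node_tps(tps):,.0f} TPS per node, "
              f"about {bytes_per_transaction(tps):.0f} bytes per transaction")
    print("-28 is", describe_kernel_error(-28), "-", os.strerror(28))
```

Under these assumptions each node handles roughly 190,000 to 238,000 transactions per second at about 260 to 330 bytes per transaction, so the traffic is probably made of many small packets, and the packet rate may matter more than raw bandwidth. The code -28 maps to ENOSPC ("No space left on device"), which for a transmit queue would suggest the queue was full; that reading is my interpretation, not something the log states.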

Here are some questions:

  • Do you think 21 small servers is too many for Aerospike? I prefer small servers to avoid transaction bottlenecks.
  • Do you have any idea what I can tune in the network config, or what I can check / monitor?
  • Do you think using two network cards (one for clients, one for the cluster) would help? It seems to be possible on GCP, but not easy, and I do not think they would be mapped to different physical network cards.
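
On the monitoring side, one thing that could be sampled on each node is the per-interface packet rate and drop counters from /proc/net/dev. The following is only a minimal sketch, assuming the standard Linux layout of that file (two header lines, then one line per interface with 16 counters); the function names are hypothetical and do not come from any Aerospike or GCP tooling.

```python
import os
import time
from typing import Dict

# Column order of /proc/net/dev: 8 receive counters, then 8 transmit counters.
FIELDS = (
    "rx_bytes", "rx_packets", "rx_errs", "rx_drop",
    "rx_fifo", "rx_frame", "rx_compressed", "rx_multicast",
    "tx_bytes", "tx_packets", "tx_errs", "tx_drop",
    "tx_fifo", "tx_colls", "tx_carrier", "tx_compressed",
)


def parse_net_dev(text: str) -> Dict[str, Dict[str, int]]:
    """Parse the contents of /proc/net/dev into {interface: {counter: value}}."""
    stats: Dict[str, Dict[str, int]] = {}
    for line in text.splitlines()[2:]:  # skip the two header lines
        if ":" not in line:
            continue
        name, _, values = line.rpartition(":")
        stats[name.strip()] = dict(zip(FIELDS, (int(v) for v in values.split())))
    return stats


def counter_rates(before: Dict[str, int], after: Dict[str, int],
                  seconds: float) -> Dict[str, float]:
    """Per-second rate of change for every counter between two samples."""
    return {key: (after[key] - before[key]) / seconds for key in before}


def read_stats(path: str = "/proc/net/dev") -> Dict[str, Dict[str, int]]:
    """Read and parse the live counters from the given path."""
    with open(path) as handle:
        return parse_net_dev(handle.read())


if __name__ == "__main__":
    if os.path.exists("/proc/net/dev"):
        first = read_stats()
        time.sleep(1.0)
        second = read_stats()
        for iface in first:
            if iface not in second:
                continue
            r = counter_rates(first[iface], second[iface], 1.0)
            print(f"{iface}: tx {r['tx_packets']:.0f} pkt/s, "
                  f"{r['tx_bytes'] * 8 / 1e6:.1f} Mbit/s, "
                  f"tx_drop {r['tx_drop']:.0f}/s, rx_drop {r['rx_drop']:.0f}/s")
    else:
        print("/proc/net/dev not available on this system")
```

Comparing the packets-per-second and drop rates between a node that behaves well and one that does not might show whether the slow nodes really get less network capacity.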

Thanks