I have an Aerospike cluster on GCP:
- 21 nodes, 16 vCPUs / 64 GB each
- In-memory only
- Between 4 and 5 million TPS
- Latest Aerospike version (126.96.36.199-1), kernel 4.4.0-93-generic
It seems that now the network is the bottleneck.
- Each node uses about 500 Mbit/s of bandwidth.
- Sometimes I get this error in kern.log (-28 is -ENOSPC, which I understand to mean the TX queue ran out of descriptors):
net eth0: Unexpected TXQ (2) queue failure: -28
- In aerospike.log, I can see errors like:
could not create heartbeat connection to node xxx
- Tuning txqueuelen, net.core.rmem_max, net.ipv4.tcp_rmem, and net.ipv4.tcp_congestion_control does not change anything.
- The only workaround I have found is ethtool -L eth0 combined 8 (or 4; the default value is 16). It seems to help a lot, and I am not sure why.
- Sometimes kicking a badly performing server out of the cluster seems to help. Maybe some GCP nodes have less available bandwidth.
- Adding or removing nodes (19 or 23 instead of 21) does not change performance much (with more nodes the per-node load is lower, but the latency issues are still there).
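For reference, the tuning I tried looks roughly like this (the specific values below are just examples of what I experimented with, not a recommendation):

```shell
# Raise the interface transmit queue length (default is 1000)
ip link set dev eth0 txqueuelen 10000

# Enlarge socket receive buffers (example values)
sysctl -w net.core.rmem_max=16777216
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"

# Switch congestion control; pick any algorithm listed in
# net.ipv4.tcp_available_congestion_control (cubic is the default)
sysctl -w net.ipv4.tcp_congestion_control=htcp
```

None of these changed anything in my case.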
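And the ethtool workaround, with the commands I use to inspect the queues before and after (the queue counts and counter names are what I see on GCP's virtio-net interface; they may differ elsewhere):

```shell
# Show the current channel (queue) configuration; default here is 16 combined
ethtool -l eth0

# Reduce the number of combined TX/RX queues to 8 (or 4)
ethtool -L eth0 combined 8

# Per-queue TX counters, to see which queues are saturating
ethtool -S eth0 | grep -i tx

# Per-CPU backlog drops show up in the second column here
cat /proc/net/softnet_stat
```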
Here are some questions:
- do you think that 21 small servers is too many for Aerospike? I prefer to have small servers to avoid transaction bottlenecks.
- do you have any idea what I can tune in the network config? Or what I can check / monitor?
- do you think using two network interfaces (one for clients, one for the cluster) could help? It seems to be possible on GCP, but not easy. And I do not think they would be mapped to different physical network cards.