Broken pipe errors and apparently random migrations (and a fix)

We have been seeing some issues recently where we would get what appeared to be cluster membership changes (and then migrations, which would impact latency).

The logs would report messages like this:

```
May 29 2019 18:41:28 GMT: WARNING (hb): (hb.c:5138) (repeated:1) sending mesh message to 1117 on fd 736 failed : Broken pipe
```

We were seeing these quite frequently (1000s of times an hour across the cluster), and sometimes (but not always) they would cause a node to disappear completely and get evicted from the cluster for a short period (maybe 10-30 seconds). Inspection of the traffic (via tcpdump) suggested that packets would simply go missing for a period of time, and while hardware (or cable) issues were a possibility, it seemed unlikely that they would affect all of the nodes across multiple clusters.
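For what it's worth, the capture was nothing fancy. Something along these lines (assuming the default mesh heartbeat port of 3002 and eth0 as the cluster interface; the tshark pass for spotting retransmissions and resets is just one way to sift the result):

```bash
# Capture mesh heartbeat traffic on the interface carrying cluster traffic
# (port 3002 is the default heartbeat port; check your heartbeat stanza)
tcpdump -i eth0 -nn 'tcp port 3002' -w hb_capture.pcap

# Then look for retransmissions and resets in the capture, e.g. with tshark
tshark -r hb_capture.pcap -Y 'tcp.analysis.retransmission || tcp.flags.reset == 1'
```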

This was most prevalent in our larger clusters (the specific cluster we have been tuning has 20 nodes, each with 768GB of RAM, and we are using TCP heartbeats – I wish we had gone with UDP, but we are stuck with TCP for now). Strangely, it was most apparent in a cluster that has only XDR traffic, no direct IOs.

It was our theory that we were processing packets too slowly, so in an effort to tune this we changed the following settings (a consolidated sketch of the commands appears after the list):

  1. First, we set tuned to run the network-latency profile. This does a number of things, but by itself wasn't enough. So…
  2. sysctl vm.swappiness=0 – recommended here: How to tune the Linux kernel for memory performance
  3. 5% of RAM > /proc/sys/vm/min_free_kbytes – also from the above link
  4. sysctl net.ipv4.tcp_slow_start_after_idle=0
  5. sysctl net.ipv4.tcp_no_metrics_save=1
  6. sysctl net.core.netdev_max_backlog=30000

The last three are from various sources (note that we use 10GigE, so these may not be applicable): TCP — Performance Tuning on Linux; GitHub - ton31337/tools; Tuning 10Gb NICs highway to hell – You shall not pass
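Pulled together, the changes look roughly like this. It's a sketch rather than a drop-in script: run it as root, and sanity-check the values for your own hardware (the 5% calculation is the one from step 3):

```bash
#!/usr/bin/env bash
set -euo pipefail

# 1. Low-latency tuned profile
tuned-adm profile network-latency

# 2. Avoid swapping out anonymous memory unless absolutely necessary
sysctl -w vm.swappiness=0

# 3. Reserve ~5% of RAM as free pages. This is the setting that fixed it for us,
#    but note it can cause a momentary network stall while the kernel reclaims memory.
total_kb=$(awk '/MemTotal/ {print $2}' /proc/meminfo)
echo $(( total_kb * 5 / 100 )) > /proc/sys/vm/min_free_kbytes

# 4-6. TCP and network queue tuning
sysctl -w net.ipv4.tcp_slow_start_after_idle=0
sysctl -w net.ipv4.tcp_no_metrics_save=1
sysctl -w net.core.netdev_max_backlog=30000
```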

It should also be noted that step #3 caused momentary loss of network on each node as we set it, so be careful doing this on live clusters. That also gives a pretty strong hint as to the performance issue we were seeing… apparently walking the page table to flush cached memory was taking a significant amount of time, and causing us to lose packets. The other settings helped, but it was #3 that finally reduced the broken connections down to nothing.
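If you want to check whether you're hitting the same thing, counting the heartbeat warnings and watching the kernel and NIC drop counters before and after the change is a quick sanity check. The log path and interface name below are assumptions; adjust them for your setup:

```bash
# Count the broken-pipe heartbeat warnings in the log
# (use whatever path your Aerospike config actually logs to)
grep -c 'Broken pipe' /var/log/aerospike/aerospike.log

# Kernel-level drop/prune counters
netstat -s | grep -iE 'drop|prune|collapse'

# Per-NIC drop and error counters (counter names vary by driver)
ethtool -S eth0 | grep -iE 'drop|discard|error'
```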

5% might be a little on the high side, and possibly 1-3% would have been enough, but due to the impact this has, I’m reluctant to experiment too much.
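For a sense of scale on our 768GB nodes, the candidate percentages work out roughly like this (just arithmetic, showing how much RAM each value holds back as free pages):

```bash
# min_free_kbytes candidates for a 768GB node (768 * 1024 * 1024 kB total)
total_kb=$(( 768 * 1024 * 1024 ))
for pct in 1 3 5; do
  printf '%d%% -> %d kB (~%d GB)\n' "$pct" $(( total_kb * pct / 100 )) $(( 768 * pct / 100 ))
done
```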

Suffice it to say, this is going to be part of our standard tuning for AS nodes now, and hopefully this helps someone else.


Thanks for taking the time to post this. Most people just figure out their problems and go on their way :upside_down_face: Did you get a chance to take a look at Upgrading to 10G network? I haven't seen a few of the sysctl params you posted, so I'll check them out!!

I had not seen that one, but it contains the values I was considering tweaking for some further optimisations. Nice to see some confirmation, thanks!
