Distribute fabric connections across multiple network links


#1

Hi, while it is possible to aggregate network link bandwidth towards clients, and between nodes in a cluster with many nodes, through channel bonding/trunking, as far as I know the network traffic between two IPs always goes through the same network link, which limits node-to-node and node-to-client bandwidth to that of a single link.

In a two-node cluster, I’m saturating the bandwidth of a single 1Gbit link (10Gbit is not an option right now).

I noticed that cluster-fabric network traffic is already split across multiple connections, and configuring multiple addresses in the fabric{} stanza of the configuration makes Aerospike listen on the specified IPs, but network connections seem to always go towards the first listed address. Would it be possible to spread different connections across all listed addresses? I found no mention of this in the docs. This way, it would be quite simple to aggregate traffic by just having different network links on different subnets, with one IP per subnet in the fabric{} stanza of the configuration. This technique could also be used to aggregate bandwidth to/from clients (which already make multiple connections towards cluster nodes; there would just need to be multiple configurable advertised addresses).

Thanks, regards.

Giacomo


#2

More on this topic here for reference: Using multiple network cards


#3

Hi Giacomo,

Yes, by default, Linux’s bonding network interface driver only uses the source and destination IP addresses to assign a TCP connection to a bonded network interface. So, if you have multiple TCP connections between two machines, they’d always use the same network interface.

However, you can change this behavior. The bonding driver can optionally also consider the TCP source and destination ports (in addition to the source and destination IP addresses). This would balance multiple TCP connections between two machines across all bonded interfaces.

The option that you want to pass to the bonding driver is xmit_hash_policy=layer3+4. This instructs the bonding driver to use IP addresses (layer 3) as well as TCP (or UDP) ports (layer 4) to select a network interface.

What’s worked in the past for me are the following bonding options:

mode=802.3ad xmit_hash_policy=layer3+4 miimon=100

I used this setup between clients and cluster nodes to avoid network bottlenecks during benchmarking. But it should just as well apply to fabric traffic. I think that this might be worth a try in your scenario.
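To make the difference between the two policies concrete, here is a minimal Python sketch. It is not the driver's code: the formulas are modeled on the ones given in older versions of the kernel bonding documentation (newer kernels hash differently), and the addresses and ports are made up for illustration.

```python
import ipaddress


def ip_only_slave(src_ip, dst_ip, slave_count):
    """Pick a slave from the IP addresses only (simplified, ports ignored)."""
    src = int(ipaddress.IPv4Address(src_ip))
    dst = int(ipaddress.IPv4Address(dst_ip))
    return ((src ^ dst) & 0xFFFF) % slave_count


def layer3_4_slave(src_ip, src_port, dst_ip, dst_port, slave_count):
    """Pick a slave from IP addresses and ports, modeled on the layer3+4
    formula in older bonding documentation (an assumption, not kernel code)."""
    src = int(ipaddress.IPv4Address(src_ip))
    dst = int(ipaddress.IPv4Address(dst_ip))
    return ((src_port ^ dst_port) ^ ((src ^ dst) & 0xFFFF)) % slave_count


# Hypothetical example: ten fabric connections from one node to port 3001
# on the other node, each with a different ephemeral source port.
SRC, DST, FABRIC_PORT, SLAVES = "192.168.0.1", "192.168.0.2", 3001, 2
source_ports = range(40000, 40010)

by_ip = [ip_only_slave(SRC, DST, SLAVES) for _ in source_ports]
by_ip_and_port = [
    layer3_4_slave(SRC, port, DST, FABRIC_PORT, SLAVES) for port in source_ports
]

print("address-only hash:", by_ip)
print("address+port hash:", by_ip_and_port)
```

With these made-up values the address-only hash puts all ten connections on the same slave, while the port-aware hash uses both slaves.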

Thomas


#4

Hi Thomas,

thanks for your reply. I know there are ways to distribute connections across links; one could also configure bonding in balance-rr (round-robin) mode and distribute outgoing traffic evenly across all network links. What concerns me is the receiving side: in the majority of cases the packets distributed on the sending side will actually be received on just one network link on the receiving side (this is because only one bonding slave will actually reply to ARP requests).

I don’t know how packets are delivered in 802.3ad mode. The docs only talk about the sending side, and xmit_hash_policy=layer3+4 may not be fully supported in this mode, since it can lead to out-of-order packet delivery:

https://www.kernel.org/doc/Documentation/networking/bonding.txt

		This algorithm is not fully 802.3ad compliant.  A
		single TCP or UDP conversation containing both
		fragmented and unfragmented packets will see packets
		striped across two interfaces.  This may result in out
		of order delivery.  Most traffic types will not meet
		this criteria, as TCP rarely fragments traffic, and
		most UDP traffic is not involved in extended
		conversations.  Other implementations of 802.3ad may
		or may not tolerate this noncompliance.

What I don’t know, at this point, is whether 802.3ad will also spread connections across links on the receiving side (and maybe this is the case, since switch support must be configured for this kind of link aggregation). I’m doing some research now to confirm this, and if the network guys are able to configure the switches for 802.3ad, I’ll certainly give this a try.

Thanks, Giacomo


#5

Hi Giacomo,

Oh, yes, right! I had forgotten about this. Yes, the switch does need to support layer 4, too. In my case, I was lucky enough to have a switch that supports a setting that’s equivalent to Linux’s xmit_hash_policy=layer3+4. And I did have to enable it on the switch.

What I suggested only helps with data going from Linux to the switch. Not in the other direction. You are correct.

Thomas


#6

Oh, wait a second… I might just have had another idea. Maybe we can use iptables as a somewhat hacky workaround?

I just tried something in my simplistic setup. I’m running two asd processes on my machine for a two-node cluster. One process uses ports 3000+, the other uses ports 3100+.

Let’s look at connections to port 3001. The other direction (to port 3101) should work analogously. In my case, fabric connections use 127.0.0.1. But we can use iptables' probabilistic NAT to rewrite 50% of fabric connections to port 3001 to connect to, say, 127.0.0.2 instead of 127.0.0.1.

I just did this:

iptables --table nat --append OUTPUT --protocol tcp --dport 3001 --match state --state NEW --match statistic --mode random --probability 0.5 --jump DNAT --to-destination 127.0.0.2

The --probability 0.5 setting applies the NAT rule to 50% of connections, which makes ~50% of fabric connections to port 3001 go to 127.0.0.1 (no NAT) and the other 50% go to 127.0.0.2 (NAT applied).

Both nodes are configured to listen on any IP address for fabric connections (and all 127.x.x.x addresses are just the same as 127.0.0.1).

So, refining the above iptables command and adapting it to your environment would lead to TCP connections of which 50% use one IP address and the other 50% use the other.

We’d have to choose the two IP addresses in a way that would make connections to them end up on two different interfaces: on Linux as well as on the switch. This might require trying a few IP addresses.

Or maybe it would be simpler to just use 10 different IP addresses and --probability 0.1.

In this way, we could do entirely without layer 4 and we wouldn’t have to touch the switch configuration, no? Or am I again missing something?
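One detail to watch when chaining several such rules: iptables evaluates them in order, and each rule only sees the connections that no earlier rule matched. Giving every rule the same --probability therefore does not produce equal shares. The following Python sketch (the function names and the extra 127.0.0.x addresses are made up for illustration) computes per-rule probabilities that do, and only prints the resulting commands rather than running them.

```python
from fractions import Fraction


def cascade_probabilities(num_destinations):
    """Per-rule probabilities that give every destination an equal share.

    The last destination is the original, un-NATed address, so it needs
    no rule: there are num_destinations - 1 rules in total.
    """
    return [Fraction(1, num_destinations - i) for i in range(num_destinations - 1)]


def effective_shares(probabilities):
    """Share of new connections each rule really gets when rules are
    evaluated in order; the final entry is what is left un-NATed."""
    remaining = Fraction(1)
    shares = []
    for p in probabilities:
        shares.append(remaining * p)
        remaining *= 1 - p
    shares.append(remaining)
    return shares


def dnat_rules(port, alternates):
    """Build (but do not run) one DNAT rule per alternate address."""
    probs = cascade_probabilities(len(alternates) + 1)
    return [
        "iptables --table nat --append OUTPUT --protocol tcp "
        f"--dport {port} --match state --state NEW "
        "--match statistic --mode random "
        f"--probability {float(p):.4f} --jump DNAT --to-destination {addr}"
        for addr, p in zip(alternates, probs)
    ]


# Same probability on every rule: the shares shrink rule by rule.
print(effective_shares([Fraction(1, 10)] * 3))
# Cascaded probabilities: every destination gets the same share.
print(effective_shares(cascade_probabilities(4)))
for rule in dnat_rules(3001, ["127.0.0.2", "127.0.0.3"]):
    print(rule)
```

With a single rule and two destinations this reduces to the --probability 0.5 used above.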

Thomas


#7

Hi,

regarding LACP/802.3ad, I found this thread:

and it seems that link aggregation happens on the receiving side, too (look at the results of the last test: it almost reaches 4Gbit/s). I'm still waiting to hear from the network guy whether we’re able to try it. If so, I’ll let you know.

Regarding iptables: I found that, too. I haven't had the chance to test it yet. I’m unsure whether connection tracking will handle the rest of the connection or whether that rule will actually only DNAT the first packet (matching the “--state NEW” condition). Can you confirm that the rule is enough? Surely, this would cause a bit of overhead anyway, since the kernel would have to check all the packets for DNAT.

Another thing I was considering is that with iptables you have no fault tolerance: if one of the links goes down, iptables won’t notice, and will continue to route half of the new connections to the failed link. It would be up to Aerospike to detect the failing connections and keep retrying until all the connections come up on the working link (sure, one could write a simple link-monitoring script and change iptables rules according to link status). And then, there could be no failback if the failed link is restored. Actually, a failback would require a restart of the Aerospike process on the node with the failed/restored link (but this would probably be the case even if the feature I requested was implemented, unless someone added link-failure/link-restore detection to Aerospike).
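The link-monitoring script mentioned in passing could look roughly like the Python sketch below. It is only an outline under several assumptions: the interface, addresses, and function names are hypothetical, the link state is read from the kernel's /sys/class/net/<iface>/operstate file, and the iptables command is built but not executed. It does nothing about connections already established over a failed link.

```python
from pathlib import Path


def link_is_up(iface, sysfs_root="/sys/class/net"):
    """Return True if the kernel reports the interface as operationally up."""
    try:
        state = Path(sysfs_root, iface, "operstate").read_text().strip()
    except OSError:
        # Missing interface or unreadable file: treat the link as down.
        return False
    return state == "up"


def reconcile(link_up, rule_present):
    """Decide what to do with the DNAT rule that sends traffic to this link."""
    if link_up and not rule_present:
        return "add"
    if not link_up and rule_present:
        return "delete"
    return None


def rule_command(action, dst_ip, port, alt_ip):
    """Build the iptables argument list to add or delete the DNAT rule."""
    flag = {"add": "--insert", "delete": "--delete"}[action]
    return [
        "iptables", "--table", "nat", flag, "OUTPUT",
        "--destination", dst_ip, "--protocol", "tcp", "--dport", str(port),
        "--match", "state", "--state", "NEW",
        "--match", "statistic", "--mode", "nth", "--every", "2", "--packet", "0",
        "--jump", "DNAT", "--to-destination", alt_ip,
    ]


# Hypothetical usage: eth1 carries the DNATed half of the connections.
action = reconcile(link_is_up("eth1"), rule_present=True)
if action is not None:
    print(" ".join(rule_command(action, "192.168.0.2", 3001, "192.168.1.2")))
```

Run periodically, this would stop sending new connections to a dead link and resume once the link is back.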

I made a variant of the iptables setup using HAProxy for load balancing and fault tolerance (if you’re interested in crazy things I can give you the details, although you may figure them out), but I’m worried about the overhead: that would make 1Gbit/s of traffic from Aerospike to HAProxy inside one node, then 1Gbit/s from HAProxy to the other node, then there would have to be a second NAT on the receiving side (iptables? socat? Another HAProxy?) and then from there to the receiving Aerospike. I tried this on a couple of virtual machines, and it works. But to get an idea of the total overhead I would have to try it with the real data stream.

I’ll post some results (if, and) as soon as I have them.

Thanks, Giacomo


#8

Hey Giacomo,

Yes, the iptables command is stateful, i.e., it applies the NAT to all packets of a TCP connection.

Also, I didn’t mean to suggest using iptables NAT instead of bonding. What I meant to say is to use NAT across the bonding interface. In this way you wouldn’t lose fault tolerance.

Suppose that you have two Ethernet interfaces, eth0 and eth1, which are combined into one bonding interface, bond0. This is my thinking:

  • bond0 assigns an outgoing packet to eth0 or eth1 based on the source and destination IP address of the packet.

  • Our problem arises, because all fabric connections share the same source and destination IP address. This means that bond0 assigns them all to eth0 or all to eth1.

  • What we need to accomplish in order to use eth0 as well as eth1 is to make the different fabric connections look different to bond0. One way to do this is to make bond0 also consider the port information: different fabric connections have different (local) ports. This is the “layer3+4” thing that we looked into.

  • Another way to do this would be to not consider the port information, but instead modify the source or destination IP address of the packets of the TCP connections. This is what I was trying to accomplish with iptables NAT.

  • Suppose that we have fabric connections between addresses 1.1.1.1 and 2.2.2.2. Suppose that bond0 assigns these connections to eth0. My idea was to rewrite 2.2.2.2 to a different IP address, such that the connection to that different IP address would be assigned by bond0 to eth1. We would do this rewriting for 50% of connections. Thus, bond0 would now assign 50% of connections to eth1.

  • We need to find a rewrite IP address that works, though. We could try rewriting 2.2.2.2 to 2.2.2.3, 2.2.2.4, 2.2.2.5, … and very soon we would find an address that works, i.e., that makes bond0 send the packets via eth1 instead of eth0. After all, the likelihood for each different destination IP address to make bond0 go via eth1 is 50%. bond0 basically does deterministic random assignment: a given pair of source and destination IP addresses ends up on a randomly picked interface, but it’s consistently the same interface for all connections between these two IP addresses.

  • This only solves one direction: Linux -> switch. We would actually have to find an IP address that makes both, Linux as well as the switch, use the link from / to Linux’s eth1. To find this IP address, we could again try rewriting the destination IP address to 2.2.2.3, 2.2.2.4, 2.2.2.5, … until we find an IP address that makes both use the eth1 link.

  • All in all, we’d need to find an IP address that works, i.e., that makes bond0 use eth1, and we’d then rewrite the destination IP address of 50% of fabric connections. These connections would then go via eth1. The rest, the connections that aren’t rewritten, would keep going via eth0.

  • Alternatively, we could make things easier for ourselves and do this: Rewrite 10% of connections to 2.2.2.3, another 10% to 2.2.2.4, another 10% to 2.2.2.5, etc. Each of these rewrites has a 50% chance of hitting eth0 and a 50% chance of hitting eth1. So, on average, we’d expect 50% of fabric connections to hit eth0 and 50% to hit eth1.

  • This would be true for both: for Linux and for the switch. In this way, we wouldn’t have to try to pick an “IP address that works” (i.e., that sends traffic through the eth1 link on Linux as well as on the switch). By rewriting to more than 1 IP address, we could simply rely on the fact that, on average, 50% of IP addresses make bond0 pick eth0 and 50% of IP addresses make bond0 pick eth1.

  • Fault tolerance would still work. In case of a failure of either eth0 or eth1, the bonding device would simply exclusively use the surviving device, regardless of the IP address pair used by an IP packet.

Or am I missing something?

Thomas


#9

Hi Thomas, it could work. Moreover, the slave assignment should be deterministic, not random. In the documentation for the layer2+3 transmit hash policy:

https://www.kernel.org/doc/Documentation/networking/bonding.txt

the formula for the slave assignment is shown, so one could always be able to come up with specific IP addresses that would force the connections to use different links. Assuming (but this should be verified) that the slaves used by the switches to deliver traffic to the host are the same used by the host for outgoing traffic (this could be true only for 802.3ad), this should work both ways (in/out).
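As a rough illustration of picking addresses from the formula, here is a Python sketch that filters candidate destination addresses down to those that hash to a different slave than the original destination. Treat it as an approximation: the formula is modeled on the layer2+3 one from older versions of the bonding documentation, it assumes only the last byte of each MAC address takes part, the MAC bytes and candidate addresses are invented, and newer kernels compute the hash differently.

```python
import ipaddress


def layer2_3_slave(src_ip, dst_ip, src_mac_last, dst_mac_last, slave_count):
    """Slave index modeled on the older documented layer2+3 formula
    (assumption: only the last byte of each MAC address is used)."""
    src = int(ipaddress.IPv4Address(src_ip))
    dst = int(ipaddress.IPv4Address(dst_ip))
    ip_part = (src ^ dst) & 0xFFFF
    mac_part = src_mac_last ^ dst_mac_last
    return (ip_part ^ mac_part) % slave_count


def addresses_on_other_slave(src_ip, dst_ip, candidates,
                             src_mac_last, dst_mac_last, slave_count=2):
    """Candidates that would leave through a different slave than dst_ip."""
    base = layer2_3_slave(src_ip, dst_ip, src_mac_last, dst_mac_last, slave_count)
    return [
        ip for ip in candidates
        if layer2_3_slave(src_ip, ip, src_mac_last, dst_mac_last, slave_count) != base
    ]


# Hypothetical values, reusing the 1.1.1.1 / 2.2.2.2 example addresses.
candidates = ["2.2.2.3", "2.2.2.4", "2.2.2.5", "2.2.2.6"]
print(addresses_on_other_slave("1.1.1.1", "2.2.2.2", candidates, 0x10, 0x20))
```

The destination MAC stays the same for every candidate because all of the addresses belong to the same host. Whether the switch makes the same choice for the return traffic still has to be verified separately.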

OR, one could make 2 active/backup bonds over 4 slaves and redirect connections across them using iptables NAT, and forget about 802.3ad completely :wink:

Thanks, Giacomo

EDIT: some similar considerations about hashing algorithms:


#10

Hi, so for now I’m opting for this setup:

bond2 is active/backup with 2 slaves and IP 192.168.0.X/28 (.1 is host 1 and .2 is host 2)
bond3 is active/backup with 2 slaves and IP 192.168.1.X/28 (.1 is host 1 and .2 is host 2)
(bond0 and bond1 are used for client connections)

There are 3 iptables rules on each host:

host1:

/sbin/iptables -t nat -A POSTROUTING -o bond2 -j SNAT --to-source 192.168.0.1
/sbin/iptables -t nat -A POSTROUTING -o bond3 -j SNAT --to-source 192.168.1.1
/sbin/iptables -t nat -I OUTPUT -d 192.168.0.2 -p tcp --dport 3001 -m state --state NEW -m statistic --mode nth --every 2 --packet 0 -j DNAT --to-destination 192.168.1.2

host2:

/sbin/iptables -t nat -A POSTROUTING -o bond2 -j SNAT --to-source 192.168.0.2
/sbin/iptables -t nat -A POSTROUTING -o bond3 -j SNAT --to-source 192.168.1.2
/sbin/iptables -t nat -I OUTPUT -d 192.168.0.1 -p tcp --dport 3001 -m state --state NEW -m statistic --mode nth --every 2 --packet 0 -j DNAT --to-destination 192.168.1.1

It works reasonably well. Although routing should be able to handle the source IP address, for reasons I haven't figured out yet, in some cases you may find connections to a 192.168.1.X IP with a 192.168.0.Y source IP. The two SNAT rules should handle these cases and straighten things out.

One final note: the network traffic is not evenly balanced, of course. The rules route half of the connections over one network link and the other half over the second link, but the actual traffic “balance” depends on how much traffic each connection carries. Anyway, in my setup this was enough to offload a good 30-40% of the traffic from the first link to the second one, which is enough.
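A tiny worked example of why alternating connections does not mean alternating bytes, as a Python sketch. The per-connection byte counts are invented for illustration and are not measurements from this setup. Which of the two links gets the first connection is also an assumption; only the alternation matters.

```python
def link_shares(connection_bytes):
    """Fraction of total traffic per link when new connections alternate
    between two links (every second connection is DNATed to link 1)."""
    totals = [0, 0]
    for index, nbytes in enumerate(connection_bytes):
        link = 1 if index % 2 == 0 else 0  # connections 0, 2, 4, ... are rewritten
        totals[link] += nbytes
    grand_total = sum(totals)
    return [total / grand_total for total in totals]


# Six hypothetical connections with unequal traffic volumes.
shares = link_shares([150, 400, 100, 150, 100, 100])
print(f"link 0: {shares[0]:.0%}, link 1: {shares[1]:.0%}")
```

Each link carries three connections here, yet the traffic is split unevenly because one connection is much busier than the others.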

I still think it would be a good idea to do this inside Aerospike. Being able to explicitly associate one namespace’s network traffic with a specific network link might also allow people to fine-tune their network usage.

Thanks, Giacomo


#11

Hey Giacomo,

Sorry for the late response. For some reason I didn’t get notified by the discussion forum about your follow-up messages.

I’m glad to hear that things work! That’s great news. Thanks for sharing.

I also have good news for you on another front: Aerospike 3.14.x, which is out now, does load-balancing across all available fabric addresses by default. This makes the iptables exercise unnecessary.

Thomas


#12

Thanks Thomas,

I’ll try 3.14 as soon as I can.

Giacomo

