Upgrade to 3.10 seems to have broken our Cluster

Hi, we have just upgraded to 3.10, and after restarting both servers they for some reason no longer talk to each other. Before ripping the network apart I wanted to check that the config changes from old to new are correct. This was our previous working configuration:

# Aerospike database configuration file.

service {
	user root
	group root
	paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
	pidfile /var/run/aerospike/asd.pid
	service-threads 4
	transaction-queues 4
	transaction-threads-per-queue 4
	proto-fd-max 15000
}

logging {
	# Log file must be an absolute path.
	file /var/log/aerospike/aerospike.log {
		context any info
	}
}

network {
        service {
                address any
                port 3000
                access-address 192.168.120.52
        }

        heartbeat {
                mode multicast
                address 239.1.99.51
                port 9918
                interface-address 192.168.60.52

                interval 150
                timeout 10
        }

        fabric {
                address any
                port 3001
        }

        info {
                address any
                port 3003
        }
}

namespace temp01 {
        replication-factor 2
        memory-size 8G
        default-ttl 0

        storage-engine device {
                device /dev/sdb1
                scheduler-mode noop
                write-block-size 128K
        }
}

namespace adspruce {
        replication-factor 2
        memory-size 98G
        default-ttl 0

        storage-engine device {
                device /dev/sdc1
                device /dev/sdd1
                device /dev/sde1
                device /dev/sdf1
                scheduler-mode noop
                write-block-size 128K
        }
}

and this is our new one. The configs on the two servers are identical apart from the IP addresses, which would be expected (right?). By this I mean we have two servers both running the same config, but with the IPs being .51 and .52 (the heartbeat renames we applied for 3.10 are summarised after the config):

# Aerospike database configuration file.

service {
	user root
	group root
	paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
	pidfile /var/run/aerospike/asd.pid
	service-threads 4
	transaction-queues 4
	transaction-threads-per-queue 4
	proto-fd-max 15000
}

logging {
	# Log file must be an absolute path.
	file /var/log/aerospike/aerospike.log {
		context any info
	}
}

network {
        service {
                address any
                port 3000
                access-address 192.168.120.51
        }

        heartbeat {
                mode multicast
                multicast-group 239.1.99.51
                port 9918
                address 192.168.60.51

                interval 150
                timeout 10
        }

        fabric {
                address 192.168.60.51
                port 3001
        }

        info {
                address any
                port 3003
        }
}

namespace temp01 {
        replication-factor 2
        memory-size 8G
        default-ttl 86400		# 86400 = 24 hours

        storage-engine device {
                device /dev/sdb1
                scheduler-mode noop
                write-block-size 128K
        }
}

namespace adspruce {
        replication-factor 2
        memory-size 98G
        default-ttl 0			# 0 = no default TTL

        storage-engine device {
                device /dev/sdc1
                device /dev/sdd1
                device /dev/sde1
                device /dev/sdf1
                scheduler-mode noop
                write-block-size 128K
        }
}
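
For reference, the only heartbeat changes we made for 3.10 (going by the upgrade notes, as we understand them) are renames: the multicast group moves from 'address' to 'multicast-group', and the bind address moves from 'interface-address' to 'address'. We also pinned fabric to the .60 interface instead of 'any', and set a 24-hour default-ttl on temp01 at the same time:

        # pre-3.10 heartbeat stanza (.51 node shown)
        heartbeat {
                mode multicast
                address 239.1.99.51              # multicast group
                interface-address 192.168.60.51  # bind address
                port 9918
                interval 150
                timeout 10
        }

        # 3.10 equivalent
        heartbeat {
                mode multicast
                multicast-group 239.1.99.51      # was 'address'
                address 192.168.60.51            # was 'interface-address'
                port 9918
                interval 150
                timeout 10
        }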

Any thoughts on why this may be happening?

@rbotzer - in response to Twitter :slight_smile:

Thanks for sharing your configuration. It does look correct. Are there any warnings being printed in the logs? Can you share ifconfig output?
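
If it helps, something like this on each node should surface any heartbeat warnings and show what cluster size each node believes it is in (log path taken from your config):

        grep -iE 'warning|heartbeat' /var/log/aerospike/aerospike.log | tail -n 50
        asadm -e "info network"

With a healthy two-node cluster, the asadm "info network" output should report a cluster size of 2 on both nodes.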

Hey @anushree - no warnings in the logs according to the DevOps guy, and below is the relevant ifconfig printout for you:

bond0: flags=5187<UP,BROADCAST,RUNNING,MASTER,MULTICAST>  mtu 1500
        inet 192.168.120.51  netmask 255.255.255.0  broadcast 192.168.120.255
        ether 14:18:77:43:a8:3b  txqueuelen 0  (Ethernet)
        RX packets 43371700  bytes 4476689094 (4.1 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 21564909  bytes 12813459366 (11.9 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

bond60: flags=5187<UP,BROADCAST,RUNNING,MASTER,MULTICAST>  mtu 1500
        inet 192.168.60.51  netmask 255.255.255.0  broadcast 192.168.60.255
        ether 14:18:77:43:a8:3d  txqueuelen 0  (Ethernet)
        RX packets 453672  bytes 158835666 (151.4 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 5323  bytes 903664 (882.4 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

Many thanks for looking into this…

Okay - all sorted. Not sure what the exact issue was, but restarting the whole cluster once more seems to have resolved it - the old “turn it off and on again”.

Weird

Interesting… thanks for the confirmation. So it does seem to be something around the network, as I assume you did not change the configuration when you did the asd restart?

Would still like to get your logs to see if we can spot something. Thanks.
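
In the meantime, one thing worth checking on both nodes is that multicast heartbeats are actually arriving on the .60 network, for example (interface name taken from your ifconfig output):

        # watch for heartbeat traffic to the multicast group on the heartbeat interface
        sudo tcpdump -i bond60 -n host 239.1.99.51 and udp port 9918

        # confirm asd has joined the multicast group on that interface
        ip maddr show dev bond60

If tcpdump shows packets leaving one node but never arriving on the other, the problem is likely in the network (IGMP snooping on the switch is a common culprit) rather than in the Aerospike configuration.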

@anushree - nope, we didn’t change anything, just restarted the first node in the cluster. We presumed the issue was with the second node, as the first came up fine, but nothing we did to it helped; once we restarted the first node again, all was well.

@wchu - I’ll see if I can get the DevOps guys to pull the logs together.