Upgrade to 3.10 seems to have broken our Cluster


#1

Hi, we have just upgraded to 3.10, and after restarting both servers they no longer talk to each other for some reason. Before ripping the network apart I wanted to check that the config changes from old to new are correct. This was our previous working configuration:

# Aerospike database configuration file.

service {
	user root
	group root
	paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
	pidfile /var/run/aerospike/asd.pid
	service-threads 4
	transaction-queues 4
	transaction-threads-per-queue 4
	proto-fd-max 15000
}

logging {
	# Log file must be an absolute path.
	file /var/log/aerospike/aerospike.log {
		context any info
	}
}

network {
        service {
                address any
                port 3000
                access-address 192.168.120.52
        }

        heartbeat {
                mode multicast
                address 239.1.99.51
                port 9918
                interface-address 192.168.60.52

                interval 150
                timeout 10
        }

        fabric {
                address any
                port 3001
        }

        info {
                address any
                port 3003
        }
}

namespace temp01 {
        replication-factor 2
        memory-size 8G
        default-ttl 0

        storage-engine device {
                device /dev/sdb1
                scheduler-mode noop
                write-block-size 128K
        }
}

namespace adspruce {
        replication-factor 2
        memory-size 98G
        default-ttl 0

        storage-engine device {
                device /dev/sdc1
                device /dev/sdd1
                device /dev/sde1
                device /dev/sdf1
                scheduler-mode noop
                write-block-size 128K
        }
}

and this is our new one. Across the two servers the configs are identical apart from the IP addresses, which is expected (right?). By this I mean both servers run the same config, just with the addresses ending in .51 and .52:

# Aerospike database configuration file.

service {
	user root
	group root
	paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
	pidfile /var/run/aerospike/asd.pid
	service-threads 4
	transaction-queues 4
	transaction-threads-per-queue 4
	proto-fd-max 15000
}

logging {
	# Log file must be an absolute path.
	file /var/log/aerospike/aerospike.log {
		context any info
	}
}

network {
        service {
                address any
                port 3000
                access-address 192.168.120.51
        }

        heartbeat {
                mode multicast
                multicast-group 239.1.99.51
                port 9918
                address 192.168.60.51

                interval 150
                timeout 10
        }

        fabric {
                address 192.168.60.51
                port 3001
        }

        info {
                address any
                port 3003
        }
}

namespace temp01 {
        replication-factor 2
        memory-size 8G
        default-ttl 86400		# 86400 = 24 hours

        storage-engine device {
                device /dev/sdb1
                scheduler-mode noop
                write-block-size 128K
        }
}

namespace adspruce {
        replication-factor 2
        memory-size 98G
        default-ttl 0			# 0 = no default TTL

        storage-engine device {
                device /dev/sdc1
                device /dev/sdd1
                device /dev/sde1
                device /dev/sdf1
                scheduler-mode noop
                write-block-size 128K
        }
}

Any thoughts on why this may be happening?


#2

@rbotzer - in response to twitter :slight_smile:


#3

Thanks for sharing your configuration. It does look correct. Are there any warnings being printed in the logs? Can you share ifconfig output?
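
Something along these lines should pull both, assuming the log path from the config you posted:

# show recent warnings from the Aerospike log
grep -i warning /var/log/aerospike/aerospike.log | tail -n 50

# interface details from both nodes
ifconfig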


#4

Hey @anushree - the DevOps guy says there is nothing in the warning logs, and below is the relevant ifconfig printout for you:

bond0: flags=5187<UP,BROADCAST,RUNNING,MASTER,MULTICAST>  mtu 1500
        inet 192.168.120.51  netmask 255.255.255.0  broadcast 192.168.120.255
        ether 14:18:77:43:a8:3b  txqueuelen 0  (Ethernet)
        RX packets 43371700  bytes 4476689094 (4.1 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 21564909  bytes 12813459366 (11.9 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

bond60: flags=5187<UP,BROADCAST,RUNNING,MASTER,MULTICAST>  mtu 1500
        inet 192.168.60.51  netmask 255.255.255.0  broadcast 192.168.60.255
        ether 14:18:77:43:a8:3d  txqueuelen 0  (Ethernet)
        RX packets 453672  bytes 158835666 (151.4 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 5323  bytes 903664 (882.4 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

Many thanks for looking into this…


#5

Okay - all sorted. Not sure what the exact issue was, but restarting the whole cluster once more seems to have resolved it - the old “turn it off and back on again”.

Weird
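
For reference, the restart was nothing more exotic than bouncing the daemon, i.e. something like:

# restart the Aerospike daemon (or: sudo systemctl restart aerospike, depending on init system)
sudo service aerospike restart

# watch the log while the node rejoins
tail -f /var/log/aerospike/aerospike.log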


#6

Interesting… Thanks for the confirmation. So it does seem to be something around the network, as I assume you did not change the configuration when you did the asd restart?
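
If it does happen again, it might be worth confirming each node has actually joined the heartbeat multicast group on the right interface - something like the following, with the interface and group taken from your config (bond60 / 239.1.99.51):

# list multicast group memberships on the heartbeat interface
ip maddr show dev bond60

239.1.99.51 should show up in the inet entries on both nodes.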


#7

Would still like to get your logs to see if we can spot something. Thanks.


#8

@anushree - nope, we didn’t change anything, just restarted the first node in the cluster. We had presumed the issue was with the second node, as the first came up fine, but nothing we did to it made a difference - once we restarted the first node again, all was well.

@wchu - I’ll see if I can get the DevOps guys to pull them together.
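
In case it helps them, something as simple as this should be enough to bundle the logs up (path from the config above):

# collect the Aerospike logs from a node
tar czf aerospike-logs-$(hostname).tar.gz /var/log/aerospike/aerospike.log*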