One node (of 6) has an integrity problem after a crash and reboot, and will not recover

We have one node (running aerospike-amc-community-3.6.13) that had been functioning for years with no problems. The machine rebooted because of a power failure, but the node is unable to rejoin the cluster. We tried several reboots, but the node cannot get back in sync.

Here are some logs:

Jan 16 2023 21:34:55 GMT: INFO (paxos): (paxos.c:1487) sending sync message to bb948d83bcb2b78
Jan 16 2023 21:34:55 GMT: INFO (paxos): (paxos.c:1487) sending sync message to bb9070a6a52aed4
Jan 16 2023 21:34:55 GMT: INFO (paxos): (paxos.c:1496) SUCCESSION [1673904895]@bb9fbe66e52aed4: bb9fbe66e52aed4 bb9dae56e52aed4 bb9d8f36e52aed4 bb97e3d6752aed4 bb948d83bcb2b78 bb9070a6a52aed4 
Jan 16 2023 21:34:55 GMT: INFO (paxos): (paxos.c:3131) Received paxos partition sync request from bb9d8f36e52aed4
Jan 16 2023 21:34:55 GMT: INFO (paxos): (paxos.c:3131) Received paxos partition sync request from bb9070a6a52aed4
Jan 16 2023 21:34:55 GMT: INFO (paxos): (paxos.c:3131) Received paxos partition sync request from bb97e3d6752aed4
Jan 16 2023 21:34:55 GMT: INFO (paxos): (paxos.c:3131) Received paxos partition sync request from bb948d83bcb2b78
Jan 16 2023 21:34:56 GMT: INFO (paxos): (paxos.c:2541) Cluster Integrity Check: Detected succession list discrepancy between node bb9dae56e52aed4 and self bb9fbe66e52aed4
Jan 16 2023 21:34:56 GMT: INFO (paxos): (paxos.c:268) Paxos List [bb9fbe66e52aed4,bb9dae56e52aed4,bb9d8f36e52aed4,bb97e3d6752aed4,bb948d83bcb2b78,bb9070a6a52aed4]
Jan 16 2023 21:34:56 GMT: INFO (paxos): (paxos.c:268) Node List [bb9dae56e52aed4,bb9d8f36e52aed4,bb97e3d6752aed4,bb948d83bcb2b78,bb9070a6a52aed4]
Jan 16 2023 21:34:57 GMT: INFO (paxos): (paxos.c:2541) Cluster Integrity Check: Detected succession list discrepancy between node bb9dae56e52aed4 and self bb9fbe66e52aed4
Jan 16 2023 21:34:57 GMT: INFO (paxos): (paxos.c:268) Paxos List [bb9fbe66e52aed4,bb9dae56e52aed4,bb9d8f36e52aed4,bb97e3d6752aed4,bb948d83bcb2b78,bb9070a6a52aed4]
Jan 16 2023 21:34:57 GMT: INFO (paxos): (paxos.c:268) Node List [bb9dae56e52aed4,bb9d8f36e52aed4,bb97e3d6752aed4,bb948d83bcb2b78,bb9070a6a52aed4]
Jan 16 2023 21:34:57 GMT: INFO (paxos): (paxos.c:2404) Corrective changes: 0. Integrity fault: true
Jan 16 2023 21:34:57 GMT: INFO (paxos): (paxos.c:2421) Paxos round running. Skipping succession list fix.
Jan 16 2023 21:34:58 GMT: INFO (paxos): (paxos.c:2541) Cluster Integrity Check: Detected succession list discrepancy between node bb9dae56e52aed4 and self bb9fbe66e52aed4
Jan 16 2023 21:34:58 GMT: INFO (paxos): (paxos.c:268) Paxos List [bb9fbe66e52aed4,bb9dae56e52aed4,bb9d8f36e52aed4,bb97e3d6752aed4,bb948d83bcb2b78,bb9070a6a52aed4]
Jan 16 2023 21:34:58 GMT: INFO (paxos): (paxos.c:268) Node List [bb9dae56e52aed4,bb9d8f36e52aed4,bb97e3d6752aed4,bb948d83bcb2b78,bb9070a6a52aed4]
Jan 16 2023 21:34:59 GMT: INFO (paxos): (paxos.c:2541) Cluster Integrity Check: Detected succession list discrepancy between node bb9dae56e52aed4 and self bb9fbe66e52aed4
Jan 16 2023 21:34:59 GMT: INFO (paxos): (paxos.c:268) Paxos List [bb9fbe66e52aed4,bb9dae56e52aed4,bb9d8f36e52aed4,bb97e3d6752aed4,bb948d83bcb2b78,bb9070a6a52aed4]
Jan 16 2023 21:34:59 GMT: INFO (paxos): (paxos.c:268) Node List [bb9dae56e52aed4,bb9d8f36e52aed4,bb97e3d6752aed4,bb948d83bcb2b78,bb9070a6a52aed4]
Jan 16 2023 21:34:59 GMT: INFO (paxos): (paxos.c:2404) Corrective changes: 0. Integrity fault: true
Jan 16 2023 21:34:59 GMT: INFO (paxos): (paxos.c:2421) Paxos round running. Skipping succession list fix.
Jan 16 2023 21:35:00 GMT: INFO (drv_ssd): (drv_ssd.c:2115) {buck_conversion} /dev/nvme0n1p4: used-bytes 0 free-wblocks 171891 write-q 0 write (0,0.0) defrag-q 0 defrag-read (1,0.0) defrag-write (0,0.0)
Jan 16 2023 21:35:00 GMT: INFO (paxos): (paxos.c:2541) Cluster Integrity Check: Detected succession list discrepancy between node bb9dae56e52aed4 and self bb9fbe66e52aed4
Jan 16 2023 21:35:00 GMT: INFO (paxos): (paxos.c:268) Paxos List [bb9fbe66e52aed4,bb9dae56e52aed4,bb9d8f36e52aed4,bb97e3d6752aed4,bb948d83bcb2b78,bb9070a6a52aed4]
Jan 16 2023 21:35:00 GMT: INFO (paxos): (paxos.c:268) Node List [bb9dae56e52aed4,bb9d8f36e52aed4,bb97e3d6752aed4,bb948d83bcb2b78,bb9070a6a52aed4] 

Connectivity between the nodes is working fine, and the machine is able to reach all the other nodes.

We also get this WARNING:

Jan 16 2023 21:52:56 GMT: WARNING (hb): (hb.c:8128) Ignoring adding self 10.200.5.106:3002 as mesh seed 
Jan 16 2023 21:52:56 GMT: INFO (hb): (hb.c:1852) Duplicate mesh seed node from config 10.200.5.106:3002
Jan 16 2023 21:52:56 GMT: INFO (fabric): (fabric.c:1623) Updated fabric published address list to {10.200.5.106:3001}
Jan 16 2023 21:52:56 GMT: INFO (paxos): (paxos.c:153) cluster_key set to 0xf3e8f3f3bcca751c
Jan 16 2023 21:52:56 GMT: INFO (partition): (partition_balance.c:235) {buck_ingester} 4096 partitions: found 2721 absent, 1375 stored
Jan 16 2023 21:52:56 GMT: INFO (partition): (partition_balance.c:235) {buck_banker} 4096 partitions: found 2721 absent, 1375 stored
Jan 16 2023 21:52:56 GMT: INFO (partition): (partition_balance.c:235) {buck_conversion} 4096 partitions: found 2721 absent, 1375 stored
Jan 16 2023 21:52:56 GMT: INFO (paxos): (paxos.c:3604) Paxos service ignited: bb9fbe66e52aed4
Jan 16 2023 21:52:57 GMT: INFO (batch): (batch.c:588) Initialize batch-index-threads to 4

Any ideas?


What version of aerospike-server is this? The AMC version isn’t the same thing. You can check with:

asadm -e info

Since paxos.c appears in the logs, I believe the server is prior to 3.14 (or possibly 3.13 pre-‘jump’). That is well outside our support window, and the code base is so different from today’s that it would be very difficult to find people in the community who could assist.

The software has changed a lot in the last ~8 years — much effort has gone into resilience. That said, a huge stride toward resilience was taken with the changes in 3.13, once the user has followed the protocol-jump procedures.

The warning suggests another node is advertising the same IP as the node logging the warning. You may be able to identify which one by running:

asadm -e "asinfo -v service"

It may also mean that you have configured the node to send heartbeats to itself. I’m not sure whether that used to cause problems or not.
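
For illustration only — this is a hypothetical heartbeat stanza, not your actual config, with the IP and port taken from the warning in your logs — a node seeding itself would look like this:

```
# Hypothetical mesh heartbeat stanza. The second seed entry points the
# node at its own address (10.200.5.106:3002), which is what triggers
# the "Ignoring adding self ... as mesh seed" warning. Removing that
# line should silence the warning; the remaining seeds are sufficient
# for the node to discover the rest of the cluster.
heartbeat {
        mode mesh
        port 3002
        mesh-seed-address-port 10.200.5.101 3002
        mesh-seed-address-port 10.200.5.106 3002   # self-reference — can be removed
        interval 150
        timeout 10
}
```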

Hi Kevin

Thanks for getting back to me!

I ran the command on the server that is currently out of the cluster and got this:

asadm -e "asinfo -v service"

ingester-36:3000 (10.200.5.106) returned: 10.200.5.106:3000

If I run the same command on another server, I get this:

asadm -e "asinfo -v service"

ingester-01:3000 (10.200.5.101) returned: 10.200.5.101:3000

10.200.5.103:3000 (10.200.5.103) returned: 10.200.5.103:3000

10.200.5.105:3000 (10.200.5.105) returned: 10.200.5.105:3000

10.200.5.102:3000 (10.200.5.102) returned: 10.200.5.102:3000

10.200.5.104:3000 (10.200.5.104) returned: 10.200.5.104:3000

This is the configuration of the server that doesn’t work today:

network {
        service {
                address 10.200.5.106
                port 3000
                access-address 10.200.5.106
        }

    heartbeat {
            mode mesh
            port 3002
            mesh-seed-address-port 10.200.5.101 3002
            mesh-seed-address-port 10.200.5.102 3002
            mesh-seed-address-port 10.200.5.103 3002
            mesh-seed-address-port 10.200.5.104 3002
            mesh-seed-address-port 10.200.5.105 3002
            mesh-seed-address-port 10.200.5.106 3002
            interval 150
            timeout 10
    }

    fabric {
            address 10.200.5.106
            port 3001
    }

    info {
            address 10.200.5.106
            port 3003
    }

}

Thanks

MB