One node (of 6) has an integrity problem after a crash and reboot, and will not recover

We have one node (running aerospike-amc-community-3.6.13) that had been functioning for years with no problems. The machine rebooted because of a power failure, but the node is unable to rejoin the cluster. We tried several reboots, but the node cannot get back in sync.

Here are some logs:

Jan 16 2023 21:34:55 GMT: INFO (paxos): (paxos.c:1487) sending sync message to bb948d83bcb2b78
Jan 16 2023 21:34:55 GMT: INFO (paxos): (paxos.c:1487) sending sync message to bb9070a6a52aed4
Jan 16 2023 21:34:55 GMT: INFO (paxos): (paxos.c:1496) SUCCESSION [1673904895]@bb9fbe66e52aed4: bb9fbe66e52aed4 bb9dae56e52aed4 bb9d8f36e52aed4 bb97e3d6752aed4 bb948d83bcb2b78 bb9070a6a52aed4 
Jan 16 2023 21:34:55 GMT: INFO (paxos): (paxos.c:3131) Received paxos partition sync request from bb9d8f36e52aed4
Jan 16 2023 21:34:55 GMT: INFO (paxos): (paxos.c:3131) Received paxos partition sync request from bb9070a6a52aed4
Jan 16 2023 21:34:55 GMT: INFO (paxos): (paxos.c:3131) Received paxos partition sync request from bb97e3d6752aed4
Jan 16 2023 21:34:55 GMT: INFO (paxos): (paxos.c:3131) Received paxos partition sync request from bb948d83bcb2b78
Jan 16 2023 21:34:56 GMT: INFO (paxos): (paxos.c:2541) Cluster Integrity Check: Detected succession list discrepancy between node bb9dae56e52aed4 and self bb9fbe66e52aed4
Jan 16 2023 21:34:56 GMT: INFO (paxos): (paxos.c:268) Paxos List [bb9fbe66e52aed4,bb9dae56e52aed4,bb9d8f36e52aed4,bb97e3d6752aed4,bb948d83bcb2b78,bb9070a6a52aed4]
Jan 16 2023 21:34:56 GMT: INFO (paxos): (paxos.c:268) Node List [bb9dae56e52aed4,bb9d8f36e52aed4,bb97e3d6752aed4,bb948d83bcb2b78,bb9070a6a52aed4]
Jan 16 2023 21:34:57 GMT: INFO (paxos): (paxos.c:2541) Cluster Integrity Check: Detected succession list discrepancy between node bb9dae56e52aed4 and self bb9fbe66e52aed4
Jan 16 2023 21:34:57 GMT: INFO (paxos): (paxos.c:268) Paxos List [bb9fbe66e52aed4,bb9dae56e52aed4,bb9d8f36e52aed4,bb97e3d6752aed4,bb948d83bcb2b78,bb9070a6a52aed4]
Jan 16 2023 21:34:57 GMT: INFO (paxos): (paxos.c:268) Node List [bb9dae56e52aed4,bb9d8f36e52aed4,bb97e3d6752aed4,bb948d83bcb2b78,bb9070a6a52aed4]
Jan 16 2023 21:34:57 GMT: INFO (paxos): (paxos.c:2404) Corrective changes: 0. Integrity fault: true
Jan 16 2023 21:34:57 GMT: INFO (paxos): (paxos.c:2421) Paxos round running. Skipping succession list fix.
Jan 16 2023 21:34:58 GMT: INFO (paxos): (paxos.c:2541) Cluster Integrity Check: Detected succession list discrepancy between node bb9dae56e52aed4 and self bb9fbe66e52aed4
Jan 16 2023 21:34:58 GMT: INFO (paxos): (paxos.c:268) Paxos List [bb9fbe66e52aed4,bb9dae56e52aed4,bb9d8f36e52aed4,bb97e3d6752aed4,bb948d83bcb2b78,bb9070a6a52aed4]
Jan 16 2023 21:34:58 GMT: INFO (paxos): (paxos.c:268) Node List [bb9dae56e52aed4,bb9d8f36e52aed4,bb97e3d6752aed4,bb948d83bcb2b78,bb9070a6a52aed4]
Jan 16 2023 21:34:59 GMT: INFO (paxos): (paxos.c:2541) Cluster Integrity Check: Detected succession list discrepancy between node bb9dae56e52aed4 and self bb9fbe66e52aed4
Jan 16 2023 21:34:59 GMT: INFO (paxos): (paxos.c:268) Paxos List [bb9fbe66e52aed4,bb9dae56e52aed4,bb9d8f36e52aed4,bb97e3d6752aed4,bb948d83bcb2b78,bb9070a6a52aed4]
Jan 16 2023 21:34:59 GMT: INFO (paxos): (paxos.c:268) Node List [bb9dae56e52aed4,bb9d8f36e52aed4,bb97e3d6752aed4,bb948d83bcb2b78,bb9070a6a52aed4]
Jan 16 2023 21:34:59 GMT: INFO (paxos): (paxos.c:2404) Corrective changes: 0. Integrity fault: true
Jan 16 2023 21:34:59 GMT: INFO (paxos): (paxos.c:2421) Paxos round running. Skipping succession list fix.
Jan 16 2023 21:35:00 GMT: INFO (drv_ssd): (drv_ssd.c:2115) {buck_conversion} /dev/nvme0n1p4: used-bytes 0 free-wblocks 171891 write-q 0 write (0,0.0) defrag-q 0 defrag-read (1,0.0) defrag-write (0,0.0)
Jan 16 2023 21:35:00 GMT: INFO (paxos): (paxos.c:2541) Cluster Integrity Check: Detected succession list discrepancy between node bb9dae56e52aed4 and self bb9fbe66e52aed4
Jan 16 2023 21:35:00 GMT: INFO (paxos): (paxos.c:268) Paxos List [bb9fbe66e52aed4,bb9dae56e52aed4,bb9d8f36e52aed4,bb97e3d6752aed4,bb948d83bcb2b78,bb9070a6a52aed4]
Jan 16 2023 21:35:00 GMT: INFO (paxos): (paxos.c:268) Node List [bb9dae56e52aed4,bb9d8f36e52aed4,bb97e3d6752aed4,bb948d83bcb2b78,bb9070a6a52aed4] 

Connectivity between the nodes is working fine, and the machine is able to reach all the other nodes.

We also get this WARNING:

Jan 16 2023 21:52:56 GMT: WARNING (hb): (hb.c:8128) Ignoring adding self 10.200.5.106:3002 as mesh seed 
Jan 16 2023 21:52:56 GMT: INFO (hb): (hb.c:1852) Duplicate mesh seed node from config 10.200.5.106:3002
Jan 16 2023 21:52:56 GMT: INFO (fabric): (fabric.c:1623) Updated fabric published address list to {10.200.5.106:3001}
Jan 16 2023 21:52:56 GMT: INFO (paxos): (paxos.c:153) cluster_key set to 0xf3e8f3f3bcca751c
Jan 16 2023 21:52:56 GMT: INFO (partition): (partition_balance.c:235) {buck_ingester} 4096 partitions: found 2721 absent, 1375 stored
Jan 16 2023 21:52:56 GMT: INFO (partition): (partition_balance.c:235) {buck_banker} 4096 partitions: found 2721 absent, 1375 stored
Jan 16 2023 21:52:56 GMT: INFO (partition): (partition_balance.c:235) {buck_conversion} 4096 partitions: found 2721 absent, 1375 stored
Jan 16 2023 21:52:56 GMT: INFO (paxos): (paxos.c:3604) Paxos service ignited: bb9fbe66e52aed4
Jan 16 2023 21:52:57 GMT: INFO (batch): (batch.c:588) Initialize batch-index-threads to 4

Any ideas?


What version of aerospike-server is this? The AMC version isn’t the same thing. You can check with:

asadm -e info

Since paxos.c appears in the logs, I believe the server is prior to 3.14 (or possibly 3.13 pre-‘jump’). That is well outside our support window, and the code base is so different from today’s that it would be very difficult to find people in the community who could assist.

The software has changed a lot in the last ~8 years — much effort has gone into resilience. That said, a huge stride toward resilience was taken with the changes in 3.13, once the user has followed the protocol-jump procedures.

The warning suggests another node is advertising the same IP as the node logging the warning. You may be able to identify which one by running:

asadm -e "asinfo -v service"

It may also mean that you have configured the node to send heartbeats to itself. I’m not sure whether that used to cause problems or not.
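
For illustration only — this is a hypothetical heartbeat stanza, not your actual config, with the IP and port taken from the warning in your logs — a node seeding itself would look like this:

```
# Hypothetical mesh heartbeat stanza. The second seed entry points the
# node at its own address (10.200.5.106:3002), which is what triggers
# the "Ignoring adding self ... as mesh seed" warning. Removing that
# line should silence the warning; the remaining seeds are sufficient
# for the node to discover the rest of the cluster.
heartbeat {
        mode mesh
        port 3002
        mesh-seed-address-port 10.200.5.101 3002
        mesh-seed-address-port 10.200.5.106 3002   # self-reference — can be removed
        interval 150
        timeout 10
}
```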

Hi Kevin

Thanks for getting back to me!

I ran the command on the server that is currently out of the cluster and got this:

asadm -e "asinfo -v service"

ingester-36:3000 (10.200.5.106) returned: 10.200.5.106:3000

If I run the same command on another server, I get this:

asadm -e "asinfo -v service"

ingester-01:3000 (10.200.5.101) returned: 10.200.5.101:3000

10.200.5.103:3000 (10.200.5.103) returned: 10.200.5.103:3000

10.200.5.105:3000 (10.200.5.105) returned: 10.200.5.105:3000

10.200.5.102:3000 (10.200.5.102) returned: 10.200.5.102:3000

10.200.5.104:3000 (10.200.5.104) returned: 10.200.5.104:3000

This is the configuration of the server that doesn’t work today:

network {
        service {
                address 10.200.5.106
                port 3000
                access-address 10.200.5.106
        }

    heartbeat {
            mode mesh
            port 3002
            mesh-seed-address-port 10.200.5.101 3002
            mesh-seed-address-port 10.200.5.102 3002
            mesh-seed-address-port 10.200.5.103 3002
            mesh-seed-address-port 10.200.5.104 3002
            mesh-seed-address-port 10.200.5.105 3002
            mesh-seed-address-port 10.200.5.106 3002
            interval 150
            timeout 10
    }

    fabric {
            address 10.200.5.106
            port 3001
    }

    info {
            address 10.200.5.106
            port 3003
    }

}

Thanks

MB