Upgrade from 4.8.0.3 to 4.9.0.4: high rate of write timeouts

For a rolling upgrade, I added a 4.9.0.4 node (ip-10-1-14-204) to the 4.8.0.3 cluster. Even after partition migrations were done, the 4.9 node still had a very high rate of write timeouts. Eventually I removed it from the cluster and the cluster's performance returned to normal.
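
For reference, this is roughly how I verified migrations had completed before judging the node (a sketch; "test" is a placeholder for my namespace name):

asadm -e "info"
asinfo -v "namespace/test" -l | grep partitions_remaining    # migrate tx/rx counters should both be 0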

All the nodes use the same configuration.

I found logs like this on the 4.8 nodes:

Apr 20 2020 09:01:18 GMT: WARNING (rw): (replica_write.c:255) repl_write_handle_op: bad record
Apr 20 2020 09:01:18 GMT: WARNING (flat): (flat.c:183) unsupported storage fields
Apr 20 2020 09:01:18 GMT: WARNING (rw): (replica_write.c:255) repl_write_handle_op: bad record
Apr 20 2020 09:01:18 GMT: WARNING (flat): (flat.c:183) unsupported storage fields

and this on the 4.9 node:

Apr 20 2020 09:04:36 GMT: WARNING (rw): (replica_write.c:418) repl-write ack: no digest 
Apr 20 2020 09:04:36 GMT: WARNING (rw): (replica_write.c:418) repl-write ack: no digest
Apr 20 2020 09:04:36 GMT: WARNING (rw): (replica_write.c:418) repl-write ack: no digest
Apr 20 2020 09:04:36 GMT: WARNING (rw): (replica_write.c:418) repl-write ack: no digest

I didn’t find any notes in the documentation about upgrading from 4.8 to 4.9, but it seems the usual rolling upgrade is not safe here. I previously upgraded from 4.6 to 4.7 and from 4.7 to 4.8 without any issues.

Any help?

Could you share your aerospike.conf as well as the output of asadm -e "info"?
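
For example (assuming the default config path /etc/aerospike/aerospike.conf; adjust if yours differs):

cat /etc/aerospike/aerospike.conf
asadm -e "info"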

Are you running Aerospike Community or Enterprise?

Never mind, we have confirmed that this is a bug in Aerospike Community Edition. We are working on a hotfix.

By the way, this bug could have corrupted the data on that node, so you should probably wipe that node's disks and have it rejoin the 4.8 cluster. Migrations will repopulate the disks.
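
A minimal sketch of that procedure, assuming raw-device storage on a hypothetical device /dev/nvme0n1 (substitute your actual devices and your package manager's commands):

systemctl stop aerospike
blkdiscard /dev/nvme0n1    # or: dd if=/dev/zero of=/dev/nvme0n1 bs=1M
# reinstall the 4.8.0.3 server package, then:
systemctl start aerospike    # the node rejoins empty and migrations repopulate it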

Thank you for reporting this issue.

Yes, we are running Aerospike CE. Good to know you located the issue. Thanks for the reminder; the 4.9 node has been terminated.
