Record value in a set suddenly changed

I have a system that stores user balances in Aerospike. It runs on a 5-server cluster with version 3.15.0.2. It has been running since 2020 without any problems.

Suddenly, sometime between 30 January 15:00 and 31 January 07:00 (the exact time is unknown), one record changed from 9775265 to 9280. I made sure the application is not the cause, because there is no logged request for that record at all. My suspicion is the server shutdown at 00:10 on 31 January, which was caused by a datacenter power change.

My question is: what should I do to prevent this from happening again in case there is another shutdown in the future?

Here’s the server configuration:

# Aerospike database configuration file for use with systemd.

service {
        paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
        proto-fd-max 15000
#       transaction-pending-limit 50
        transaction-pending-limit 0
}

logging {
        file /var/log/aerospike/aerospike.log {
                context any info
        }
}

network {
        service {
                address any
                port 3000
        }

        heartbeat {
#               mode multicast
#               multicast-group 239.1.99.222
#               port 9918

                # To use unicast-mesh heartbeats, remove the 3 lines above, and see
                # aerospike_mesh.conf for alternative.
                mode mesh
                address 192.168.0.221
                port 3002 # Heartbeat port for this node.

                # List one or more other nodes, one ip-address & port per line:
                mesh-seed-address-port 192.168.0.221 3002
                mesh-seed-address-port 192.168.0.222 3002
                mesh-seed-address-port 192.168.0.223 3002
                mesh-seed-address-port 192.168.0.224 3002
                mesh-seed-address-port 192.168.0.232 3002
#               mesh-seed-address-port 10.10.10.13 3002
#               mesh-seed-address-port 10.10.10.14 3002

                interval 150
                timeout 40
        }

        fabric {
                port 3001
        }

        info {
                port 3003
        }
}

namespace test {
        replication-factor 2
        memory-size 4G
        default-ttl 30d # 30 days, use 0 to never expire/evict.

        storage-engine memory
}

namespace bar {
        replication-factor 2
        memory-size 4G
        default-ttl 30d # 30 days, use 0 to never expire/evict.

        storage-engine memory

        # To use file storage backing, comment out the line above and use the
        # following lines instead.
#       storage-engine device {
#               file /opt/aerospike/data/bar.dat
#               filesize 16G
#               data-in-memory true # Store data in memory in addition to file.
#       }
}

namespace billing_gateway_ns {
        replication-factor 2
        memory-size 28G
        default-ttl 0d # 0 means records never expire/evict.

#        storage-engine memory
        storage-engine device {
                device /dev/disk/by-id/ata-Samsung_SSD_860_EVO_1TB_S4FMNE0M801945V

                write-block-size 128K
        }

}

Thank you in advance. My English is not that good, sorry.

1 - The record that “changed” is in namespace billing_gateway_ns. Yes?

2 - If 1 is yes, are you deleting the record in the application? Yes?

3 - If 2 is yes, are you deleting it by setting expiration to a small value? or using delete api?

4 - I assume you are using Community Edition. Yes?

  1. Yes, it's in billing_gateway_ns.
  2. No, there was no create/delete/change activity between 30 Jan 15:00 and 31 Jan 07:00.
  4. Yes, we’re using Community Edition.

Regarding delete - not just between 30 jan 15.00 to 31 jan 07.00 - do you ever delete the records? If so, how? Aerospike will not randomly change your record. I am just wondering if any of the situations discussed in this KB apply to your situation.

I was not deleting any records at that time. I mean, the system does not even have a flow that deletes records. I have not accessed aql since 2020.

First, you are using a 6-year-old server version, and second, I am unable to understand from your application or usage details how you could possibly end up with the symptoms you are seeing.

I am not sure how to help you.

It is made to store user balance / credit; every transaction decreases the user's credit by a certain amount. Between 30 Jan 15:00 and 31 Jan 07:00 there was no usage at all except a cron job that reads and logs all remaining credit each day. The first log, on 30 Jan around 15:00, shows 9775265 for one record. The second log, on 31 Jan around 07:00, shows 9280 for the same record. There was no activity on this record during that time.

My suspicion is the way the server was shut down at 00:10 on 31 Jan. Is just running “halt -p” safe for Aerospike data?

The link I sent you shows how, in Community Edition, if you happen to cold start nodes, depending on how your application was written, you can bring an older version of a record back to life. Hence, it is important to understand what the application does, not just during the short window around your shutdown and restart event.
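
To check whether an older copy of a record came back after a restart, it can help if the daily cron logs each record's generation (the per-record write counter kept in the record metadata) next to the balance. The sketch below is only an illustration under that assumption: the snapshot layout and the function name findSuspectChanges are hypothetical, not part of your application or of the Aerospike client, and the generation numbers in the example are made up. It works on plain arrays so it can be run without a cluster.

```php
<?php
// Hypothetical snapshot layout (an assumption for illustration):
//   [ primaryKey => ['balance' => int, 'generation' => int], ... ]

/**
 * Compare two daily snapshots and return the records that look suspicious,
 * keyed by primary key, with a short reason for each.
 */
function findSuspectChanges(array $before, array $after): array
{
    $suspects = [];
    foreach ($before as $pk => $old) {
        // A record that was logged earlier but is gone later.
        if (!array_key_exists($pk, $after)) {
            $suspects[$pk] = 'missing in later snapshot';
            continue;
        }
        $new = $after[$pk];
        if ($new['generation'] < $old['generation']) {
            // Every write raises the generation, so a lower value is unexpected.
            $suspects[$pk] = 'generation went backwards';
        } elseif ($new['generation'] === $old['generation']
            && $new['balance'] !== $old['balance']) {
            // The value differs although no write was counted.
            $suspects[$pk] = 'balance changed without a generation change';
        }
    }
    return $suspects;
}

// Balances are the two logged values from this thread; generations are made up.
$jan30 = ['test_user_id' => ['balance' => 9775265, 'generation' => 120]];
$jan31 = ['test_user_id' => ['balance' => 9280, 'generation' => 7]];
print_r(findSuspectChanges($jan30, $jan31));
```

A hit from a check like this would be consistent with an older copy having been reloaded, but it is a hint to investigate, not proof.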

But the server does not corrupt records on shutdown.

We do have Strong Consistency mode in Enterprise Edition that is robust for such account type transactions.

Alright, this is how we handle writes to that set; we don't use any other command to alter the value. I hope this gives you a better understanding of our app. We are using PHP.

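// Connect to the cluster (hosts come from the application config).
$aeroDB = new Aerospike($CONF['aerospike_server']);
$name_space = $CONF['aerospike_namespace'];
$sets = "sets_cbp";
$pk_sets = "test_user_id";
// initKey() returns the key array; keep it for the increment() call below.
$key = $aeroDB->initKey($name_space, $sets, $pk_sets);
// Send the user key along with the write so it is stored with the record.
$options = [
    Aerospike::OPT_POLICY_KEY => Aerospike::POLICY_KEY_SEND
];
$formatAmountToFloat = -200.0;
// Decrease the balance bin by the transaction amount.
$aeroDB->increment($key, 'user_bal', $formatAmountToFloat, $options);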

OK, so you never delete or force expire the records and based on default-ttl = 0, your records never expire.

When you say:

one of record suddenly changed from 9775265 to 9280

What does 9775265 refer to? A value stored in a bin in the record?

If that is a bin value, I don’t know of any way that could have happened.