We are finding AEROSPIKE_ERR_RECORD_NOT_FOUND & AEROSPIKE_GET_FAILED in our C application

We are facing an issue in our Aerospike cluster.

Infrastructure details:

Aerospike version: V4.5.2.2
C client binaries: V4.6.16

Issue description, scenario 1:

A C application inserts a record into Aerospike with a TTL of 14 hours. The insert succeeds, which we confirm by logging the return status of the library function. The C client library function used by the application is “aerospike_key_put()”, called like “if (aerospike_key_put(as, &err, NULL, &g_key, &rec) != AEROSPIKE_OK)”.

Another C application then looks up this record in order to process it and re-insert it, deleting the old key and inserting the record under a new key with a TTL of 14 hours. The lookup happens a few milliseconds after the insert, but the record is not found in Aerospike. The C client library function used by this application is “aerospike_key_get()”, called like “if (aerospike_key_get(p_as, &err, NULL, g_key, &p_rec) != AEROSPIKE_OK)”.

This fetch failure is intermittent. It affects roughly 1-10% of the inserts and gets performed by the C applications, not all records.

Issue description, scenario 2:

A C application inserts a record into Aerospike with a TTL of 14 hours. The insert succeeds, which we confirm by logging the return status of the library function. The C client library function used by the application is “aerospike_key_put()”, called like “if (aerospike_key_put(as, &err, NULL, &g_key, &rec) != AEROSPIKE_OK)”.

Another C application fetches the record and re-inserts it under the changed key with a TTL of 14 hours, as described in the second step of scenario 1. The C client library function used by this application is “aerospike_key_get()”, called like “if (aerospike_key_get(p_as, &err, NULL, g_key, &p_rec) != AEROSPIKE_OK)”.

After 12 hours, another C application looks up the record in Aerospike to fetch its details, but the record is not found, so the message cannot be processed further. The C client library function used by this application is “aerospike_key_get()”, called like “if (aerospike_key_get(p_as, &err, NULL, g_key, &p_rec) != AEROSPIKE_OK)”.

This fetch failure is intermittent. It affects roughly 1-10% of the inserts and gets performed by the C applications, not all records.

Configuration File:

service {
        paxos-single-replica-limit 1
        proto-fd-max 90000
        service-threads 4
        transaction-queues 4
        transaction-threads-per-queue 4
        query-in-transaction-thread true
}

logging {
        file /var/log/aerospike/aerospike.log {
                context any info
        }

        console {
                context any warning
        }
}

network {
        service {
                address any
                port 3000
        }

        heartbeat {
                mode mesh
                #multicast-group 239.1.99.222
                port 3002
                mesh-seed-address-port IP-address_1 3002
                mesh-seed-address-port IP-address_2 3002
                mesh-seed-address-port IP-address_3 3002
                mesh-seed-address-port IP-address_4 3002

                interval 150
                timeout 10
        }

        fabric {
                port 3001
        }

        info {
                port 3003
        }
}

namespace dup {
        replication-factor 2
        memory-size 10G
        default-ttl 20M # 30 days, use 0 to never expire/evict.
        data-in-index true
        single-bin true
        write-commit-level-override=all
        read-consistency-level-override=all
        storage-engine device {
                file /u01/aerospike/data/prod_dup.dat
                filesize 40G
                data-in-memory true
        }
}

namespace tran {
        replication-factor 2
        memory-size 110G
        default-ttl 18H # 30 days, use 0 to never expire/evict.
        write-commit-level-override=all
        read-consistency-level-override=all
        storage-engine memory
}
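
As a reading aid for the default-ttl values above: the inline comments say "30 days", but 20M and 18H read as 20 minutes and 18 hours. The following is a minimal sketch of that conversion, assuming the usual S/M/H/D duration suffixes. The function name ttl_config_to_seconds is hypothetical and is not part of the Aerospike C client or server; accepting lowercase suffixes is a convenience of this sketch.

```c
#include <ctype.h>
#include <stdlib.h>

/* Hypothetical helper (not an Aerospike API): converts a duration such as
 * "20M", "18H", "30D" or a plain number of seconds such as "1200" into
 * seconds. Returns -1 if the text is malformed. */
long ttl_config_to_seconds(const char *text)
{
    char *end = NULL;
    long value = strtol(text, &end, 10);
    long multiplier;

    /* No digits at all, or a negative number, is treated as malformed. */
    if (end == text || value < 0) {
        return -1;
    }

    switch (toupper((unsigned char)*end)) {
    case '\0': return value;          /* no suffix: already in seconds */
    case 'S':  multiplier = 1;     break;
    case 'M':  multiplier = 60;    break;   /* minutes */
    case 'H':  multiplier = 3600;  break;   /* hours */
    case 'D':  multiplier = 86400; break;   /* days */
    default:   return -1;             /* unknown suffix */
    }

    /* Reject trailing characters after the suffix, e.g. "18HH". */
    if (end[1] != '\0') {
        return -1;
    }
    return value * multiplier;
}
```

With the values from this thread, "20M" comes to 1200 seconds, "18H" to 64800 seconds, and the application's 14-hour TTL ("14H") to 50400 seconds, which is below the 18-hour default of the tran namespace.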

Unrelated to these issues, you have two logging contexts defined, which could lead to undefined behavior.

It seems the file defined in the first logging context is redefined in the second, so the first context should be removed.

Could you provide the output of:

asadm -e info

Hi, sorry, that was just a copy-paste issue. The actual logging configuration is as below:

logging {
  file /var/log/aerospike/aerospike.log {
   context any info
  }

  console {
    context any warning
  }
}

If there is any issue with this, please let me know.

The “tran” namespace has evicted records in the past. Eviction is essentially early expiration that kicks in when the high-water-mark (HWM) thresholds are breached (see the link below for details). I believe this would explain the behavior you are seeing.

You can configure a set to not be evictable; details can be found here: https://www.aerospike.com/docs/operations/configure/namespace/retention/index.html#set-disable-eviction.
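
As a rough illustration of what that looks like in the configuration file, the fragment below adds a set stanza to the tran namespace from this thread. The set name myset is a placeholder (the thread does not name the set in use), and the other namespace settings are left exactly as posted.

```
namespace tran {
        # ... existing settings unchanged ...

        # "myset" is a hypothetical set name; substitute the set your
        # application actually writes to.
        set myset {
                set-disable-eviction true
        }
}
```

Records in a set configured this way still expire when their TTL runs out; only early eviction is disabled for them.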

Hi,

We have made the necessary changes you recommended and will let you know if we see any improvement.

Thanks for your help.

However, the changes you recommended may explain scenario 2 but not scenario 1: sometimes we are unable to fetch a record within a few milliseconds of inserting it, even though the TTL is set to 14 hours.

This may mean that you have a lot of records that either never expire or are set to expire at a much later date. In either case, the records with the shortest TTL (time-to-live) will be prioritized for eviction (unless the set is configured to not evict).
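
To make that ordering concrete, here is a deliberately simplified model in C. It is not how the server is implemented: the real eviction logic works from memory and disk usage thresholds rather than a record count, and every name here (sim_record, simulate_eviction, hwm_records) is hypothetical. The sketch only shows the ordering described above: shortest TTL goes first, and records that never expire are not evicted.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, simplified record: an id plus its remaining TTL. */
typedef struct {
    int id;
    unsigned int ttl_seconds;   /* 0 means "never expires" in this model */
} sim_record;

/* Orders records by remaining TTL, shortest first.
 * Never-expiring records (ttl_seconds == 0) sort last. */
static int compare_by_ttl(const void *a, const void *b)
{
    const sim_record *ra = a;
    const sim_record *rb = b;

    if (ra->ttl_seconds == rb->ttl_seconds) return 0;
    if (ra->ttl_seconds == 0) return 1;
    if (rb->ttl_seconds == 0) return -1;
    return ra->ttl_seconds < rb->ttl_seconds ? -1 : 1;
}

/* Sorts recs in place, then "evicts" from the short-TTL end until at most
 * hwm_records remain or only never-expiring records are left.
 * The ids of evicted records are written to evicted_ids, which must have
 * room for count entries. Returns the number of evicted records. */
size_t simulate_eviction(sim_record *recs, size_t count,
                         size_t hwm_records, int *evicted_ids)
{
    size_t evicted = 0;

    qsort(recs, count, sizeof recs[0], compare_by_ttl);

    while (count - evicted > hwm_records &&
           recs[evicted].ttl_seconds != 0) {
        evicted_ids[evicted] = recs[evicted].id;
        evicted++;
    }
    return evicted;
}
```

In this model, a record written with a 14-hour TTL is removed before any record with a longer TTL or no expiry at all, which is why a namespace dominated by long-lived records can lose its 14-hour records well before their TTL runs out.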
