Duplicate records

I am using the C Client API to store data in Aerospike, backed by NVMe SSD partitions. The C client wrote only 1 GB, but SSD storage usage shows 2 GB, i.e. twice what I actually wrote. Replication factor is 1 and there is only 1 instance, since I have been trying to isolate the issue.

Any ideas what I can look for? What is the info on the 2nd row?

asadm
Seed:        [('127.0.0.1', 3000, None)]
Config_file: /root/.aerospike/astools.conf, /etc/aerospike/astools.conf
Aerospike Interactive Shell, version 0.1.23

Found 1 nodes Online: 1.1.1.1:3000

Admin> info

                      Node               Node                  Ip       Build   Cluster   Migrations       Cluster     Cluster         Principal   Client     Uptime
                         .                 Id                   .           .      Size            .           Key   Integrity                 .    Conns          .
myinstancehostname:3000   *BB9370AB23E1600   10.240.45.77:3000   C-4.5.0.2         1      0.000     A1848673B60   True        BB9370AB23E1600        4   04:00:40
Number of rows: 1

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Namespace Usage Information (2020-03-13 03:18:36 UTC)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  Namespace                         Node     Total   Expirations,Evictions     Stop       Disk    Disk     HWM   Avail%          Mem     Mem    HWM      Stop
          .                            .   Records                       .   Writes       Used   Used%   Disk%        .         Used   Used%   Mem%   Writes%
data          myinstancehostname:3000   4.370 K   (0.000,  0.000)         false    1.003 GB   1       50      99       608.930 KB   1       60     90
data                                       4.370 K   (0.000,  0.000)                  1.003 GB                            608.930 KB
transaction   myinstancehostname:3000   0.000     (0.000,  0.000)         false         N/E   N/E     50      N/E       72.000 KB   1       60     90
transaction                                0.000     (0.000,  0.000)                  0.000 B                              72.000 KB
Number of rows: 4

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Namespace Object Information (2020-03-13 03:18:36 UTC)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  Namespace                         Node     Total     Repl                      Objects                   Tombstones             Pending   Rack
          .                            .   Records   Factor   (Master,Prole,Non-Replica)   (Master,Prole,Non-Replica)            Migrates     ID
          .                            .         .        .                            .                            .             (tx,rx)      .
data          myinstancehostname:3000   4.370 K   1        (4.370 K, 0.000,  0.000)     (0.000,  0.000,  0.000)      (0.000,  0.000)     0
data                                       4.370 K            (4.370 K, 0.000,  0.000)     (0.000,  0.000,  0.000)      (0.000,  0.000)
transaction   myinstancehostname:3000   0.000     1        (0.000,  0.000,  0.000)      (0.000,  0.000,  0.000)      (0.000,  0.000)     0
transaction                                0.000              (0.000,  0.000,  0.000)      (0.000,  0.000,  0.000)      (0.000,  0.000)
Number of rows: 4

------------aerospike.conf
service {
    paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
    proto-fd-max 15000
    #proto-fd-idle-ms 15000 # 15sec, default is 1 minute
}
network {
    service {
        address any
        port 3000
    }

    heartbeat {
        mode multicast
        multicast-group 239.1.99.222
        port 9918

        # To use unicast-mesh heartbeats, remove the 3 lines above, and see
        # aerospike_mesh.conf for alternative.

        interval 150
        timeout 10
    }

    fabric {
        port 3001
    }

    info {
        port 3003
    }
}

namespace transaction {
    replication-factor 1
    memory-size 64G
    default-ttl 0d # 0 to never expire/evict.
    storage-engine memory
}

namespace data {
    replication-factor 1
    memory-size 64G
    default-ttl 0d # 0 to never expire/evict.
    storage-engine device {
        max-write-cache 10G
        write-block-size 1M
        device  /dev/nvme2n1p1
        device  /dev/nvme2n1p2
        device  /dev/nvme2n1p3
        device  /dev/nvme2n1p4
    }
}

Sorry, how did you get 2 GB from this?

The device usage is shown below (2.10 GB), but the Disk Used above shows 1.003 GB:

/dev/nvme2n1 S332NXAH300724 SAMSUNG MZQLW1T9HMJP-000ZU 1 2.10 GB / 1.92 TB 512 B + 0 B CXV8301Q

And when I write to the raw device directly, disk usage is correct; but when I go through Aerospike to write to the same raw device, actual disk usage is twice what asadm reports.

I don’t quite understand the asadm output below. What’s the 2nd line indicating?

  Namespace                         Node     Total   Expirations,Evictions     Stop       Disk    Disk     HWM   Avail%          Mem     Mem    HWM      Stop
          .                            .   Records                       .   Writes       Used   Used%   Disk%        .         Used   Used%   Mem%   Writes%
data          myinstancehostname:3000   4.338 K   (0.000,  0.000)         false    1.003 GB   1       50      99       606.038 KB   1       60     90
data                                       4.338 K   (0.000,  0.000)                  1.003 GB                            606.038 KB

That's a summary of all your nodes.

<ns> <node1> <diskused>
<ns> <node2> <diskused>
<ns>         <total disk used>

Another way to look at it that might help is to run summary in asadm.
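
For example, from the same interactive shell shown earlier (output omitted here):

Admin> summary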

Thanks Albot for your quick responses. There is no node2; it's a single-node instance. The second line in the output doesn't have a node, it's empty.

We are writing 1 GB, and it shows 1 GB per line (2nd line - node is empty). /dev/nvme* disk usage shows 2 GB. I'm trying to figure out where this additional usage is coming from. Please share any ideas; it's been really time consuming to figure this out.

It's a summary of all your nodes. All 1 of them.

@kkb, are you calculating the 2 GiB from the asadm output, or are you running a different command such as du? We are only seeing 1 GiB used per the asadm output. Also note that du doesn't work when a raw device is configured; any existing filesystem will be overwritten.

I am using nvme list, sample output below. I am using the C Client API; are there any policy settings that would cause Aerospike to write an extra copy to the drive? On the server side, replication factor is 1.

Node SN Model Namespace Usage Format FW Rev
/dev/nvme2n1 S332NXAH300471 SAMSUNG MZQLW1T9HMJP-000ZU 1 2.13 GB / 1.92 TB 512 B + 0 B CXV8301Q

No, there isn't any policy that would cause Aerospike to write more than one copy to the device. Aerospike keeps only one live copy on any node, regardless of your configuration.

I haven't used the nvme command before, nor am I aware of how it interprets the amount of space Aerospike is using. It is possible that Aerospike has written at the 2 GiB offset, which this tool interprets as 2 GiB used - but this doesn't mean that 2 GiB are currently in use by Aerospike.

For updates, Aerospike uses copy-on-write. The prior copy is eventually returned to the free pool when its block becomes eligible for defrag. I'd recommend learning more about the defragmentation process.
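
If it helps, defrag behavior is tunable in the storage-engine stanza of aerospike.conf. A minimal sketch in the same style as the config in the first post (values shown are the defaults as I understand them, not tuning recommendations):

namespace data {
    ...
    storage-engine device {
        device /dev/nvme2n1p1
        write-block-size 1M
        defrag-lwm-pct 50   # blocks used below 50% become eligible for defrag
        defrag-sleep 1000   # microseconds to sleep after each defragged block is read
    }
}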

I tried with a file instead of NVMe. It's the same issue. For some reason, asadm shows the correct disk usage but not the actual size of the file. I suspect Aerospike is writing more data for some reason.

I wrote only 1 GB, yet:

du -h /var/lib/aerospike/db.dat
2.0G    /var/lib/aerospike/db.dat

The data received on the network interface is 1 GB.

Every time I write some data to Aerospike, it is always twice the size of the actual data written. asadm certainly reports correctly; however, the actual storage usage reported by du is twice that. Even if there is some offset you are seeking to before writing data, how can it be twice whatever size was written?

I have provided aerospike.conf in the first post. Please check.

When you say you are “writing data”: are you creating new records or updating existing records?

a) Create:  key1:data, key2:data, key3:data ....
b) Update: key1: data_v0  --> key1:data_v1 --> key1:data_v2....

What is your “write”? a) or b)?

It's create only. aerospike_key_operate_async() is used with multiple bins. It looks like the server is flushing the block while it is only partially filled, even when the block is just 60% full. If I use the C benchmarks tool, it works as expected: write blocks are flushed only after they are full.
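
For context, here is roughly how the writes are issued - a minimal sketch rather than my exact code. The namespace/set/bin names and payload size are made up, connection setup and event-loop creation are omitted, and the policy is left at defaults (our real code also enforces create-only semantics):

#include <stdio.h>
#include <aerospike/aerospike.h>
#include <aerospike/aerospike_key.h>
#include <aerospike/as_event.h>
#include <aerospike/as_operations.h>

// Completion callback for each async operate call.
static void
write_listener(as_error* err, as_record* rec, void* udata, as_event_loop* loop)
{
    if (err) {
        fprintf(stderr, "write failed: %d %s\n", err->code, err->message);
    }
}

// Issue one create-style write with multiple bins (sizes are illustrative).
static as_status
write_record(aerospike* as, as_event_loop* loop, const char* key_str,
             const uint8_t* payload, uint32_t payload_sz)
{
    as_error err;

    as_key key;
    as_key_init_str(&key, "data", "myset", key_str);

    as_operations ops;
    as_operations_inita(&ops, 2);
    as_operations_add_write_rawp(&ops, "payload", payload, payload_sz, false);
    as_operations_add_write_int64(&ops, "version", 0);

    // NULL = default operate policy (create-only enforcement omitted here).
    as_status rc = aerospike_key_operate_async(as, &err, NULL, &key, &ops,
                                               write_listener, NULL, loop, NULL);

    as_operations_destroy(&ops);
    return rc;
}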

Something is causing the server to flush write blocks prematurely, padded with zeros. What are the conditions for flushing write blocks even though they are only partially filled?

If the write-block-size is 1MB, what's the recommendation for clients that have partial-block data, to avoid premature flushes on the server?

BTW - between T1 and T2, there should be at least one new write for the T2 flush to happen… just a minor point.

I would think the SSD controller will just write the 1MB w-b to a new set of 4KB blocks (I don't think it can overwrite existing blocks) and mark them as our Block#1 - but that is outside Aerospike's realm. I.e., when my slide says "overwriting on device"… that does not mean the SSD controller is actually overwriting the same NAND cells - as far as Aerospike is concerned, the SSD controller is logically giving us our "block#1" as overwritten with the additional data. Regardless, the SSD controller's own defrag routine should recover this space - again, transparent to Aerospike and nothing to do with Aerospike's own "defragging" of write-block-size allocations on the SSD. Typically, SSDs give the best performance at Aerospike's 128KB w-b size.

Thank you for the details @pgupta. In my case, there is a high percentage of 'flush because unable to fit the incoming record'. This is pretty much doubling the device usage, and it also means additional processing to get new buffers, etc. It looks like the onus is on the client to make sure the server's write-block-size is optimally filled, to get better performance and optimal device usage.

But one way or the other it is recovered by defrag. So I don't quite understand at what level you are able to see 2x usage, but it is not a permanent loss of storage. So it really should be a non-issue.

There are multiple issues with partial flushes -

  1. write-q full errors, because the flushes go to the write-q to be flushed (see the rough arithmetic after this list):
     WARNING (drv_ssd): (drv_ssd.c:3636) {data} write fail: queue too deep: exceeds max 5120
  2. Since there is padding with zeros, I'm not sure the SSD firmware itself would do any defragmentation.
  3. Every write transaction that leads to a flush is an overhead. Even if I change write-block-size to 128k, I would still run into this issue. Below is a snippet from the (instrumented) Aerospike logs - the remaining size is padded with zeros. Too much overhead.
     (drv_ssd.c:1551) write_sz 1048816, wblock size 2097152, pos 1048816 remaining 1048336
     (drv_ssd.c:1551) write_sz 1048816, wblock size 2097152, pos 1052704 remaining 1044448
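
For reference, the 5120 in that warning looks consistent with my config - a rough back-of-the-envelope, assuming the write-q depth limit is max-write-cache divided by write-block-size (my understanding, not confirmed), and using the 2 MiB wblock size from the instrumented log above (not the 1M in the first-post config):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    // max-write-cache 10G (aerospike.conf) and the 2 MiB wblock size (2097152)
    // seen in the instrumented log lines above.
    uint64_t max_write_cache  = 10ULL * 1024 * 1024 * 1024;
    uint64_t write_block_size = 2ULL * 1024 * 1024;

    // Assumed relationship: allowed write-q depth ~= max-write-cache / write-block-size.
    printf("max queue depth ~= %llu\n",
           (unsigned long long)(max_write_cache / write_block_size)); // prints 5120
    return 0;
}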

Any ideas? Maybe the client side can do some rounding. The client is also adding about 100 bytes to the record. Just curious why the server cannot support a write size greater than the block size.

1 - You can't be flushing the same block multiple times and also be getting the write-q too deep error; those are two orthogonal conditions. The only possible situation where this can happen is if your drive is some kind of SAN volume and is hung up - it neither responds with a failure nor completes the write.

2 - SSD firmware - what it does is internal to the drive. It does not affect Aerospike.

3 - Every write leads to a flush if and only if you are writing just one record per second. Aerospike is designed to handle many 1000s of record writes per second per node. One write per second is an unusual use case for Aerospike.

4 - The server cannot split a record across multiple w-blocks when storing on disk… hence the max record size (record + overhead) is limited to the w-b size.

5 - There is record overhead which adds bytes, and records are written at 16-byte boundaries on disk.
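
As a rough illustration of point 5 (the exact per-record overhead depends on the bins, whether the key is stored, etc., so treat the numbers as illustrative):

#include <stdint.h>
#include <stdio.h>

// Round a record's stored size up to the next 16-byte boundary.
static uint64_t round_up_16(uint64_t stored_sz)
{
    return (stored_sz + 15) & ~15ULL;
}

int main(void)
{
    // e.g. ~32 KiB of data plus ~100 bytes of overhead (numbers from this thread):
    uint64_t stored = 32 * 1024 + 100;                  // 32868 bytes
    printf("%llu -> %llu bytes on disk\n",
           (unsigned long long)stored,
           (unsigned long long)round_up_16(stored));    // 32880 bytes
    return 0;
}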

1 - Not sure I understand what you mean by the same block.

3 - In my case, there is another condition, i.e. if the incoming record doesn't fit, the current block gets flushed. For example:
- Client writes 32k + 100 bytes; the server write-block-size is 64k.
- The server fills a block with data of size 32k + 100 bytes.
- New data arrives with another 32k + 100 bytes; the current block's remaining size is < 32k, so the record cannot fit, and the block is put in the write-q for flushing. The block lost about 31k.
- The server repeats the above for every client write of 32k + 100 bytes.

Please check the code snippet below. Every client write of 32k + 100 bytes lands in this branch, as I instrumented the code -

ssd_buffer_bins() {
        ........
        ........
        // Check if there's enough space in the current buffer - if not,
        // enqueue it to be flushed to device, and grab a new buffer.
        if (write_sz > ssd->write_block_size - swb->pos) {
                // Enqueue the buffer, to be flushed to device.
                cf_queue_push(ssd->swb_write_q, &swb);
                cur_swb->n_wblocks_written++;
        .....
}

1 - See the slide I posted - that's where you are flushing the same w-b multiple times. In that example I have a 48-byte record and a 128KB write block - the same block gets flushed multiple times before it's full, because the write rate is very slow and it takes multiple seconds to fill it.

2 - Note it goes to max-write-cache only on the final flush - so max-write-cache is sitting empty (in that specific example) for a few seconds. The time to flush a w-b to disk is on the order of hundreds of microseconds, worst case milliseconds. Blocks piling up in max-write-cache give the queue-too-deep error - so you can't, on one hand, be flushing the same block multiple times due to a slow write rate (not going through max-write-cache) and also have max-write-cache full; max-write-cache will always be flushed and empty in that case. max-write-cache becomes useful when you have a temporary, stupendously high write rate (burst load) that the disk write rate can't keep up with. max-write-cache takes care of burst write load for a normally tuned/designed system.

3 - Writing 32.1KB records into a 64KB w-b - that's a unique situation which will cause each write block to be flushed once and contain exactly one record, i.e. ~50% usage of disk capacity. Your only option there is to select a larger w-b size - the max is 8MB. With 128KB, you will fit 3 records per block → 75% usage of storage capacity; with a 1MB w-b size → 31/32 usage … better. There is no other way around that, unless you can reduce your record size to just under 32KB with overhead. Refer to Linux Capacity Planning to see how record size = data + overhead can be calculated. BTW, a 64KB w-b size works but is not recommended for SSDs; SSDs perform best at 128KB. Likewise, you can set the w-b size to 8MB (max), but that is not the best performing spot for SSDs. (Most) SSD write throughput typically peaks at 128KB and then falls off on either side, steeper below 128KB, gentler above. ACT testing reveals the actual performance comparison of different w-b sizes for a particular SSD drive.
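
To make that arithmetic concrete, a small sketch (the ~32 KiB + ~100 bytes record size is taken from the numbers above; the per-record overhead and 16-byte rounding are approximations, and per-block overhead is ignored):

#include <stdint.h>
#include <stdio.h>

// Approximate records per write block and storage efficiency for a few
// write-block-sizes, assuming each stored record is ~32 KiB of data plus
// ~100 bytes of overhead, rounded up to a 16-byte boundary.
int main(void)
{
    const uint64_t record_sz  = ((32 * 1024 + 100) + 15) & ~15ULL;  // 32880 bytes
    const uint64_t wb_sizes[] = { 64 * 1024, 128 * 1024, 1024 * 1024 };

    for (size_t i = 0; i < sizeof(wb_sizes) / sizeof(wb_sizes[0]); i++) {
        uint64_t wb = wb_sizes[i];
        uint64_t records_per_block = wb / record_sz;
        double used_pct = 100.0 * (double)(records_per_block * record_sz) / (double)wb;
        printf("w-b %7llu B: %llu record(s)/block, ~%.0f%% of capacity used\n",
               (unsigned long long)wb, (unsigned long long)records_per_block, used_pct);
    }
    return 0;
}

// Roughly: 64 KiB -> 1 record (~50%), 128 KiB -> 3 records (~75%),
// 1 MiB -> 31 records (~97%), i.e. the 31/32 figure above.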