Apparent data loss with shadow devices


#1

Here’s what I’m seeing when testing out shadow devices:

Run the benchmark for a little while via:

./run_benchmarks -h 10.0.X.X -p 3000 -n disktest -k 1000000 -o I -wRU,50 -z20 -latency 7,1

Stop the benchmark, then run show sets in aql until the object count stops changing. Stop and restart asd. After startup, the object count is lower.
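For anyone who wants to script the "wait until the count stops changing" step, here is a minimal Python sketch. The helper name wait_until_stable and its parameters are hypothetical, and read_count is whatever function you supply to fetch the current object count (for example by parsing aql output); none of this is an Aerospike API.

```python
import time
from typing import Callable


def wait_until_stable(read_count: Callable[[], int],
                      stable_reads: int = 3,
                      interval_s: float = 1.0,
                      max_reads: int = 120) -> int:
    """Poll read_count() until it returns the same value
    stable_reads times in a row, then return that value."""
    last = read_count()
    streak = 1  # consecutive reads that returned `last`
    reads = 1   # total reads so far
    while streak < stable_reads:
        if reads >= max_reads:
            raise TimeoutError("object count never settled")
        time.sleep(interval_s)
        current = read_count()
        reads += 1
        if current == last:
            streak += 1
        else:
            # Count moved: restart the streak at the new value.
            last, streak = current, 1
    return last
```

Only restart asd once this returns, so that any drop seen after startup can't be blamed on writes that were still in flight.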

I’ve seen similar disappearances with individual objects I manually inserted, so I don’t think it’s just aql counts being wonky.

The instance is a c3.large; xvdb is ephemeral and xvdc is EBS. If I use only one disk or the other, I can't reproduce the problem, but with the two configured as a shadow pair I can.

Here’s the config for the namespace:

namespace disktest {
    replication-factor 1
    memory-size 3G
    default-ttl 22d
    single-bin true
    data-in-index true

    storage-engine device {
            data-in-memory true
            scheduler-mode noop
            enable-osync true
            device /dev/xvdb /dev/xvdc
            write-block-size 1024K
    }
}

and here’s an example shell session showing the issue:

aql> show sets
+------------------+------------+----------------+---------+-------------------+-----------+-------------------+----------+------------+
| disable-eviction | ns         | set-enable-xdr | objects | stop-writes-count | set       | memory_data_bytes | deleting | tombstones |
+------------------+------------+----------------+---------+-------------------+-----------+-------------------+----------+------------+
| "false"          | "disktest" | "use-default"  | 475612  | 0                 | "testset" | 0                 | "false"  | 0          |
+------------------+------------+----------------+---------+-------------------+-----------+-------------------+----------+------------+
1 row in set (0.000 secs)
OK

aql> show sets
+------------------+------------+----------------+---------+-------------------+-----------+-------------------+----------+------------+
| disable-eviction | ns         | set-enable-xdr | objects | stop-writes-count | set       | memory_data_bytes | deleting | tombstones |
+------------------+------------+----------------+---------+-------------------+-----------+-------------------+----------+------------+
| "false"          | "disktest" | "use-default"  | 513370  | 0                 | "testset" | 0                 | "false"  | 0          |
+------------------+------------+----------------+---------+-------------------+-----------+-------------------+----------+------------+
1 row in set (0.000 secs)
OK

aql> show sets
+------------------+------------+----------------+---------+-------------------+-----------+-------------------+----------+------------+
| disable-eviction | ns         | set-enable-xdr | objects | stop-writes-count | set       | memory_data_bytes | deleting | tombstones |
+------------------+------------+----------------+---------+-------------------+-----------+-------------------+----------+------------+
| "false"          | "disktest" | "use-default"  | 513370  | 0                 | "testset" | 0                 | "false"  | 0          |
+------------------+------------+----------------+---------+-------------------+-----------+-------------------+----------+------------+
1 row in set (0.001 secs)
OK

aql> ^C
[ec2-user@ip-10-0-2-180 ~]$ sudo service aerospike stop
Stopping aerospike:                                        [  OK  ]
[ec2-user@ip-10-0-2-180 ~]$ sudo service aerospike start
Starting and checking aerospike:                           [  OK  ]
[ec2-user@ip-10-0-2-180 ~]$ aql
Aerospike Query Client
Version 3.10.2
C Client Version 4.1.1
Copyright 2012-2016 Aerospike. All rights reserved.
aql> show sets
+------------------+------------+----------------+---------+-------------------+-----------+-------------------+----------+------------+
| disable-eviction | ns         | set-enable-xdr | objects | stop-writes-count | set       | memory_data_bytes | deleting | tombstones |
+------------------+------------+----------------+---------+-------------------+-----------+-------------------+----------+------------+
| "false"          | "disktest" | "use-default"  | 509567  | 0                 | "testset" | 0                 | "false"  | 0          |
+------------------+------------+----------------+---------+-------------------+-----------+-------------------+----------+------------+
1 row in set (0.001 secs)
OK

aql> show sets
+------------------+------------+----------------+---------+-------------------+-----------+-------------------+----------+------------+
| disable-eviction | ns         | set-enable-xdr | objects | stop-writes-count | set       | memory_data_bytes | deleting | tombstones |
+------------------+------------+----------------+---------+-------------------+-----------+-------------------+----------+------------+
| "false"          | "disktest" | "use-default"  | 509567  | 0                 | "testset" | 0                 | "false"  | 0          |
+------------------+------------+----------------+---------+-------------------+-----------+-------------------+----------+------------+
1 row in set (0.001 secs)
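
To put a number on the drop in the session above, here is a rough Python sketch that parses pasted show sets output and compares the object count before and after the restart. The function names are hypothetical, and the parser assumes the pipe-delimited table layout shown in this post; the sample strings are the header and data rows copied from the session.

```python
from typing import Dict, List

HEADER = ('| disable-eviction | ns         | set-enable-xdr | objects '
          '| stop-writes-count | set       | memory_data_bytes | deleting | tombstones |')
BEFORE = HEADER + '\n' + (
    '| "false"          | "disktest" | "use-default"  | 513370  '
    '| 0                 | "testset" | 0                 | "false"  | 0          |')
AFTER = HEADER + '\n' + (
    '| "false"          | "disktest" | "use-default"  | 509567  '
    '| 0                 | "testset" | 0                 | "false"  | 0          |')


def parse_aql_table(text: str) -> List[Dict[str, str]]:
    """Turn a pipe-delimited aql table into a list of row dicts.
    Border lines and status lines (no leading '|') are skipped."""
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue
        cells = [c.strip().strip('"') for c in line.strip("|").split("|")]
        if header is None:
            header = cells  # first '|' line holds the column names
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def objects_in_set(text: str, ns: str, set_name: str) -> int:
    """Return the 'objects' value for the given namespace and set."""
    for row in parse_aql_table(text):
        if row["ns"] == ns and row["set"] == set_name:
            return int(row["objects"])
    raise KeyError(f"{ns}.{set_name} not found")


before = objects_in_set(BEFORE, "disktest", "testset")
after = objects_in_set(AFTER, "disktest", "testset")
print(f"{before - after} objects missing after restart")  # 3803 objects missing after restart
```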

#2

Could you also run “stat system” before and after restart?


#3

Sure. Object counts match what’s reported by show sets:

aql> stat system
+---------------------------------+--------------------+
| name                            | value              |
+---------------------------------+--------------------+
| "cluster_size"                  | 1                  |
| "cluster_key"                   | "C30D42E78264824F" |
| "cluster_integrity"             | "true"             |
| "uptime"                        | 65573              |
| "system_free_mem_pct"           | 96                 |
| "system_swapping"               | "false"            |
| "objects"                       | 588479             |
| "sub_objects"                   | 0                  |
| "tombstones"                    | 0                  |
| "tsvc_queue"                    | 0                  |
| "info_queue"                    | 0                  |
| "delete_queue"                  | 0                  |
| "rw_in_progress"                | 0                  |
| "proxy_in_progress"             | 0                  |
| "tree_gc_queue"                 | 0                  |
| "client_connections"            | 2                  |
| "heartbeat_connections"         | 0                  |
| "fabric_connections"            | 16                 |
| "heartbeat_received_self"       | 0                  |
| "heartbeat_received_foreign"    | 0                  |
| "reaped_fds"                    | 1                  |
| "info_complete"                 | 4507               |
| "proxy_retry"                   | 0                  |
| "demarshal_error"               | 0                  |
| "early_tsvc_client_error"       | 0                  |
| "early_tsvc_batch_sub_error"    | 0                  |
| "early_tsvc_udf_sub_error"      | 0                  |
| "batch_index_initiate"          | 0                  |
| "batch_index_queue"             | "0:0,0:0,0:0,0:0"  |
| "batch_index_complete"          | 0                  |
| "batch_index_error"             | 0                  |
| "batch_index_timeout"           | 0                  |
| "batch_index_unused_buffers"    | 0                  |
| "batch_index_huge_buffers"      | 0                  |
| "batch_index_created_buffers"   | 0                  |
| "batch_index_destroyed_buffers" | 0                  |
| "batch_initiate"                | 0                  |
| "batch_queue"                   | 0                  |
| "batch_error"                   | 0                  |
| "batch_timeout"                 | 0                  |
| "scans_active"                  | 0                  |
| "query_short_running"           | 0                  |
| "query_long_running"            | 0                  |
| "sindex_ucgarbage_found"        | 0                  |
| "sindex_gc_locktimedout"        | 0                  |
| "sindex_gc_inactivity_dur"      | 0                  |
| "sindex_gc_activity_dur"        | 0                  |
| "sindex_gc_list_creation_time"  | 0                  |
| "sindex_gc_list_deletion_time"  | 0                  |
| "sindex_gc_objects_validated"   | 0                  |
| "sindex_gc_garbage_found"       | 0                  |
| "sindex_gc_garbage_cleaned"     | 0                  |
| "paxos_principal"               | "BB9922C2E40A70A"  |
| "migrate_allowed"               | "true"             |
| "migrate_partitions_remaining"  | 0                  |
| "fabric_msgs_sent"              | 0                  |
| "fabric_msgs_rcvd"              | 0                  |
+---------------------------------+--------------------+
57 rows in set (0.001 secs)
OK   
aql> 
[ec2-user@ip-10-0-2-180 ~]$ sudo service aerospike stop
Stopping aerospike:                                        [  OK  ]
[ec2-user@ip-10-0-2-180 ~]$ sudo service aerospike start
Starting and checking aerospike:                           [  OK  ]
[ec2-user@ip-10-0-2-180 ~]$ aql
Aerospike Query Client
Version 3.10.2
C Client Version 4.1.1
Copyright 2012-2016 Aerospike. All rights reserved.
aql> stat system
+---------------------------------+-------------------+
| name                            | value             |
+---------------------------------+-------------------+
| "cluster_size"                  | 1                 |
| "cluster_key"                   | "9E803155B2B13B7" |
| "cluster_integrity"             | "true"            |
| "uptime"                        | 10                |
| "system_free_mem_pct"           | 96                |
| "system_swapping"               | "false"           |
| "objects"                       | 580555            |
| "sub_objects"                   | 0                 |
| "tombstones"                    | 0                 |
| "tsvc_queue"                    | 0                 |
| "info_queue"                    | 0                 |
| "delete_queue"                  | 0                 |
| "rw_in_progress"                | 0                 |
| "proxy_in_progress"             | 0                 |
| "tree_gc_queue"                 | 0                 |
| "client_connections"            | 2                 |
| "heartbeat_connections"         | 0                 |
| "fabric_connections"            | 16                |
| "heartbeat_received_self"       | 0                 |
| "heartbeat_received_foreign"    | 0                 |
| "reaped_fds"                    | 0                 |
| "info_complete"                 | 9                 |
| "proxy_retry"                   | 0                 |
| "demarshal_error"               | 0                 |
| "early_tsvc_client_error"       | 0                 |
| "early_tsvc_batch_sub_error"    | 0                 |
| "early_tsvc_udf_sub_error"      | 0                 |
| "batch_index_initiate"          | 0                 |
| "batch_index_queue"             | "0:0,0:0,0:0,0:0" |
| "batch_index_complete"          | 0                 |
| "batch_index_error"             | 0                 |
| "batch_index_timeout"           | 0                 |
| "batch_index_unused_buffers"    | 0                 |
| "batch_index_huge_buffers"      | 0                 |
| "batch_index_created_buffers"   | 0                 |
| "batch_index_destroyed_buffers" | 0                 |
| "batch_initiate"                | 0                 |
| "batch_queue"                   | 0                 |
| "batch_error"                   | 0                 |
| "batch_timeout"                 | 0                 |
| "scans_active"                  | 0                 |
| "query_short_running"           | 0                 |
| "query_long_running"            | 0                 |
| "sindex_ucgarbage_found"        | 0                 |
| "sindex_gc_locktimedout"        | 0                 |
| "sindex_gc_inactivity_dur"      | 0                 |
| "sindex_gc_activity_dur"        | 0                 |
| "sindex_gc_list_creation_time"  | 0                 |
| "sindex_gc_list_deletion_time"  | 0                 |
| "sindex_gc_objects_validated"   | 0                 |
| "sindex_gc_garbage_found"       | 0                 |
| "sindex_gc_garbage_cleaned"     | 0                 |
| "paxos_principal"               | "BB9922C2E40A70A" |
| "migrate_allowed"               | "true"            |
| "migrate_partitions_remaining"  | 0                 |
| "fabric_msgs_sent"              | 0                 |
| "fabric_msgs_rcvd"              | 0                 |
+---------------------------------+-------------------+
57 rows in set (0.001 secs)
OK   
aql> show sets
+------------------+------------+----------------+---------+-------------------+-----------+-------------------+----------+------------+
| disable-eviction | ns         | set-enable-xdr | objects | stop-writes-count | set       | memory_data_bytes | deleting | tombstones |
+------------------+------------+----------------+---------+-------------------+-----------+-------------------+----------+------------+
| "false"          | "disktest" | "use-default"  | 580555  | 0                 | "testset" | 0                 | "false"  | 0          |
+------------------+------------+----------------+---------+-------------------+-----------+-------------------+----------+------------+
1 row in set (0.000 secs)
OK
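
In case it helps when comparing the two stat system dumps, here is a small Python sketch that lists only the stats whose values differ across the restart. The helper names are hypothetical, it assumes the two-column name/value layout shown above, and the sample input is a three-row excerpt of the dumps rather than the full output.

```python
from typing import Dict, Tuple

BEFORE = '''
| name         | value  |
| "uptime"     | 65573  |
| "objects"    | 588479 |
| "tombstones" | 0      |
'''
AFTER = '''
| name         | value  |
| "uptime"     | 10     |
| "objects"    | 580555 |
| "tombstones" | 0      |
'''


def parse_stats(text: str) -> Dict[str, str]:
    """Map stat name -> value, ignoring borders, the header row,
    and any line that is not a two-cell table row."""
    stats = {}
    for line in text.splitlines():
        cells = [c.strip().strip('"') for c in line.strip().strip("|").split("|")]
        if len(cells) == 2 and cells[0] != "name":
            stats[cells[0]] = cells[1]
    return stats


def changed_stats(before: str, after: str) -> Dict[str, Tuple[str, str]]:
    """Return {name: (before_value, after_value)} for stats that differ."""
    old, new = parse_stats(before), parse_stats(after)
    return {k: (old[k], new[k]) for k in old if k in new and old[k] != new[k]}


diff = changed_stats(BEFORE, AFTER)
for name, (old, new) in diff.items():
    print(f"{name}: {old} -> {new}")
lost = int(diff["objects"][0]) - int(diff["objects"][1])
print(f"objects lost across restart: {lost}")  # objects lost across restart: 7924
```

Stats such as uptime, cluster_key, and info_complete are expected to change on any restart; objects is the one that should not.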