Defrag stops working?

In our cluster we pretty often find that one of the boxes stops accepting writes because no blocks are available.

In the log we see messages like:

Jun 16 2015 11:21:59 GMT: INFO (drv_ssd): (drv_ssd.c::2398) device /dev/bcache1: used 33410034816, contig-free 333M (333 wblocks), swb-free 6, n-w 0, w-q 0 w-tot 1477350 (0.0/s), defrag-q 0 defrag-tot 1476790 (0.0/s)
Jun 16 2015 11:21:59 GMT: INFO (drv_ssd): (drv_ssd.c::2398) device /dev/bcache0: used 33409257088, contig-free 56M (56 wblocks), swb-free 16, n-w 0, w-q 0 w-tot 1455046 (0.0/s), defrag-q 0 defrag-tot 1454384 (0.0/s)

This shows that the defrag queue is empty and almost no space is left on the devices, which is not expected: according to our data size and defrag settings we should have enough space.
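For reference, the defrag settings we mean live in the storage-engine block of the namespace config. This is only a sketch with what we believe are the 3.x defaults, not our exact configuration:

```
namespace NAMESPACE_NAME {
    storage-engine device {
        # wblocks below this fill percentage are queued for defrag
        defrag-lwm-pct 50
        # microseconds to sleep between defragmenting wblocks
        defrag-sleep 1000
    }
}
```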

We restart the asd process, and after the data loads we see:

Jun 16 2015 11:54:41 GMT: INFO (drv_ssd): (drv_ssd.c::2398) device /dev/bcache1: used 33283267712, contig-free 4529M (4529 wblocks), swb-free 0, n-w 0, w-q 0 w-tot 1 (0.1/s), defrag-q 32196 defrag-tot 32638 (1631.9/s)
Jun 16 2015 11:54:41 GMT: INFO (drv_ssd): (drv_ssd.c::2398) device /dev/bcache0: used 33332301952, contig-free 4135M (4135 wblocks), swb-free 0, n-w 0, w-q 0 w-tot 1 (0.1/s), defrag-q 32556 defrag-tot 32975 (1648.8/s)

So defrag has found blocks and started reclaiming them.

We also see high iowait and a lot of timeouts on the servers with the defrag problem. After a restart, once defrag has freed some space, iowait and timeouts return to normal.

Does anybody know this problem, and is there perhaps a solution for it?

Is it just defrag that has stopped? What is your write TPS during this time?

We have also run into many issues where bcache locks up under high load. See the warning here: https://www.aerospike.com/docs/operations/plan/ssd/bcache/.

To work around this issue, 3.5.12 introduced shadow devices. A shadow device only receives writes; reads take place on the primary device.

In a future documentation push we will explain how to configure a shadow device, which is really simple to do:

In your aerospike.conf, in the namespace context:

namespace NAMESPACE_NAME {
    # ... Namespace Config
    storage-engine device {
         device /dev/PRIMARY_DEVICE /dev/SHADOW_DEVICE
         # ... Storage Engine Config
    }
}

We continue to get this error after switching to shadow devices. At some point the number of available blocks decreases very quickly and the node stops working. After a restart the node reports that a lot of blocks were found for defrag, and afterwards defrag continues to work fine.

Could you provide the logs for when the node was starting up? I’m mainly interested in the ~200 lines after the diamonds “<><>”. You may need to sanitize IP addresses and namespace names from this output.

grep -A200 "<><>" /var/log/aerospike/aerospike.log
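If it helps, one way to mask the IP addresses in that output might be the following pipeline. The sed pattern and the output filename are just a sketch; extend it for your namespace names:

```shell
# Extract the startup lines and replace IPv4 addresses with a placeholder
# before sharing the log (simple sketch; refine the pattern as needed).
grep -A200 "<><>" /var/log/aerospike/aerospike.log \
  | sed -E 's/[0-9]{1,3}(\.[0-9]{1,3}){3}/x.x.x.x/g' > startup-sanitized.log
```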

The 200 log lines you asked for mostly look like this:

Jul 30 2015 15:30:35 GMT: INFO (drv_ssd): (drv_ssd.c::4392) {foo} loaded 26374584 records, 0 subrecords, /dev/sdb 7%, /dev/sdc 7%
Jul 30 2015 15:30:37 GMT: INFO (drv_ssd): (drv_ssd.c::4392) {foo} loaded 26784577 records, 0 subrecords, /dev/sdb 7%, /dev/sdc 7%
Jul 30 2015 15:30:39 GMT: INFO (drv_ssd): (drv_ssd.c::4392) {foo} loaded 27202362 records, 0 subrecords, /dev/sdb 7%, /dev/sdc 7%
Jul 30 2015 15:30:41 GMT: INFO (drv_ssd): (drv_ssd.c::4392) {foo} loaded 27616247 records, 0 subrecords, /dev/sdb 7%, /dev/sdc 7%
Jul 30 2015 15:30:43 GMT: INFO (drv_ssd): (drv_ssd.c::4392) {foo} loaded 28036685 records, 0 subrecords, /dev/sdb 7%, /dev/sdc 7%
Jul 30 2015 15:30:45 GMT: INFO (drv_ssd): (drv_ssd.c::4392) {foo} loaded 28444454 records, 0 subrecords, /dev/sdb 7%, /dev/sdc 7%
Jul 30 2015 15:30:47 GMT: INFO (drv_ssd): (drv_ssd.c::4392) {foo} loaded 28879596 records, 0 subrecords, /dev/sdb 8%, /dev/sdc 7%
Jul 30 2015 15:30:49 GMT: INFO (drv_ssd): (drv_ssd.c::4392) {foo} loaded 29295119 records, 0 subrecords, /dev/sdb 8%, /dev/sdc 8%

repeated again and again (more than 200 lines). You can tell us up to which log records you want to see, and we will extract them for you.

What we are talking about is this:

Jul 27 2015 20:49:11 GMT: INFO (drv_ssd): (drv_ssd.c::2536) device /dev/sdb: used 36892501760, contig-free 8887M (8887 wblocks), swb-free 6, n-w 0, w-q 0 w-tot 4223550 (9.8/s), defrag-q 0 defrag-tot 4235002 (9.9/s)
Jul 27 2015 20:49:11 GMT: INFO (drv_ssd): (drv_ssd.c::2540) shadow device /dev/xvdf: w-q 0
Jul 27 2015 20:49:11 GMT: INFO (drv_ssd): (drv_ssd.c::2536) device /dev/sdc: used 36878672384, contig-free 8812M (8812 wblocks), swb-free 11, n-w 0, w-q 0 w-tot 4221670 (9.5/s), defrag-q 0 defrag-tot 4234051 (9.6/s)
Jul 27 2015 20:49:11 GMT: INFO (drv_ssd): (drv_ssd.c::2540) shadow device /dev/xvdg: w-q 0
Jul 27 2015 20:49:11 GMT: INFO (info): (thr_info.c::4796)  system memory: free 18102192kb ( 58 percent free )
Jul 27 2015 20:49:11 GMT: INFO (info): (thr_info.c::4804)  migrates in progress ( 26 , 0 ) ::: ClusterSize 18 ::: objects 177731443 ::: sub_objects 0
Jul 27 2015 20:49:11 GMT: INFO (info): (thr_info.c::4812)  rec refs 178376469 ::: rec locks 4 ::: trees 0 ::: wr reqs 3 ::: mig tx 26 ::: mig rx 2
Jul 27 2015 20:49:11 GMT: INFO (info): (thr_info.c::4818)  replica errs :: null 0 non-null 0 ::: sync copy errs :: node 0 :: master 0
Jul 27 2015 20:49:11 GMT: INFO (info): (thr_info.c::4828)    trans_in_progress: wr 3 prox 0 wait 0 ::: q 0 ::: bq 0 ::: iq 0 ::: dq 0 : fds - proto (12105, 11005126, 10993021) : hb (-356, 47543, 47899) : fab (254, 972, 718)
Jul 27 2015 20:49:11 GMT: INFO (info): (thr_info.c::4830)    heartbeat_received: self 0 : foreign 170813848
Jul 27 2015 20:49:11 GMT: INFO (info): (thr_info.c::4831)    heartbeat_stats: bt 0 bf 145082923 nt 0 ni 0 nn 0 nnir 0 nal 0 sf1 0 sf2 0 sf3 0 sf4 0 sf5 0 sf6 478 mrf 0 eh 28 efd 23 efa 5 um 0 mcf 47477 rc 47421
Jul 27 2015 20:49:11 GMT: INFO (info): (thr_info.c::4844)    tree_counts: nsup 1 scan 0 batch 0 dup 0 wprocess 0 migrx 2 migtx 26 ssdr 0 ssdw 0 rw 3
Jul 27 2015 20:49:11 GMT: INFO (info): (thr_info.c::4885) namespace foo: disk inuse: 73771172608 memory inuse: 11374812352 (bytes) sindex memory inuse: 0 (bytes) avail pct 11 cache-read pct 55.00
Jul 27 2015 20:49:11 GMT: INFO (info): (thr_info.c::4905)    partitions: actual 245 sync 0 desync 0 zombie 26 wait 0 absent 3825
Jul 27 2015 20:49:11 GMT: INFO (info): (hist.c::137) histogram dump: reads (48038994 total) msec
Jul 27 2015 20:49:11 GMT: INFO (info): (hist.c::154)  (00: 0034356586) (01: 0007956788) (02: 0002688578) (03: 0001750474)
Jul 27 2015 20:49:11 GMT: INFO (info): (hist.c::154)  (04: 0000999799) (05: 0000244669) (06: 0000035923) (07: 0000004957)
Jul 27 2015 20:49:11 GMT: INFO (info): (hist.c::154)  (08: 0000000887) (09: 0000000288) (10: 0000000042) (11: 0000000003)
Jul 27 2015 20:49:11 GMT: INFO (info): (hist.c::137) histogram dump: writes_master (1750057346 total) msec
Jul 27 2015 20:49:11 GMT: INFO (info): (hist.c::154)  (00: 1421882761) (01: 0176704609) (02: 0064845894) (03: 0042222062)
Jul 27 2015 20:49:11 GMT: INFO (info): (hist.c::154)  (04: 0025069611) (05: 0007903213) (06: 0003608738) (07: 0002537941)
Jul 27 2015 20:49:11 GMT: INFO (info): (hist.c::154)  (08: 0001579933) (09: 0001597187) (10: 0001995733) (11: 0000107957)
Jul 27 2015 20:49:11 GMT: INFO (info): (hist.c::163)  (12: 0000000933) (13: 0000000415) (14: 0000000359)
Jul 27 2015 20:49:11 GMT: INFO (info): (hist.c::137) histogram dump: proxy (63410 total) msec
Jul 27 2015 20:49:11 GMT: INFO (info): (hist.c::154)  (00: 0000005704) (01: 0000007792) (02: 0000008342) (03: 0000006467)
Jul 27 2015 20:49:11 GMT: INFO (info): (hist.c::154)  (04: 0000005471) (05: 0000005234) (06: 0000009165) (07: 0000009225)
Jul 27 2015 20:49:11 GMT: INFO (info): (hist.c::154)  (08: 0000004022) (09: 0000001456) (10: 0000000524) (11: 0000000008)
Jul 27 2015 20:49:11 GMT: INFO (info): (hist.c::137) histogram dump: writes_reply (1747914247 total) msec
Jul 27 2015 20:49:11 GMT: INFO (info): (hist.c::154)  (00: 1422005786) (01: 0176632455) (02: 0064806796) (03: 0042214411)
Jul 27 2015 20:49:11 GMT: INFO (info): (hist.c::154)  (04: 0025065306) (05: 0007899788) (06: 0003604323) (07: 0002531475)
Jul 27 2015 20:49:11 GMT: INFO (info): (hist.c::154)  (08: 0001578660) (09: 0001422158) (10: 0000142536) (11: 0000010476)
Jul 27 2015 20:49:11 GMT: INFO (info): (hist.c::163)  (12: 0000000077)
Jul 27 2015 20:49:11 GMT: INFO (info): (hist.c::137) histogram dump: udf (0 total) msec
Jul 27 2015 20:49:11 GMT: INFO (info): (hist.c::137) histogram dump: query (0 total) msec
Jul 27 2015 20:49:11 GMT: INFO (info): (hist.c::137) histogram dump: query_rec_count (0 total) count
Jul 27 2015 20:49:20 GMT: WARNING (as): (signal.c::170) SIGTERM received, shutting down
Jul 27 2015 20:49:20 GMT: INFO (storage): (storage.c::817) initiating storage shutdown ...
Jul 27 2015 20:49:20 GMT: INFO (storage): (storage.c::830) flushing data to storage ...
Jul 27 2015 20:49:20 GMT: INFO (storage): (storage.c::841) completed flushing to storage
Jul 27 2015 20:49:20 GMT: INFO (as): (as.c::480) finished clean shutdown - exiting


<....>

Jul 27 2015 21:10:53 GMT: INFO (drv_ssd): (drv_ssd.c::4392) {foo} loaded 177031442 records, 0 subrecords, /dev/sdb 88%, /dev/sdc 100%
Jul 27 2015 21:10:55 GMT: INFO (drv_ssd): (drv_ssd.c::3615) device /dev/sdb: read complete: UNIQUE 88534151 (REPLACED 4289459) (GEN 63853933) (EXPIRED 1213288) (MAX-TTL 0) records
Jul 27 2015 21:10:55 GMT: INFO (drv_ssd): (drv_ssd.c::1051) ns foo loading free & defrag queues
Jul 27 2015 21:10:55 GMT: INFO (drv_ssd): (drv_ssd.c::985) /dev/sdb init defrag profile: 2670,2942,2147,1731,1340,1223,1084,1064,944,907,793,707,692,628,614,598,501,459,457,397,373,341,331,320,283,294,267,249,242,231,206,193,198,171,193,161,195,164,154,166,169,165,156,167,164,135,164,163,155,161,147,153,153,162,176,160,133,180,191,178,158,172,170,186,215,208,219,262,254,263,275,334,380,386,499,663,791,1003,1275,1862,2642,3449,3736,3428,3190
Jul 27 2015 21:10:55 GMT: INFO (drv_ssd): (drv_ssd.c::985) /dev/sdc init defrag profile: 2805,2895,2094,1617,1400,1186,1117,987,936,878,851,781,722,673,579,594,547,458,419,392,392,405,310,317,263,303,271,239,243,217,217,216,185,192,182,167,193,191,166,180,174,175,160,179,166,158,142,165,165,167,177,150,170,162,173,186,172,173,170,184,183,178,194,232,188,233,220,215,250,275,317,340,375,437,524,591,764,953,1266,1759,2624,3548,3740,3412,3198
Jul 27 2015 21:10:55 GMT: INFO (drv_ssd): (drv_ssd.c::1075) /dev/sdb init wblock free-q 9541, defrag-q 55882
Jul 27 2015 21:10:55 GMT: INFO (drv_ssd): (drv_ssd.c::1075) /dev/sdc init wblock free-q 9377, defrag-q 56164

<....>

As you can see, before the restart the defrag-q is empty, and afterwards it has some big numbers.

Ah! I understand what you are describing; it was obvious in your original post but didn’t click. What you are seeing is something we have recently begun to describe as “coldstart fragmentation”. Basically, while Aerospike is running, the defrag threads recombine older records with the incoming write flow, and the records are rewritten to disk. This results in the same record (same generation and TTL) appearing multiple times on the disk as identical copies. The location of the active copy of each record is tracked by the primary index. When a block is defragmented, the previous copies of its records remain on disk until a new write overwrites that block.

Enter coldstart: we no longer have the primary index and must rebuild it from the contents of the disks. When we read a record R of generation G that has an unexpired TTL, we store an index entry for R with generation G in the primary index. If we then see R again with the same generation, that copy is ignored. Since coldstart may not use the same copy as the prior run, the disk starts up in a fragmented state.
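The rebuild logic described above can be sketched roughly like this. This is a simplified illustration of the idea, not Aerospike’s actual code; the function name and tuple shape are invented for the example:

```python
# Sketch of rebuilding a primary index from on-disk record copies
# during coldstart (illustrative only).
def rebuild_index(disk_records):
    """disk_records: iterable of (key, generation, expired) tuples,
    in on-disk scan order. Returns {key: generation}."""
    index = {}
    for key, generation, expired in disk_records:
        if expired:
            continue  # records past their TTL are skipped entirely
        seen_gen = index.get(key)
        # Keep the first copy seen; ignore duplicate copies with the
        # same (or an older) generation left behind by defrag.
        if seen_gen is None or generation > seen_gen:
            index[key] = generation
    return index
```

Every duplicate copy that is read and ignored here still occupies disk space, which is why the defrag queue is large right after coldstart.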

So the defrag-q being 0 before shutdown and large after coldstart is known and expected behavior, and it is definitely a red herring for whatever is causing the actual issue.

So let’s return to the issue at hand:

First, it would really help to see your configuration and the output of asadm -e "info", though you may need to sanitize private information.