Defragmentation not working as expected


#1

Hi,

I have Aerospike cluster with 7 nodes. According to AMC:

- 100K+ read tps, 40K+ write tps
- free disk space ~50%
- least available < 5%

defrag-lwm-pct is set to 80%. No effect at all. From aerospike.log:

Jul 16 2015 09:37:19 GMT: INFO (drv_ssd): (drv_ssd.c::2536) device /dev/sdd4: used 181381585920, contig-free 274406M (2195254 wblocks), swb-free 4, n-w 0, w-q 0 w-tot 17010586 (72.1/s), defrag-q 0 defrag-tot 17126879 (70.4/s)
Jul 16 2015 09:37:19 GMT: INFO (drv_ssd): (drv_ssd.c::2536) device /dev/sdb4: used 181140654336, contig-free 12526M (100208 wblocks), swb-free 4, n-w 0, w-q 0 w-tot 16877497 (69.2/s), defrag-q 0 defrag-tot 16973903 (65.6/s)
Jul 16 2015 09:37:19 GMT: INFO (drv_ssd): (drv_ssd.c::2536) device /dev/sdc4: used 181445532288, contig-free 274348M (2194790 wblocks), swb-free 5, n-w 0, w-q 0 w-tot 16963417 (73.2/s), defrag-q 0 defrag-tot 17100932 (71.4/s)
Jul 16 2015 09:37:19 GMT: INFO (drv_ssd): (drv_ssd.c::2536) device /dev/sda4: used 181166546048, contig-free 12432M (99460 wblocks), swb-free 4, n-w 0, w-q 0 w-tot 18278085 (80.7/s), defrag-q 0 defrag-tot 18376599 (76.1/s)
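For anyone reading along, the drv_ssd fields can be cross-checked with a short script — e.g. contig-free divided by the wblock count recovers the write-block-size (values copied from the /dev/sdd4 line above; the GiB conversion is mine):

```python
import re

# One of the drv_ssd lines from the log above (/dev/sdd4).
line = ("device /dev/sdd4: used 181381585920, contig-free 274406M "
        "(2195254 wblocks), swb-free 4, n-w 0, w-q 0 w-tot 17010586 (72.1/s), "
        "defrag-q 0 defrag-tot 17126879 (70.4/s)")

m = re.search(r"used (\d+), contig-free (\d+)M \((\d+) wblocks\)", line)
used_bytes, contig_free_mb, wblocks = map(int, m.groups())

# Derive the write-block-size from contig-free / wblock count.
wblock_kb = round(contig_free_mb * 1024 / wblocks)
print(f"wblock size: {wblock_kb}K")           # 128K
print(f"used: {used_bytes / 2**30:.1f} GiB")  # ~168.9 GiB
print(f"contig-free: {contig_free_mb / 1024:.1f} GiB")
```

Note also that defrag-q is 0 on every device in these lines.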

How can I get rid of this situation?

Regards, Alex


#2

Hi Alex,

From the log snippet, it seems you are writing faster than the defrag process can reclaim blocks. What write-block-size are you using? What kind of workload are you running (read/write TPS)?


#3

Hi Vishnu,

write-block-size is 128K (SSD disks). I can’t figure out why the defragmentation process idles while there are free CPU and disk resources.

Regards, Alex


#4

Setting defrag-lwm-pct higher is probably counterproductive. You should be fine around 50-55%; setting it higher will cause defrag to move more data for less benefit.

By default, Aerospike sleeps 1000 microseconds between each wblock defragged before continuing to the next. If you need to process the queue faster, this sleep needs to be reduced by adjusting the defrag-sleep parameter. You could try setting the sleep to 0 and see if defrag begins to catch up (if it cannot catch up at 0, we will need to try another approach). If you start having performance issues with defrag-sleep set to 0, you will need to increase it and monitor avail_pct to ensure that defrag doesn’t fall behind with the increased setting.
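For reference, defrag-sleep is a dynamic parameter, so it can be changed at runtime with asinfo (asinfo -v "set-config:context=namespace;id=<your-namespace>;defrag-sleep=0") or persisted in the namespace’s storage-engine block. A sketch of the static form — namespace name illustrative, devices and write-block-size taken from this thread; check the configuration reference for your version:

```
namespace mynamespace {            # namespace name is illustrative
    storage-engine device {
        device /dev/sda4           # the four raw partitions from this thread
        device /dev/sdb4
        device /dev/sdc4
        device /dev/sdd4
        write-block-size 128K
        defrag-lwm-pct 50          # keep near the recommended 50-55
        defrag-sleep 0             # default 1000 (microseconds)
    }
}
```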


#5

Hi kporter,

As I wrote above, defrag-lwm-pct is set to 80. Setting defrag-sleep to 0 (yes, I checked the docs before asking) didn’t resolve the issue. I could raise defrag-lwm-pct to 99, but that scares me.

Regards, Alex


#6

Setting defrag-lwm-pct to 80, and especially 99, is going to be counterproductive in your situation. I would recommend lowering it to around 50-55%. Otherwise you will drastically and unnecessarily increase write amplification from defrag.
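The write amplification point can be made concrete with a simplified model (my simplification, ignoring partially-filled blocks): a wblock defragged right at the low-water mark still holds about lwm% live data, all of which must be rewritten, so each reclaimed block costs roughly lwm/(100-lwm) blocks’ worth of copying:

```python
def defrag_copy_ratio(lwm_pct: float) -> float:
    """Live data rewritten per unit of space reclaimed when a wblock
    is defragged right at the low-water mark (simplified model)."""
    return lwm_pct / (100.0 - lwm_pct)

for lwm in (50, 55, 80, 99):
    print(f"defrag-lwm-pct={lwm:2d}: "
          f"~{defrag_copy_ratio(lwm):.1f}x data moved per block reclaimed")
```

At 50% that is ~1x extra copying per block reclaimed; at 80% it is ~4x, and at 99% it is ~99x.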

Had you tried setting defrag-sleep to 0 while keeping defrag-lwm-pct at 50%?


#7

Hi kporter,

Already tried it. No success.

Let’s investigate further. Nodes are set as follows:

- 2x 240GB SSD Samsung SM843
- 2x 480GB SSD Samsung PM853

The two small disks hold the system image (via Linux mdraid) and also a swap partition. The free space (~190GB) on each of them is configured as a raw partition for Aerospike. So I have 4 slices used by Aerospike: 2x190GB and 2x447GB.

I suppose the defragmentation logic currently implemented in Aerospike scans only the first half of the large disks (480GB) to determine whether any blocks need to be processed.

Regards, Alex


#8

This isn’t how the defrag algorithm works. An older algorithm scanned the disks; did you find a reference to that in the docs? The current algorithm processes a queue containing defrag-eligible wblocks: when an update or delete occurs and a wblock becomes defrag eligible, it is immediately added to the defrag queue.
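A minimal sketch of that queue-based design (illustrative Python, not Aerospike’s actual code): each wblock tracks its live-data bytes, and the moment an update or delete drops it below the low-water mark it is queued, rather than being found by a periodic disk scan.

```python
from collections import deque

WBLOCK_SIZE = 128 * 1024   # matches the 128K write-block-size in this thread
DEFRAG_LWM_PCT = 50        # illustrative low-water mark

class WBlock:
    def __init__(self, used_bytes: int):
        self.used = used_bytes
        self.queued = False

defrag_q: deque = deque()

def on_delete(block: WBlock, record_bytes: int) -> None:
    """Called when a record in this wblock is updated or deleted."""
    block.used -= record_bytes
    below_lwm = block.used * 100 < WBLOCK_SIZE * DEFRAG_LWM_PCT
    if below_lwm and not block.queued:
        block.queued = True
        defrag_q.append(block)   # eligible immediately -- no disk scan

b = WBlock(used_bytes=100 * 1024)   # ~78% full
on_delete(b, 50 * 1024)             # drops to ~39%, below the 50% mark
print(len(defrag_q))                # 1
```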

Aerospike expects that all the disks attached to a namespace are the same size. You will only be able to use 190GB from each disk.

I do not have ACT benchmark numbers for these drives, but based on the TPS you reported:

Assuming this is the aggregate across the 7 nodes, you are doing ~14K read and ~6K write ops per node per second. In ACT benchmark terms this is roughly a 6x test. Have you used this benchmark to determine whether your hardware can handle the load you are putting on it?
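The arithmetic behind those per-node figures, assuming the load is spread evenly (cluster-wide numbers taken from the AMC stats in the first post):

```python
nodes = 7
cluster_read_tps, cluster_write_tps = 100_000, 40_000  # from AMC in post #1

print(f"~{cluster_read_tps / nodes:.0f} reads/s per node")   # ~14286
print(f"~{cluster_write_tps / nodes:.0f} writes/s per node") # ~5714
```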

Could you provide the output of:

iostat -x 1 3 # iostat is provided by sysstat

#9

Hi kporter,

I only made an assumption about how it may work, based on how it behaves. If it becomes an issue I may go through the code to find out what logic is implemented (but I’d prefer not to).

The SSD manufacturer can’t be changed right now, although I’m aware of https://blog.algolia.com/when-solid-state-drives-are-not-that-solid/

iostat output (Ubuntu 12.04.5 LTS):

Linux 3.2.0-80-generic (host5.domain.com)        07/21/2015      _x86_64_        (12 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          34.46    0.00   19.34   21.79    0.00   24.41

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.03     9.20 2894.28  306.70 43705.42 39110.74    51.74     0.26    0.08    0.03    0.54   0.04  12.06
sdc               0.00    15.49 2866.13  289.75 41722.19 37135.85    49.98     0.38    0.12    0.07    0.65   0.05  15.99
sdd               0.00    15.49 2863.65  285.79 41295.77 36627.77    49.48     0.44    0.14    0.09    0.64   0.05  17.02
sdb               0.03     9.20 2892.99  304.31 43429.26 38803.97    51.44     0.27    0.09    0.03    0.61   0.04  13.17
md0               0.00     0.00    0.00    0.00     0.00     0.00     7.92     0.00    0.00    0.00    0.00   0.00   0.00
md2               0.00     0.00    0.01    4.20     0.10   116.82    55.55     0.00    0.00    0.00    0.00   0.00   0.00
md1               0.00     0.00    0.00    0.00     0.00     0.00     2.69     0.00    0.00    0.00    0.00   0.00   0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          26.15    0.00    9.34    9.25    0.00   55.26

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00 2050.00   86.00 11967.50 11008.00    21.51     0.42    0.20    0.20    0.28   0.14  29.20
sdc               0.00     0.00 2043.00  107.00 15762.00 13696.00    27.40     0.52    0.24    0.23    0.49   0.14  30.80
sdd               0.00     0.00 2135.00  143.00 20076.50 18304.00    33.70     0.77    0.34    0.33    0.56   0.17  37.60
sdb               0.00     0.00 1969.00  113.00 16559.00 14464.00    29.80     0.43    0.21    0.20    0.28   0.12  25.20
md0               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
md2               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
md1               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          28.00    0.00    9.42    9.67    0.00   52.91

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00 2010.00   83.00 11590.50 10624.00    21.23     0.34    0.16    0.16    0.19   0.12  26.00
sdc               0.00     0.00 2071.00  128.00 19207.00 16384.00    32.37     0.86    0.39    0.39    0.44   0.18  39.20
sdd               0.00     0.00 1983.00  125.00 17768.50 16000.00    32.04     0.64    0.30    0.29    0.48   0.17  36.40
sdb               0.00     0.00 2032.00  124.00 18227.00 15872.00    31.63     0.47    0.22    0.21    0.29   0.13  28.40
md0               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
md2               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
md1               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

Numbers are lower than usual btw.

Regards, Alex


#10

Based on the iostat output it doesn’t appear the drives are falling behind, but that depends on how much higher they normally are.

I wasn’t suggesting that they be changed; I was suggesting that you benchmark them to determine what performance you should expect from Aerospike.


#11

Hi kporter,

Numbers are up to ~30-35% higher in peak hours. Also, these SSDs show lower speed when zeroing than the *ntel drives we have in another DC.

Regards, Alex


#12

Sorry for the delay. Could you provide sar output that includes your peak hours?

# sar -d