HWM breached and cluster went down


#1

We ran into an HWM breach and the cluster went down. I no longer see the asd process on the servers. Ideally it should either keep evicting or pause writes when it reaches 90%. But why is the node not responding to heartbeats, and why did the cluster size drop to 0?

Build - C-3.15.1.3

Apr 08 2018 11:41:06 GMT: INFO (info): (ticker.c:435) {nano} memory-usage: total-bytes 23729610339 index-bytes 435985216 sindex-bytes 0 data-bytes 23293625123 used-pct 85.00
Apr 08 2018 11:41:06 GMT: INFO (info): (ticker.c:465) {nano} device-usage: used-bytes 24223068928 avail-pct 65
Apr 08 2018 11:41:06 GMT: INFO (info): (ticker.c:534) {nano} client: tsvc (0,1) proxy (186,0,0) read (3798065,0,0,9966748) write (3879032,847,0) delete (0,0,0,0) udf (0,0,0) lang (0,0,0,0)
Apr 08 2018 11:41:06 GMT: INFO (info): (ticker.c:684) {nano} retransmits: migration 0 client-read 0 client-write (0,3) client-delete (0,0) client-udf (0,0) batch-sub 0 udf-sub (0,0) nsup 0
Apr 08 2018 11:41:06 GMT: INFO (info): (hist.c:139) histogram dump: {nano}-read (13764813 total) msec
Apr 08 2018 11:41:06 GMT: INFO (info): (hist.c:156)  (00: 0013738942) (01: 0000023435) (02: 0000001996) (03: 0000000425)
Apr 08 2018 11:41:06 GMT: INFO (info): (hist.c:165)  (04: 0000000011) (05: 0000000002) (07: 0000000002)
Apr 08 2018 11:41:06 GMT: INFO (info): (hist.c:139) histogram dump: {nano}-write (3879879 total) msec
Apr 08 2018 11:41:06 GMT: INFO (info): (hist.c:156)  (00: 0003608154) (01: 0000204420) (02: 0000039144) (03: 0000012968)
Apr 08 2018 11:41:06 GMT: INFO (info): (hist.c:156)  (04: 0000008618) (05: 0000005165) (06: 0000001168) (07: 0000000152)
Apr 08 2018 11:41:06 GMT: INFO (info): (hist.c:165)  (08: 0000000082) (09: 0000000008)
Apr 08 2018 11:41:13 GMT: WARNING (socket): (socket.c:749) Error while connecting: 111 (Connection refused)
Apr 08 2018 11:41:13 GMT: WARNING (socket): (socket.c:808) Error while connecting socket to 11.242.116.114:3002
Apr 08 2018 11:41:13 GMT: WARNING (hb): (hb.c:4669) could not create heartbeat connection to node {11.242.116.114:3002}
Apr 08 2018 11:41:14 GMT: INFO (drv_ssd): (drv_ssd.c:2072) {nano} /var/lib/aerospike/nano.dat: used-bytes 24223664640 free-wblocks 586236 write-q 0 write (206175,1.1) defrag-q 0 defrag-read (117607,0.1) defrag-write (39713,0.1)

#2

If you configure memory for a namespace - say 4G RAM - and your actual instance has only 2G RAM - Aerospike will not check whether you actually have 4G on the instance. On Enterprise Edition it allocates in 1G chunks, and you can crash the server even before breaching the HWM. So I would first compare the actual memory available on the machine against what you allocated to the namespace - combined across all your namespaces - and make sure you have enough headroom for the OS and other processes. This may be a case of Linux killing Aerospike because it was pulling more memory than Linux could give it.
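A quick way to sanity-check this on each box (a sketch, not an official tool; the 5G minimum headroom threshold is an assumption of mine, not an Aerospike figure):

```shell
# Sketch: compare physical RAM against the sum of all namespace
# memory-size settings. The 5G minimum headroom is an assumption,
# not an official recommendation.
check_headroom() {
    sys_mem_gb=$1   # total RAM on the box, GB (e.g. from `free -g`)
    ns_mem_gb=$2    # sum of memory-size across all namespaces, GB
    headroom_gb=$(( sys_mem_gb - ns_mem_gb ))
    if [ "$headroom_gb" -lt 5 ]; then
        echo "LOW: ${headroom_gb}G headroom"
    else
        echo "OK: ${headroom_gb}G headroom"
    fi
}

# This thread's case: a ~31G box with a 30G namespace.
check_headroom 31 30   # prints "LOW: 1G headroom"
```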


#3
     memory-size 30G
     replication-factor 2
     default-ttl 30d
     high-water-memory-pct 95
     stop-writes-pct 98
     storage-engine device {
         file /var/lib/aerospike/nano.dat
         filesize 110G
         data-in-memory true
         write-block-size 128K
         defrag-lwm-pct 50
         defrag-startup-minimum 10
     }
 }

System Mem:

# free -g
              total         
Mem:         31

#4

Looks like you are running close to the RAM limits of this box, leaving a very small slice for system overhead.

Stop writes is evaluated every 10 seconds.

The memory HWM is evaluated every nsup cycle; the default is 120 seconds, but a cycle can take longer depending on the amount of data.

Likely the 2 GB wasn't enough buffer for the OS and services plus Aerospike overhead. Check /var/log/messages for oom-killer.
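To confirm which process invoked the OOM killer, you can scan the kernel log. The helper below is a hypothetical sketch; it parses a kernel line of the usual `<process> invoked oom-killer` shape:

```shell
# Sketch: extract which process invoked the oom-killer from a kernel
# log line. In practice you would feed it syslog, e.g.:
#   zgrep -h 'invoked oom-killer' /var/log/syslog* | oom_invoker
# (log location varies: /var/log/messages on RHEL, /var/log/syslog on
# Debian/Ubuntu).
oom_invoker() {
    sed -n 's/.* \([^ ]*\) invoked oom-killer.*/\1/p'
}

# Example using the kernel log line format:
line='Apr  6 22:48:18 prd-nano001 kernel: [8417502.271227] asd invoked oom-killer: gfp_mask=0x24201ca, order=0, oom_score_adj=0'
printf '%s\n' "$line" | oom_invoker   # prints "asd"
```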


#5

Can you share the full output of

$ free -m


#6
mannoj#free -m
              total        used        free      shared  buff/cache   available
Mem:          32174        2587       26689         336        2898       28865
Swap:             0           0           0

#7

Looks like memory is not the issue at present, but your configuration is nonetheless a recipe for trouble. So, what was the memory situation when you crashed? We are looking at the memory now… How many records did you insert, what was the replication factor, how many nodes are in the cluster, and what bins are in the records - type and data size?


#8

Found something for you guys.

On Apr 6th I had the NS memory set to 28G. Later it hit the HWM, but the alert was on cluster size; below is the OOM for it.

/var/log/syslog.2.gz:1307:Apr  6 22:48:18 prd-nano001 kernel: [8417502.271227] asd invoked oom-killer: gfp_mask=0x24201ca, order=0, oom_score_adj=0
/var/log/syslog.2.gz:1317:Apr  6 22:48:18 prd-nano001 kernel: [8417502.271310]  [<ffffffff814cbe5a>] ? virtballoon_oom_notify+0x2a/0x80
/var/log/syslog.2.gz:1318:Apr  6 22:48:18 prd-nano001 kernel: [8417502.271319]  [<ffffffff81191442>] oom_kill_process+0x202/0x3c0
/var/log/syslog.2.gz:1359:Apr  6 22:48:18 prd-nano001 kernel: [8417502.271490] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name

I increased the size to 30G, set the HWM to 85%, and restarted Aerospike.

This evening we faced the same OOM, along with the Aerospike log pasted at the top.

/var/log/syslog:859:Apr  8 17:11:15 prd-nano001 kernel: [8570077.167388] asd invoked oom-killer: gfp_mask=0x24201ca, order=0, oom_score_adj=0
/var/log/syslog:869:Apr  8 17:11:15 prd-nano001 kernel: [8570077.167477]  [<ffffffff81191442>] oom_kill_process+0x202/0x3c0
/var/log/syslog:909:Apr  8 17:11:15 prd-nano001 kernel: [8570077.167668] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name

#9

Well, I need to know what records you are inserting so I can do a quick sizing calculation.


#10

Any specific details you are looking for? I managed to pull this out.

SHOW BINS;
+-------+--------+-------+-----------+
| quota | bin    | count | namespace |
+-------+--------+-------+-----------+
| 32768 | "data" | 1     | "nano"    |
+-------+--------+-------+-----------+

SHOW SETS;
+------------------+-------+----------------+----------+-------------------+---------------------+-------------------+--------------+------------+
| disable-eviction | ns    | set-enable-xdr | objects  | stop-writes-count | set                 | memory_data_bytes | truncate_lut | tombstones |
+------------------+-------+----------------+----------+-------------------+---------------------+-------------------+--------------+------------+
| "false"          | "nano" | "use-default"  | "206061" | "0"               | "fiction"           | "1050167270"      | "0"          | "0"        |
| "false"          | "nano" | "use-default"  | "216795" | "0"               | "stories"           | "319045270"       | "0"          | "0"        |
+------------------+-------+----------------+----------+-------------------+---------------------+-------------------+--------------+------------+

#11

So “fiction” records are ~5KB of data and “stories” records are about 1.5KB. In your case, the 64 bytes per record for the primary index is negligible compared to the data size. You have 206K “fiction” objects and 216K “stories” objects in this snapshot. If you keep inserting records in this proportion, then at approximately 23X - i.e. (206 x 23)K and (216 x 23)K records, roughly 5 million records of each set - you may crash the system because of how you have set your HWM. Is this what you are seeing? I am assuming this is a single-node cluster with replication-factor 1.
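The per-record estimates can be reproduced directly from the SHOW SETS output above (memory_data_bytes divided by objects); this is a rough average only:

```shell
# Average data bytes per record = memory_data_bytes / objects,
# using the SHOW SETS numbers pasted above.
avg_size() {
    echo $(( $1 / $2 ))
}

avg_size 1050167270 206061   # "fiction": prints 5096 (~5KB)
avg_size 319045270 216795    # "stories": prints 1471 (~1.5KB)
```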

Perhaps you can share output of:

$ asadm
Admin> info

#12
Admin> info
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Network Information~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                             Node               Node                    Ip        Build   Cluster   Migrations        Cluster     Cluster         Principal   Client     Uptime
                                                .                 Id                     .            .      Size            .            Key   Integrity                 .    Conns          .
prd-nano001:3000   *BB9D964FB005452   11.242.116.114:3000   C-3.15.1.3         5      0.000     A62EC8DE0075   True        BB9D964FB005452       72   13:29:21
prd-nano002:3000   BB9D564FB005452    11.242.116.115:3000   C-3.15.1.3         5      0.000     A62EC8DE0075   True        BB9D964FB005452       75   13:29:21
prd-nano003:3000   BB9D664FB005452    11.242.116.116:3000   C-3.15.1.3         5      0.000     A62EC8DE0075   True        BB9D964FB005452       72   13:29:21
prd-nano004:3000   BB9D464FB005452    11.242.116.117:3000   C-3.15.1.3         5      0.000     A62EC8DE0075   True        BB9D964FB005452       73   13:29:21
prd-nano005:3000   BB9D864FB005452    11.242.116.118:3000   C-3.15.1.3         5      0.000     A62EC8DE0075   True        BB9D964FB005452       79   13:29:21
Number of rows: 5

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Namespace Usage Information~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Namespace                                                Node       Total   Expirations,Evictions     Stop        Disk    Disk     HWM   Avail%         Mem     Mem    HWM      Stop
        .                                                   .     Records                       .   Writes        Used   Used%   Disk%        .        Used   Used%   Mem%   Writes%
nano         prd-nano001:3000   798.920 K   (152.050 K, 0.000)      false     2.349 GB   3       50      96        2.280 GB   9       95     98
nano         prd-nano002:3000   798.208 K   (154.550 K, 0.000)      false     2.349 GB   3       50      96        2.280 GB   9       95     98
nano         prd-nano003:3000   780.886 K   (150.369 K, 0.000)      false     2.299 GB   3       50      96        2.231 GB   9       95     98
nano         prd-nano004:3000   793.509 K   (161.825 K, 0.000)      false     2.336 GB   3       50      96        2.268 GB   9       95     98
nano         prd-nano005:3000   775.571 K   (152.686 K, 0.000)      false     2.282 GB   3       50      96        2.215 GB   9       95     98
nano                                                               3.947 M   (771.480 K, 0.000)               11.614 GB                            11.273 GB
Number of rows: 6

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Namespace Object Information~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Namespace                                                Node       Total     Repl                           Objects                   Tombstones             Pending   Rack
        .                                                   .     Records   Factor        (Master,Prole,Non-Replica)   (Master,Prole,Non-Replica)            Migrates     ID
        .                                                   .           .        .                                 .                            .             (tx,rx)      .
nano         prd-nano001:3000   798.920 K   2        (388.242 K, 410.678 K, 0.000)     (0.000,  0.000,  0.000)      (0.000,  0.000)     0
nano         prd-nano002:3000   798.208 K   2        (394.717 K, 403.491 K, 0.000)     (0.000,  0.000,  0.000)      (0.000,  0.000)     0
nano         prd-nano003:3000   780.886 K   2        (385.908 K, 394.978 K, 0.000)     (0.000,  0.000,  0.000)      (0.000,  0.000)     0
nano         prd-nano004:3000   793.511 K   2        (415.640 K, 377.871 K, 0.000)     (0.000,  0.000,  0.000)      (0.000,  0.000)     0
nano         prd-nano005:3000   775.571 K   2        (389.041 K, 386.530 K, 0.000)     (0.000,  0.000,  0.000)      (0.000,  0.000)     0
nano                                                               3.947 M            (1.974 M, 1.974 M, 0.000)         (0.000,  0.000,  0.000)      (0.000,  0.000)
Number of rows: 6

#13

I'm also trying to understand what you said about the data sizing pattern in the comment above.


#14

OK, so you have a 5-node cluster with a replication factor of 2, so there are two copies of every record. Each node currently holds about 800K records in total: ~400K master copies for some partitions and ~400K replica copies for other partitions. You have the namespace configured with data-in-memory true, so I am trying to see how much data is being stored in RAM.

With 800K records on a node (400K master, 400K replica), you are using 2.28 GB of RAM, or about 10% (the tool says 9) in round numbers. When you hit 10x that number - i.e. 4 to 5 million master and 4 to 5 million replica records, 8 to 10 million total per node - you may exhaust memory, because you have the namespace configured for 30G and the instance underneath is also 30G.

So, can you reproduce the crash? Is it true that you will keep adding records until you hit 4 million or so master records per node? If so, reduce your memory HWM to the recommended 60% with stop-writes at 90%, and then add more nodes to the cluster to accommodate your total record requirement. What is the maximum number of records you intend to store?
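The crash point can be sketched with simple linear extrapolation from the snapshot numbers (an estimate only; it ignores index overhead growth and uneven record distribution):

```shell
# Linear extrapolation: ~800K records/node currently use ~2.28 GB of a
# 30 GB namespace. Units are thousands of records and hundredths of a
# GB to keep the arithmetic integral.
records_k=800       # records per node now (master + replica), in K
used_gb_x100=228    # 2.28 GB
limit_gb_x100=3000  # 30 GB
max_records_k=$(( records_k * limit_gb_x100 / used_gb_x100 ))
echo "~${max_records_k}K records/node fills namespace memory"  # ~10526K, i.e. ~10.5M
```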


#15

See https://www.aerospike.com/docs/operations/plan/capacity


#16

Thanks Piyush. But let me put this question in a way that gets me what I'm looking for.

I understand the HWM was high. But we hit OOM when the memory given to the NS was 28G and the HWM was 75%. The cluster was down and syslog mentioned OOM. I know I have gone beyond the norm, but the cluster going down doesn't make sense with an OOM at 28GB.

OK, let's forget what has happened.

Going forward, should SYSTEM_MEM - NS_MEM = 5GB always be the difference, on all Aerospike systems? And would it then matter if I set HWM to 95% and stop-writes to 98%? The cluster shouldn't go down - it's fine if writes stop. Are there any other reasons the cluster can go down?


#17

Hi Mannoj,

I just want to clarify one point, without looking at all the details here: there are no settings that can fully prevent you from crossing the allocated memory on a namespace in every situation. Two simple examples:

1- The high water mark only triggers evictions, which are limited to 0.5% per cycle, and cycles can take a long time to come around (even with the default 2-minute configuration), depending on various parameters.

2- Stop-writes only prevents writes hitting a node directly (master side), but lets migration writes and prole (replica) writes through…

So even if you have a namespace configured for 30GB of memory and 50GB of total RAM, in some extreme situations you can still run out of RAM and have the namespace use much more than the 30GB allocated to it.

Now to answer your question: the difference between system memory and namespace memory is not the only thing to be cautious about, so giving it 5GB or 10GB or more of headroom is not necessarily going to prevent OOM situations (I am exaggerating on purpose to make sure you understand why). Of course, in your specific situation something else may have happened as well… We would need to look at the logs in detail.

Best, –meher


#18

There is a time component to all of this. As @kporter explained earlier, the nsup thread, which runs every 120 seconds by default, detects an HWM breach and evicts data. The number of records that can be evicted in any given nsup cycle is determined by the evict-tenths-pct and evict-hist-buckets parameters.

Stop-writes has had a 10-second evaluation window since version 3.15.0.1; prior to that, nsup also declared stop-writes. You are running 3.15.1.3, so nsup not declaring stop-writes is not the issue. But if evictions are not giving you enough relief, and your data velocity is high enough to blow through stop-writes and beyond in less than 10 seconds, then you are in trouble. So you really have to get some numbers on your write/update rate and analyze this.

With HWM set at 75%, what was stop-writes set at? The HWM gives some relief to memory consumption via evictions, but you can still get into a bad situation if you are writing fast enough, if deletions are slowing down nsup cycles (grep for thr_nsup in aerospike.log and look at the total time), or, as Meher pointed out, if writes to other nodes cause replica writes on this node, which are not hindered by stop-writes.

The only way your OOM situation can be fully analyzed is by sending your logs via collectinfo to our support folks (I trust you are running Enterprise Edition on this cluster). This is something unique to your situation.


#19

Thanks Meher and Piyush,

I understand that the eviction cycle is 2 minutes by default, during which the apps can sneak data into the cluster, and that migrations between nodes can also cause data growth. These factors can push memory beyond the configured NS HWM, and if the apps keep writing at very high TPS this can consume all the memory in the system, potentially causing the OOM. Correct me if I'm wrong. If any logs are needed from my end, I can provide whatever you need; I just want to get to the bottom of this problem. If such help is not anticipated from the community, that's fine - I completely understand - but I am interested to see what the actual problem is.

@Piyush, regarding "the only way your OOM situation can be fully analyzed is by sending your logs using collectinfo to our support folks (trust you are running Enterprise Edition on this cluster). This is something unique to your situation." -> I have PM'd you. And I might continue asking in the community forum. If the effort needed to answer this problem requires Enterprise, I totally understand. Let's not continue this thread any more.

Considering the various other factors, it is always good to stay at the recommended HWM and not increase it. To make sure this holds, I have come up with a formula:

formula: { mem_used_pct% + [ mem_used_pct% / (total_nodes - accepted_#_of_node_fail) ] + 3% } < MEM_HWM%

accepted_#_of_node_fail = 1
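As a worked instance of this heuristic (the poster's own rule of thumb, not an official Aerospike recommendation), plugging in numbers seen earlier in the thread:

```shell
# Worked example of the heuristic above with illustrative numbers from
# this thread: mem_used_pct 9%, 5 nodes, 1 tolerated node failure.
mem_used_pct=9
total_nodes=5
accepted_node_fail=1
needed=$(( mem_used_pct + mem_used_pct / (total_nodes - accepted_node_fail) + 3 ))
echo "need MEM_HWM% > ${needed}%"   # 9 + 9/4 + 3 = 14 (integer division)
```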