We ran into an HWM breach and the cluster went down. I don’t see the asd process anymore on the servers. Ideally it should either keep evicting or pause writes when it reaches 90%. But why is the node not responding to heartbeat, and why does the cluster size drop to 0?
If you configure memory for a namespace - say 4G RAM - and your actual instance only has 2G RAM - Aerospike will not check whether you actually have 4G on your instance. It will start allocating in 1G chunks on Enterprise Edition, and you can crash the server even before breaching the HWM. So I would first check the actual memory available on the machine versus what you allocated to the namespaces - combined across all the namespaces you have - and make sure you have enough headroom for the OS and other processes. This may be a case of Linux’s OOM killer taking Aerospike down because it was pulling more memory than Linux could give it.
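A quick way to sanity-check this is to add up the `memory-size` values across your namespaces and compare against physical RAM. The sketch below uses made-up allocations and headroom figures purely for illustration; substitute your own numbers.

```python
GIB = 1024 ** 3

# Hypothetical per-namespace memory-size allocations (assumed values):
namespace_memory = {
    "fiction": 4 * GIB,
    "stories": 4 * GIB,
}
system_ram = 8 * GIB      # physical RAM actually on the instance (assumed)
os_headroom = 2 * GIB     # reserve for the kernel, page cache, other daemons

total_allocated = sum(namespace_memory.values())
overcommitted = total_allocated + os_headroom > system_ram
if overcommitted:
    print(f"Overcommitted: {total_allocated // GIB} GiB allocated vs "
          f"{system_ram // GIB} GiB RAM; the OOM killer may take asd down")
```

With these example numbers the check fires: 8 GiB allocated plus 2 GiB headroom exceeds the 8 GiB of physical RAM, which is exactly the kind of silent overcommit described above.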
Looks like memory is not the issue at present, but your configuration is nonetheless a recipe for trouble. So, what was the memory situation when you crashed? Since we are looking at memory now: how many records did you insert, what was the replication factor, how many nodes were in the cluster, and what were the bins in the records - type and data size?
So “fiction” records are ~5KB in data, “stories” are about 1.5KB. In your case, the 64 bytes per record for the primary index is negligible compared to the data size. You have 206K “fiction” objects and 216K “stories” objects in this snapshot. If you keep inserting records in this proportion, then at approximately 23X - i.e. (206×23)K and (216×23)K records, roughly 5 million records of each set - you may crash the system because of the way you have set your HWM settings. Is this what you are seeing? I am assuming this is a single-node cluster, replication-factor 1.
Ok, so you have a 5-node cluster with a replication factor of 2, so there are two copies of every record. Each node is currently holding about 800K records in total: 400K master copies for some partitions and 400K replica copies for other partitions. You have the namespace configured with data-in-memory true, so I am trying to see how much data is being stored in RAM. With 800K records (400K master, 400K replica) on a node, you are using 2.28 GB of RAM, or about 10% (the stats say 9%) in round numbers. When you hit 10x that number - i.e. 4 to 5 million master and 4 to 5 million replica, 8 to 10 million total records per node - you may run out of memory, because you have set your namespace to 30G and your instance underneath is also 30G. So, can you replicate the crash? Is it true that you will keep adding records until you hit 4 million or so master records? If so, all you have to do is reduce your memory HWM to the recommended 60% with stop-writes at 90% - and then add more nodes to your cluster to accommodate your total records requirement. What is the maximum number of records you intend to store?
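The projection above can be checked with simple arithmetic from the numbers already in this thread (800K records using 2.28 GB, a 30 GB namespace limit); only the round-number record target is an assumption:

```python
# Back-of-the-envelope projection from the figures in this thread.
records_now = 800_000            # master + replica records on one node today
ram_now_gb = 2.28                # reported RAM use at that record count
namespace_limit_gb = 30          # configured memory-size (same as instance RAM)

per_record_gb = ram_now_gb / records_now
projected_records = 10_000_000   # ~5M master + ~5M replica per node (assumed target)
projected_gb = projected_records * per_record_gb
print(f"~{projected_gb:.1f} GB at {projected_records:,} records per node")
# ~28.5 GB - right at the 30 GB limit, with no headroom left for the OS
```

That is why the crash is expected around the 4-5 million master-record mark: the namespace limit and the instance RAM are the same number, so nothing protects the OS.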
Thanks Piyush. Let me rephrase the question so I get what I’m looking for.
I understand the HWM was high. But we hit OOM when the memory given to the namespace was 28 GB and the HWM was 75%. The cluster was down, and syslog mentioned OOM. I know I had gone beyond the recommended settings, but the cluster going down to OOM with only 28 GB allocated doesn’t make sense to me.
Ok, let’s forget what has happened.
Going forward, should SYSTEM_MEM − NS_MEM = 5GB be the difference I maintain - always, on all Aerospike systems? And does it then matter if I set the HWM to 95% and stop-writes to 98%? The cluster shouldn’t go down; it’s OK if writes stop. Is there any reason the cluster can still go down?
I just want to clarify one point, without looking at all the details here: there are no settings that can fully prevent you from crossing the allocated memory on a namespace in every situation. Two simple examples:
1- The high water mark will only trigger evictions, which are limited to 0.5% per cycle, and the cycles can take a long time to come around (even with the default 2-minute configuration), depending on various parameters.
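To see why the eviction limit matters, here is the arithmetic with the two numbers from the point above (0.5% per cycle, 120-second cycles); the 10% backlog is an assumed example:

```python
# Why evictions alone cannot keep up with a fast writer:
evict_pct_per_cycle = 0.5   # at most ~0.5% of records evicted per nsup cycle
cycle_seconds = 120         # default nsup period

# Cycles needed to shed a 10% backlog, ignoring incoming writes (assumed):
backlog_pct = 10
cycles = backlog_pct / evict_pct_per_cycle
minutes = cycles * cycle_seconds / 60
print(f"{cycles:.0f} cycles ≈ {minutes:.0f} minutes")
# → 20 cycles ≈ 40 minutes - a high-TPS writer can outrun this easily
```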
2- The stop-writes threshold will only block client writes hitting a node directly (on the master side), but will let migration writes and replica (“prole”) writes through…
So even if you have a namespace configured for 30GB of memory and 50GB of total RAM, you can still, in some extreme situations, run out of RAM and have the namespace use much more than the 30GB allocated to it.
Now to answer your question: the gap between system memory and namespace memory is not the only thing to be cautious about. So whether you give it 5GB or 10GB or more is not necessarily going to prevent OOM situations (I am exaggerating on purpose to make sure you understand why). Of course, in your specific situation, something else may have happened as well… We would need to look at the logs in detail.
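For reference, the settings being discussed live in the namespace stanza of aerospike.conf. This is a minimal sketch only, assuming a data-in-memory namespace on a pre-6.x server; the namespace name and values are illustrative, with the thresholds set to the recommendations from earlier in the thread:

```
namespace fiction {
    replication-factor 2
    memory-size 30G              # keep the sum across namespaces well below physical RAM
    high-water-memory-pct 60     # recommended: evictions begin here
    stop-writes-pct 90           # client writes blocked here; prole/migrate writes still land
    storage-engine memory
}
```

Note that even with these values, the two caveats above still apply: evictions are rate-limited and stop-writes does not block replica or migration writes.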
There is a time component to all this. As @kporter explained earlier, the nsup thread, which runs every 120 seconds by default, detects a breach of the HWM and evicts data. The number of records that can be evicted in any given nsup cycle is determined by the evict-tenths-pct and evict-hist-buckets parameters. Stop-writes is evaluated on a 10-second window since ver 22.214.171.124; prior to 126.96.36.199, nsup used to declare stop-writes as well. You are running 188.8.131.52, so nsup not declaring stop-writes is not the issue. If evictions are not giving you enough relief, and your data velocity is high enough that you can blow through stop-writes and beyond in less than 10 seconds, then you are in trouble. So you really have to get some numbers on your write/update rate and analyze this. With the HWM set at 75%, what was stop-writes set at? The HWM gives some relief to memory consumption via evictions, but if you are writing fast enough, if nsup cycles are slowing down (grep thr_nsup in aerospike.log and look at the total time), or, as Meher pointed out, if writes to other nodes cause replica writes on this node - which are not hindered by stop-writes - you can get into a bad situation. The only way your OOM situation can be fully analyzed is by sending your logs, gathered with collectinfo, to our support folks (I trust you are running Enterprise Edition on this cluster). This is unique to your situation.
I understand: the eviction cycle defaults to 2 minutes, apps can sneak data into the cluster in the meantime, and migrations between nodes can also cause data growth. These factors can push usage beyond the configured namespace HWM, and if apps keep writing at very high TPS, they can consume all the memory in the system, potentially causing OOM. Correct me if I’m wrong. If any logs are needed from my end, I can provide whatever you need - I just want to get to the bottom of this problem. If such help is not expected from the community, that’s fine, I completely understand, but I am interested to see what the actual problem is.
@Piyush, regarding:
“The only way your OOM situation can be fully analyzed is by sending your logs using collectinfo to our support folks (trust you are running Enterprise Edition on this cluster). This is something unique to your situation.” → I have PM’d you, and I may continue asking in the community forum. If answering this problem requires Enterprise support, I totally understand. Let’s not continue this thread any further.
Considering the various other factors, it is always best to stay at the recommended HWM and not increase it. To make sure this happens, I have come up with a formula: