sdnaffi
December 27, 2018, 4:43am
1
Hi,
Our server suddenly hung, and we had to restart the box to bring it back to life. The Aerospike log at the time of the hang is as follows:
Dec 27 2018 03:46:35 GMT: INFO (drv_ssd): (drv_ssd.c:2115) {namesapce} /dev/sdf: used-bytes 122008425472 free-wblocks 2428167 write-q 0 write (9919034,3.3) defrag-q 11 defrag-read (9772465,3.7) defrag-write (4731407,1.5)
Could you please help me to find the root cause?
kporter
December 27, 2018, 5:31am
2
Was it OOM?
grep -i 'killed process' /var/log/messages
When you say ‘hung’, was the process still running?
What version of Aerospike are you running?
sdnaffi
December 27, 2018, 5:44am
3
Yes, the process was in running status. The version is 3.13.0.11,
running on Ubuntu 14.04.5 LTS (GNU/Linux 3.13.0-162-generic x86_64).
kporter
December 27, 2018, 5:46am
4
How did you verify that the process was still running?
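One direct way to check, as a minimal sketch: look the process up by name with pgrep. The helper name `is_running` is hypothetical, and the default name `asd` assumes the standard Aerospike daemon binary name.

```shell
# Return success if a process with exactly this name exists.
# The default "asd" is an assumption (the usual Aerospike daemon name).
is_running() {
    pgrep -x "${1:-asd}" >/dev/null
}

# Usage: is_running asd && echo "asd is running" || echo "asd is not running"
```

Note that a process can be present in the process table and still be unresponsive, so this only rules out the case where it has exited or been killed.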
sdnaffi
December 27, 2018, 5:52am
5
/etc/init.d/aerospike status
It took a long time to return a result.
Also, AMC shows the node as down.
We then restarted the box twice to bring the node back online.
kporter
December 27, 2018, 7:30am
6
What was the output of the grep command I provided?
sdnaffi
December 31, 2018, 4:11am
7
grep: /var/log/messages: No such file or directory
kporter
December 31, 2018, 4:40am
8
Try
grep -i 'killed process' /var/log/syslog
You may need to check whether there is an archived log from the date of this incident.
kporter
December 31, 2018, 4:45am
10
Since it has been several days, the logs have probably been archived into a .gz file.
You will need to zcat the archived file from the date of this incident and grep for that string.
Typically, when logs suddenly come to a halt, it indicates that the process was killed by the kernel’s OOM killer, which could mean that your configuration overutilizes this machine.
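As a rough sketch of that search, zgrep can read both the current log and the rotated .gz archives in one pass. The helper name `find_oom_kills` is hypothetical, and the `syslog*` file pattern assumes Ubuntu's default log rotation naming under /var/log.

```shell
# Search the current and rotated syslog files for OOM-killer messages.
# Assumption: logs live in /var/log and rotated copies are named
# syslog.1, syslog.2.gz, and so on (the Ubuntu default).
find_oom_kills() {
    log_dir="${1:-/var/log}"
    # zgrep handles both plain-text and gzip-compressed files.
    zgrep -i 'killed process' "$log_dir"/syslog* 2>/dev/null
}

# Usage: find_oom_kills /var/log
```

If this prints nothing for the date of the incident, the OOM killer is a less likely cause and the hang needs to be investigated another way.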
sdnaffi
December 31, 2018, 4:53am
11
[ 16.809047] init: failsafe main process (1208) killed by TERM signal
The same issue happened an hour ago, and we had to restart the server. The line above is what I got from the log.