Server hung and last log defrag


#1

Hi,

Our server suddenly hung, and we had to restart the box to bring it back to life. The Aerospike log at the time of the hang is as follows:

Dec 27 2018 03:46:35 GMT: INFO (drv_ssd): (drv_ssd.c:2115) {namesapce} /dev/sdf: used-bytes 122008425472 free-wblocks 2428167 write-q 0 write (9919034,3.3) defrag-q 11 defrag-read (9772465,3.7) defrag-write (4731407,1.5)

Could you please help me find the root cause?


#2

Was it OOM?

grep -i 'killed process' /var/log/messages

When you say ‘hung’, was the process still running?

What version of Aerospike are you running?


#3

Yes, the process was in running status. The version is 3.13.0.11,

running on Ubuntu 14.04.5 LTS (GNU/Linux 3.13.0-162-generic x86_64).


#4

How did you verify that the process was still running?


#5

/etc/init.d/aerospike status

It took a long time to return a result.

Also, AMC shows the node status as down.

We then had to restart the box twice to bring the node back online.


#6

What was the output of the grep command I provided?


#7

grep: /var/log/messages: No such file or directory


#8

Try

grep -i 'killed process' /var/log/syslog

You may need to check whether there is an archived log from the date of this incident.


#9

No result. The output is empty.


#10

Since it has been several days, the logs have probably been archived into a .gz file.

You will need to zcat the archived file from the date of this incident and grep for that string.

Typically, when logs suddenly come to a halt, it indicates that the process was killed by the kernel’s OOM killer, which could mean that your configuration overutilizes this machine.
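
As a rough sketch of the search described above, the following checks both the current and the rotated syslog files in one pass. The function name find_oom_kills is hypothetical, and it assumes Ubuntu-style rotation (syslog, syslog.1, syslog.2.gz, ...) under the directory you pass in; adjust the file pattern if your logs rotate differently.

```shell
# Hypothetical helper: search current and rotated syslog files for OOM-killer entries.
# Assumption: files are named syslog, syslog.1, syslog.2.gz, ... in the given directory.
find_oom_kills() {
    log_dir="${1:-/var/log}"
    for f in "$log_dir"/syslog*; do
        [ -f "$f" ] || continue
        case "$f" in
            *.gz) gzip -dc "$f" ;;   # archived log: decompress to stdout
            *)    cat "$f" ;;        # plain log: read as-is
        esac |
            grep -i -e 'killed process' -e 'out of memory' |
            while IFS= read -r line; do
                # Prefix each match with the file it came from
                printf '%s: %s\n' "$f" "$line"
            done
    done
}
```

Calling find_oom_kills /var/log (with sudo if the files are not readable) prints any matching kernel lines prefixed by the file name. No output means no OOM kill was recorded in the logs that are still on disk.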


#11

[ 16.809047] init: failsafe main process (1208) killed by TERM signal

The same issue happened again an hour ago, and we had to restart the server. The line above is what I got from the log.


#12

This would be unrelated.