Node Crash


#1

Good Morning,

This morning we discovered a node crash.

Dec 25 2014 00:47:52 GMT: INFO (info): (thr_info.c::4567)    tree_counts: nsup 0 scan 0 batch 0 dup 0 wprocess 0 migrx 0 migtx 0 ssdr 0 ssdw 0 rw 2
Dec 25 2014 00:47:52 GMT: INFO (info): (thr_info.c::4583) namespace user: disk inuse: 9564527616 memory inuse: 5740089462 (bytes) sindex memory inuse: 0 (bytes) avail pct 66
Dec 25 2014 00:47:52 GMT: INFO (info): (thr_info.c::4583) namespace bidder: disk inuse: 278400 memory inuse: 123647 (bytes) sindex memory inuse: 0 (bytes) avail pct 99
Dec 25 2014 00:47:52 GMT: INFO (info): (thr_info.c::4583) namespace request: disk inuse: 4315837056 memory inuse: 1842697864 (bytes) sindex memory inuse: 0 (bytes) avail pct 3
Dec 25 2014 00:47:52 GMT: INFO (info): (thr_info.c::4608)    partitions: actual 4005 sync 8283 desync 0 zombie 0 wait 0 absent 0
Dec 25 2014 00:47:52 GMT: INFO (info): (hist.c::137) histogram dump: reads (2843615631 total) msec
Dec 25 2014 00:47:52 GMT: INFO (info): (hist.c::154)  (00: 2842126775) (01: 0001054545) (02: 0000310603) (03: 0000111273)
Dec 25 2014 00:47:52 GMT: INFO (info): (hist.c::154)  (04: 0000012341) (05: 0000000044) (06: 0000000041) (07: 0000000006)
Dec 25 2014 00:47:52 GMT: INFO (info): (hist.c::163)  (08: 0000000003)
Dec 25 2014 00:47:52 GMT: INFO (info): (hist.c::137) histogram dump: writes_master (470662124 total) msec
Dec 25 2014 00:47:52 GMT: INFO (info): (hist.c::154)  (00: 0328704060) (01: 0021876293) (02: 0028338487) (03: 0035313458)
Dec 25 2014 00:47:52 GMT: INFO (info): (hist.c::154)  (04: 0034013758) (05: 0018557423) (06: 0003701511) (07: 0000156271)
Dec 25 2014 00:47:52 GMT: INFO (info): (hist.c::163)  (08: 0000000863)
Dec 25 2014 00:47:52 GMT: INFO (info): (hist.c::137) histogram dump: proxy (0 total) msec
Dec 25 2014 00:47:52 GMT: INFO (info): (hist.c::137) histogram dump: writes_reply (470662124 total) msec
Dec 25 2014 00:47:52 GMT: INFO (info): (hist.c::154)  (00: 0328712481) (01: 0021873068) (02: 0028336820) (03: 0035311940)
Dec 25 2014 00:47:52 GMT: INFO (info): (hist.c::154)  (04: 0034012498) (05: 0018556799) (06: 0003701386) (07: 0000156269)
Dec 25 2014 00:47:52 GMT: INFO (info): (hist.c::163)  (08: 0000000863)
Dec 25 2014 00:47:52 GMT: INFO (info): (hist.c::137) histogram dump: udf (0 total) msec
Dec 25 2014 00:47:52 GMT: INFO (info): (hist.c::137) histogram dump: query (0 total) msec
Dec 25 2014 00:47:52 GMT: INFO (info): (hist.c::137) histogram dump: query_rec_count (0 total) count
Dec 25 2014 00:48:02 GMT: INFO (info): (thr_info.c::4520)  system memory: free 1776848kb ( 1 percent free )
Dec 25 2014 00:48:02 GMT: INFO (info): (thr_info.c::4527)  migrates in progress ( 0 , 0 ) ::: ClusterSize 3 ::: objects 83949287
Dec 25 2014 00:48:02 GMT: INFO (info): (thr_info.c::4535)  rec refs 83949287 ::: rec locks 0 ::: trees 0 ::: wr reqs 3 ::: mig tx 0 ::: mig rx 0
Dec 25 2014 00:48:02 GMT: INFO (info): (thr_info.c::4541)  replica errs :: null 0 non-null 0 ::: sync copy errs :: node 0 :: master 0
Dec 25 2014 00:48:02 GMT: INFO (info): (thr_info.c::4551)    trans_in_progress: wr 3 prox 0 wait 41 ::: q 0 ::: bq 0 ::: iq 0 ::: dq 0 : fds - proto (501, 1703205, 1702704) : hb
 25531, 25532) : fab (44, 44, 0)
Dec 25 2014 00:48:02 GMT: INFO (info): (thr_info.c::4553)    heartbeat_received: self 0 : foreign 1143464
Dec 25 2014 00:48:02 GMT: INFO (info): (thr_info.c::4554)    heartbeat_stats: bt 0 bf 584372 nt 0 ni 0 nn 0 nnir 0 nal 0 sf1 0 sf2 0 sf3 0 sf4 0 sf5 0 sf6 8 mrf 0 eh 0 efd 0 efa
 0 mcf 25528 rc 25524
Dec 25 2014 00:48:02 GMT: INFO (info): (thr_info.c::4567)    tree_counts: nsup 0 scan 0 batch 0 dup 0 wprocess 0 migrx 0 migtx 0 ssdr 0 ssdw 0 rw 1
Dec 25 2014 00:48:02 GMT: INFO (info): (thr_info.c::4583) namespace user: disk inuse: 9566565632 memory inuse: 5741316897 (bytes) sindex memory inuse: 0 (bytes) avail pct 66
Dec 25 2014 00:48:02 GMT: INFO (info): (thr_info.c::4583) namespace bidder: disk inuse: 278400 memory inuse: 123628 (bytes) sindex memory inuse: 0 (bytes) avail pct 99
Dec 25 2014 00:48:02 GMT: INFO (info): (thr_info.c::4583) namespace request: disk inuse: 4324796416 memory inuse: 1846526741 (bytes) sindex memory inuse: 0 (bytes) avail pct 3
Dec 25 2014 00:48:02 GMT: INFO (info): (thr_info.c::4608)    partitions: actual 4005 sync 8283 desync 0 zombie 0 wait 0 absent 0
Dec 25 2014 00:48:02 GMT: INFO (info): (hist.c::137) histogram dump: reads (2844305518 total) msec
Dec 25 2014 00:48:02 GMT: INFO (info): (hist.c::154)  (00: 2842816601) (01: 0001054572) (02: 0000310632) (03: 0000111278)
Dec 25 2014 00:48:02 GMT: INFO (info): (hist.c::154)  (04: 0000012341) (05: 0000000044) (06: 0000000041) (07: 0000000006)
Dec 25 2014 00:48:02 GMT: INFO (info): (hist.c::163)  (08: 0000000003)
Dec 25 2014 00:48:02 GMT: INFO (info): (hist.c::137) histogram dump: writes_master (470746271 total) msec
Dec 25 2014 00:48:02 GMT: INFO (info): (hist.c::154)  (00: 0328751325) (01: 0021880790) (02: 0028345668) (03: 0035323082)
Dec 25 2014 00:48:02 GMT: INFO (info): (hist.c::154)  (04: 0034023457) (05: 0018562458) (06: 0003702335) (07: 0000156293)
Dec 25 2014 00:48:02 GMT: INFO (info): (hist.c::163)  (08: 0000000863)
Dec 25 2014 00:48:02 GMT: INFO (info): (hist.c::137) histogram dump: proxy (0 total) msec
Dec 25 2014 00:48:02 GMT: INFO (info): (hist.c::137) histogram dump: writes_reply (470746271 total) msec
Dec 25 2014 00:48:02 GMT: INFO (info): (hist.c::154)  (00: 0328759747) (01: 0021877566) (02: 0028344001) (03: 0035321562)
Dec 25 2014 00:48:02 GMT: INFO (info): (hist.c::154)  (04: 0034022198) (05: 0018561833) (06: 0003702210) (07: 0000156291)
Dec 25 2014 00:48:02 GMT: INFO (info): (hist.c::163)  (08: 0000000863)
Dec 25 2014 00:48:02 GMT: INFO (info): (hist.c::137) histogram dump: udf (0 total) msec
Dec 25 2014 00:48:02 GMT: INFO (info): (hist.c::137) histogram dump: query (0 total) msec
Dec 25 2014 00:48:02 GMT: INFO (info): (hist.c::137) histogram dump: query_rec_count (0 total) count
Dec 25 2014 00:48:12 GMT: INFO (drv_ssd): (drv_ssd.c::2359) device /database/user.dat: used 9568460160, contig-free 32818M (262549 wblocks), swb-free 1, n-w 0, w-q 0 w-tot 26329
7/s), defrag-q 0 defrag-tot 132639 (2.9/s)
Dec 25 2014 00:48:12 GMT: INFO (drv_ssd): (drv_ssd.c::2359) device /database/bidder.dat: used 278400, contig-free 16382M (131060 wblocks), swb-free 14, n-w 0, w-q 0 w-tot 122481
91.2/s), defrag-q 0 defrag-tot 12247722 (291.0/s)
Dec 25 2014 00:48:12 GMT: INFO (drv_ssd): (drv_ssd.c::2359) device /database/request.dat: used 4333523840, contig-free 4651M (37212 wblocks), swb-free 0, n-w 0, w-q 0 w-tot 9785
4.6/s), defrag-q 0 defrag-tot 0 (0.0/s)
Dec 25 2014 00:48:12 GMT: INFO (info): (thr_info.c::4520)  system memory: free 1757920kb ( 1 percent free )
Dec 25 2014 00:48:12 GMT: INFO (info): (thr_info.c::4527)  migrates in progress ( 0 , 0 ) ::: ClusterSize 3 ::: objects 83986538
Dec 25 2014 00:48:12 GMT: INFO (info): (thr_info.c::4535)  rec refs 83986541 ::: rec locks 3 ::: trees 0 ::: wr reqs 2 ::: mig tx 0 ::: mig rx 0
Dec 25 2014 00:48:12 GMT: INFO (info): (thr_info.c::4541)  replica errs :: null 0 non-null 0 ::: sync copy errs :: node 0 :: master 0
Dec 25 2014 00:48:12 GMT: INFO (info): (thr_info.c::4551)    trans_in_progress: wr 2 prox 0 wait 41 ::: q 0 ::: bq 0 ::: iq 0 ::: dq 0 : fds - proto (470, 1703596, 1703126) : hb
 25536, 25537) : fab (44, 44, 0)
Dec 25 2014 00:48:12 GMT: INFO (info): (thr_info.c::4553)    heartbeat_received: self 0 : foreign 1143662
Dec 25 2014 00:48:12 GMT: INFO (info): (thr_info.c::4554)    heartbeat_stats: bt 0 bf 584476 nt 0 ni 0 nn 0 nnir 0 nal 0 sf1 0 sf2 0 sf3 0 sf4 0 sf5 0 sf6 8 mrf 0 eh 0 efd 0 efa
 0 mcf 25533 rc 25529
Dec 25 2014 00:48:12 GMT: INFO (info): (thr_info.c::4567)    tree_counts: nsup 0 scan 0 batch 0 dup 0 wprocess 0 migrx 0 migtx 0 ssdr 0 ssdw 0 rw 9
Dec 25 2014 00:48:12 GMT: INFO (info): (thr_info.c::4583) namespace user: disk inuse: 9568498688 memory inuse: 5742477756 (bytes) sindex memory inuse: 0 (bytes) avail pct 66
Dec 25 2014 00:48:12 GMT: INFO (info): (thr_info.c::4583) namespace bidder: disk inuse: 278400 memory inuse: 123655 (bytes) sindex memory inuse: 0 (bytes) avail pct 99
Dec 25 2014 00:48:12 GMT: INFO (info): (thr_info.c::4583) namespace request: disk inuse: 4333652736 memory inuse: 1850311540 (bytes) sindex memory inuse: 0 (bytes) avail pct 3
Dec 25 2014 00:48:12 GMT: INFO (info): (thr_info.c::4608)    partitions: actual 4005 sync 8283 desync 0 zombie 0 wait 0 absent 0
Dec 25 2014 00:48:12 GMT: INFO (info): (hist.c::137) histogram dump: reads (2844986362 total) msec
Dec 25 2014 00:48:12 GMT: INFO (info): (hist.c::154)  (00: 2843497404) (01: 0001054597) (02: 0000310645) (03: 0000111281)
Dec 25 2014 00:48:12 GMT: INFO (info): (hist.c::154)  (04: 0000012341) (05: 0000000044) (06: 0000000041) (07: 0000000006)
Dec 25 2014 00:48:12 GMT: INFO (info): (hist.c::163)  (08: 0000000003)
Dec 25 2014 00:48:12 GMT: INFO (info): (hist.c::137) histogram dump: writes_master (470830421 total) msec
Dec 25 2014 00:48:12 GMT: INFO (info): (hist.c::154)  (00: 0328798634) (01: 0021885173) (02: 0028352940) (03: 0035332765)
Dec 25 2014 00:48:12 GMT: INFO (info): (hist.c::154)  (04: 0034033023) (05: 0018567541) (06: 0003703178) (07: 0000156304)
Dec 25 2014 00:48:12 GMT: INFO (info): (hist.c::163)  (08: 0000000863)
Dec 25 2014 00:48:12 GMT: INFO (info): (hist.c::137) histogram dump: proxy (0 total) msec
Dec 25 2014 00:48:12 GMT: INFO (info): (hist.c::137) histogram dump: writes_reply (470830421 total) msec
Dec 25 2014 00:48:12 GMT: INFO (info): (hist.c::154)  (00: 0328807058) (01: 0021881949) (02: 0028351272) (03: 0035331244)
Dec 25 2014 00:48:12 GMT: INFO (info): (hist.c::154)  (04: 0034031764) (05: 0018566916) (06: 0003703053) (07: 0000156302)
Dec 25 2014 00:48:12 GMT: INFO (info): (hist.c::163)  (08: 0000000863)
Dec 25 2014 00:48:12 GMT: INFO (info): (hist.c::137) histogram dump: udf (0 total) msec
Dec 25 2014 00:48:12 GMT: INFO (info): (hist.c::137) histogram dump: query (0 total) msec
Dec 25 2014 00:48:12 GMT: INFO (info): (hist.c::137) histogram dump: query_rec_count (0 total) count
Dec 25 2014 00:48:22 GMT: INFO (info): (thr_info.c::4520)  system memory: free 1737532kb ( 1 percent free )
Dec 25 2014 00:48:22 GMT: INFO (info): (thr_info.c::4527)  migrates in progress ( 0 , 0 ) ::: ClusterSize 3 ::: objects 84024211
Dec 25 2014 00:48:22 GMT: INFO (info): (thr_info.c::4535)  rec refs 84024212 ::: rec locks 1 ::: trees 0 ::: wr reqs 13 ::: mig tx 0 ::: mig rx 0
Dec 25 2014 00:48:22 GMT: INFO (info): (thr_info.c::4541)  replica errs :: null 0 non-null 0 ::: sync copy errs :: node 0 :: master 0
Dec 25 2014 00:48:22 GMT: INFO (info): (thr_info.c::4551)    trans_in_progress: wr 3 prox 0 wait 26 ::: q 4 ::: bq 0 ::: iq 0 ::: dq 0 : fds - proto (489, 1703893, 1703404) : hb
 25540, 25541) : fab (44, 44, 0)
Dec 25 2014 00:48:22 GMT: INFO (info): (thr_info.c::4553)    heartbeat_received: self 0 : foreign 1143863
Dec 25 2014 00:48:22 GMT: INFO (info): (thr_info.c::4554)    heartbeat_stats: bt 0 bf 584583 nt 0 ni 0 nn 0 nnir 0 nal 0 sf1 0 sf2 0 sf3 0 sf4 0 sf5 0 sf6 8 mrf 0 eh 0 efd 0 efa
 0 mcf 25537 rc 25533
Dec 25 2014 00:48:22 GMT: INFO (info): (thr_info.c::4567)    tree_counts: nsup 0 scan 0 batch 0 dup 0 wprocess 0 migrx 0 migtx 0 ssdr 0 ssdw 0 rw 18
Dec 25 2014 00:48:22 GMT: INFO (info): (thr_info.c::4583) namespace user: disk inuse: 9570457344 memory inuse: 5743655025 (bytes) sindex memory inuse: 0 (bytes) avail pct 66
Dec 25 2014 00:48:22 GMT: INFO (info): (thr_info.c::4583) namespace bidder: disk inuse: 278400 memory inuse: 123641 (bytes) sindex memory inuse: 0 (bytes) avail pct 99
Dec 25 2014 00:48:22 GMT: INFO (info): (thr_info.c::4583) namespace request: disk inuse: 4342589184 memory inuse: 1854130228 (bytes) sindex memory inuse: 0 (bytes) avail pct 3
Dec 25 2014 00:48:22 GMT: INFO (info): (thr_info.c::4608)    partitions: actual 4005 sync 8283 desync 0 zombie 0 wait 0 absent 0
Dec 25 2014 00:48:22 GMT: INFO (info): (hist.c::137) histogram dump: reads (2845670158 total) msec
Dec 25 2014 00:48:22 GMT: INFO (info): (hist.c::154)  (00: 2844181119) (01: 0001054630) (02: 0000310665) (03: 0000111309)
Dec 25 2014 00:48:22 GMT: INFO (info): (hist.c::154)  (04: 0000012341) (05: 0000000044) (06: 0000000041) (07: 0000000006)
Dec 25 2014 00:48:22 GMT: INFO (info): (hist.c::163)  (08: 0000000003)
Dec 25 2014 00:48:22 GMT: INFO (info): (hist.c::137) histogram dump: writes_master (470914724 total) msec
Dec 25 2014 00:48:22 GMT: INFO (info): (hist.c::154)  (00: 0328846058) (01: 0021889695) (02: 0028360028) (03: 0035342551)
Dec 25 2014 00:48:22 GMT: INFO (info): (hist.c::154)  (04: 0034042504) (05: 0018572638) (06: 0003704066) (07: 0000156321)
Dec 25 2014 00:48:22 GMT: INFO (info): (hist.c::163)  (08: 0000000863)
Dec 25 2014 00:48:22 GMT: INFO (info): (hist.c::137) histogram dump: proxy (0 total) msec
Dec 25 2014 00:48:22 GMT: INFO (info): (hist.c::137) histogram dump: writes_reply (470914724 total) msec
Dec 25 2014 00:48:22 GMT: INFO (info): (hist.c::154)  (00: 0328854482) (01: 0021886471) (02: 0028358362) (03: 0035341028)
Dec 25 2014 00:48:22 GMT: INFO (info): (hist.c::154)  (04: 0034041246) (05: 0018572012) (06: 0003703941) (07: 0000156319)
Dec 25 2014 00:48:22 GMT: INFO (info): (hist.c::163)  (08: 0000000863)
Dec 25 2014 00:48:22 GMT: INFO (info): (hist.c::137) histogram dump: udf (0 total) msec
Dec 25 2014 00:48:22 GMT: INFO (info): (hist.c::137) histogram dump: query (0 total) msec
Dec 25 2014 00:48:22 GMT: INFO (info): (hist.c::137) histogram dump: query_rec_count (0 total) count
Dec 25 2014 00:48:31 GMT: INFO (nsup): (thr_nsup.c::1221) {user} nsup start
Dec 25 2014 00:48:32 GMT: INFO (drv_ssd): (drv_ssd.c::2359) device /database/user.dat: used 9572407040, contig-free 32811M (262494 wblocks), swb-free 1, n-w 0, w-q 0 w-tot 26340
6/s), defrag-q 0 defrag-tot 132696 (2.8/s)
Dec 25 2014 00:48:32 GMT: INFO (drv_ssd): (drv_ssd.c::2359) device /database/bidder.dat: used 278400, contig-free 16382M (131060 wblocks), swb-free 14, n-w 0, w-q 0 w-tot 122538
84.6/s), defrag-q 0 defrag-tot 12253404 (284.1/s)
Dec 25 2014 00:48:32 GMT: INFO (drv_ssd): (drv_ssd.c::2359) device /database/request.dat: used 4351214336, contig-free 4615M (36923 wblocks), swb-free 0, n-w 0, w-q 0 w-tot 9788
4.4/s), defrag-q 0 defrag-tot 0 (0.0/s)
Dec 25 2014 00:48:32 GMT: INFO (info): (thr_info.c::4520)  system memory: free 1718276kb ( 1 percent free )
Dec 25 2014 00:48:32 GMT: INFO (info): (thr_info.c::4527)  migrates in progress ( 0 , 0 ) ::: ClusterSize 3 ::: objects 84061630
Dec 25 2014 00:48:32 GMT: INFO (info): (thr_info.c::4535)  rec refs 84077976 ::: rec locks 1 ::: trees 0 ::: wr reqs 15 ::: mig tx 0 ::: mig rx 0
Dec 25 2014 00:48:32 GMT: INFO (info): (thr_info.c::4541)  replica errs :: null 0 non-null 0 ::: sync copy errs :: node 0 :: master 0
Dec 25 2014 00:48:32 GMT: INFO (info): (thr_info.c::4551)    trans_in_progress: wr 3 prox 0 wait 41 ::: q 0 ::: bq 0 ::: iq 0 ::: dq 0 : fds - proto (465, 1704279, 1703814) : hb
 25545, 25546) : fab (44, 44, 0)
Dec 25 2014 00:48:32 GMT: INFO (info): (thr_info.c::4553)    heartbeat_received: self 0 : foreign 1144061
Dec 25 2014 00:48:32 GMT: INFO (info): (thr_info.c::4554)    heartbeat_stats: bt 0 bf 584686 nt 0 ni 0 nn 0 nnir 0 nal 0 sf1 0 sf2 0 sf3 0 sf4 0 sf5 0 sf6 8 mrf 0 eh 0 efd 0 efa
 0 mcf 25542 rc 25538
Dec 25 2014 00:48:32 GMT: INFO (info): (thr_info.c::4567)    tree_counts: nsup 1 scan 0 batch 0 dup 0 wprocess 0 migrx 0 migtx 0 ssdr 0 ssdw 0 rw 19
Dec 25 2014 00:48:32 GMT: INFO (info): (thr_info.c::4583) namespace user: disk inuse: 9572447232 memory inuse: 5744852667 (bytes) sindex memory inuse: 0 (bytes) avail pct 66
Dec 25 2014 00:48:32 GMT: INFO (info): (thr_info.c::4583) namespace bidder: disk inuse: 278400 memory inuse: 123614 (bytes) sindex memory inuse: 0 (bytes) avail pct 99
Dec 25 2014 00:48:32 GMT: INFO (info): (thr_info.c::4583) namespace request: disk inuse: 4351333888 memory inuse: 1857866285 (bytes) sindex memory inuse: 0 (bytes) avail pct 3
Dec 25 2014 00:48:32 GMT: INFO (info): (thr_info.c::4608)    partitions: actual 4005 sync 8283 desync 0 zombie 0 wait 0 absent 0
Dec 25 2014 00:48:32 GMT: INFO (info): (hist.c::137) histogram dump: reads (2846356598 total) msec
Dec 25 2014 00:48:32 GMT: INFO (info): (hist.c::154)  (00: 2844867428) (01: 0001054696) (02: 0000310701) (03: 0000111338)
Dec 25 2014 00:48:32 GMT: INFO (info): (hist.c::154)  (04: 0000012341) (05: 0000000044) (06: 0000000041) (07: 0000000006)
Dec 25 2014 00:48:32 GMT: INFO (info): (hist.c::163)  (08: 0000000003)
Dec 25 2014 00:48:32 GMT: INFO (info): (hist.c::137) histogram dump: writes_master (470997538 total) msec
Dec 25 2014 00:48:32 GMT: INFO (info): (hist.c::154)  (00: 0328892427) (01: 0021894184) (02: 0028367000) (03: 0035352149)
Dec 25 2014 00:48:32 GMT: INFO (info): (hist.c::154)  (04: 0034051955) (05: 0018577656) (06: 0003704961) (07: 0000156343)
Dec 25 2014 00:48:32 GMT: INFO (info): (hist.c::163)  (08: 0000000863)
Dec 25 2014 00:48:32 GMT: INFO (info): (hist.c::137) histogram dump: proxy (0 total) msec
Dec 25 2014 00:48:32 GMT: INFO (info): (hist.c::137) histogram dump: writes_reply (470997538 total) msec
Dec 25 2014 00:48:32 GMT: INFO (info): (hist.c::154)  (00: 0328900854) (01: 0021890959) (02: 0028365332) (03: 0035350626)
Dec 25 2014 00:48:32 GMT: INFO (info): (hist.c::154)  (04: 0034050697) (05: 0018577030) (06: 0003704836) (07: 0000156341)
Dec 25 2014 00:48:32 GMT: INFO (info): (hist.c::163)  (08: 0000000863)
Dec 25 2014 00:48:32 GMT: INFO (info): (hist.c::137) histogram dump: udf (0 total) msec
Dec 25 2014 00:48:32 GMT: INFO (info): (hist.c::137) histogram dump: query (0 total) msec
Dec 25 2014 00:48:32 GMT: INFO (info): (hist.c::137) histogram dump: query_rec_count (0 total) count
Dec 25 2014 00:48:32 GMT: WARNING (drv_ssd): (drv_ssd.c::252) device /database/request.dat - swb buf valloc failed
Dec 25 2014 00:48:32 GMT: WARNING (drv_ssd): (drv_ssd.c::1817) write bins: couldn't get swb

What exactly does this mean?

Server Version is 3.3.26 Node Config: http://dpaste.com/1TNDEST

Thx in advance.


#2

hi,

When you say a node crash - you mean one of the server processes crashed out sending a segfault? If so, can you please give the OS, version and the stack trace.

This failure could be related to write failures. We are checking into it.


#3

Hi,

Aerospike have been stop working without any segfault just like the log above.

OS: Ubuntu utopic

Kernel: Linux db02 3.10.23-xxxx-std-ipv6-64 #1 SMP Tue Mar 18 14:48:24 CE T 2014 x86_64 x86_64 x86_64 GNU/Linux

Thanks.


#4

Quick update on that,

our system is a 3 node setup. as you can see on the config above ns request have allocated file space of 120GB, it turned out that node2 run the file space competently full while node 1/3 are fine (used 7.3GB only). I just wonder what have cause the problem. I check all 3 logs there is nothing about segfault or warning. What we except is there went something wrong with the defragmentation. Because (see screenshot [1]) it looks like it wasn’t properly cleanup. The hardware for all 3 machines is identical.

[1]

thx in advance.

//edit

It looks very similar to this thread here;


#5

hi,

Can you please check your dmesg to see if there was any oom-killer involved. Is the node running only aerospike process or is it running something else as well?


#6

HI,

There is aerospike online running.

Indeed there was something about that:

[124667.595883] active_anon:31529161 inactive_anon:1060138 isolated_anon:32
 active_file:26 inactive_file:0 isolated_file:25
 unevictable:0 dirty:0 writeback:0 unstable:0
 free:142230 slab_reclaimable:3736 slab_unreclaimable:12096
 mapped:0 shmem:2 pagetables:67216 bounce:0
 free_cma:0
[124667.595884] Node 0 DMA free:15892kB min:4kB low:4kB high:4kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15976kB managed:15892kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[124667.595886] lowmem_reserve[]: 0 1978 128953 128953
[124667.595888] Node 0 DMA32 free:508164kB min:704kB low:880kB high:1056kB active_anon:981456kB inactive_anon:530988kB active_file:8kB inactive_file:8kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2046724kB managed:2026268kB mlocked:0kB dirty:0kB writeback:0kB mapped:4kB shmem:4kB slab_reclaimable:588kB slab_unreclaimable:1748kB kernel_stack:72kB pagetables:3028kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:532 all_unreclaimable? yes
[124667.595890] lowmem_reserve[]: 0 0 126974 126974
[124667.595892] Node 0 Normal free:44864kB min:45252kB low:56564kB high:67876kB active_anon:125135188kB inactive_anon:3709564kB active_file:96kB inactive_file:0kB unevictable:0kB isolated(anon):128kB isolated(file):100kB present:132120576kB managed:130021844kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:4kB slab_reclaimable:14356kB slab_unreclaimable:46636kB kernel_stack:2656kB pagetables:265836kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:6629 all_unreclaimable? yes
[124667.595894] lowmem_reserve[]: 0 0 0 0
[124667.595895] Node 0 DMA: 1*4kB (U) 0*8kB 1*16kB (U) 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (U) 3*4096kB (MR) = 15892kB
[124667.595901] Node 0 DMA32: 58*4kB (UEM) 62*8kB (UEM) 39*16kB (UEM) 30*32kB (UEM) 32*64kB (UE) 34*128kB (UE) 21*256kB (UE) 19*512kB (UEM) 11*1024kB (UE) 5*2048kB (EM) 113*4096kB (MR) = 508168kB
[124667.595907] Node 0 Normal: 9138*4kB (UE) 0*8kB 1*16kB (R) 0*32kB 4*64kB (R) 1*128kB (R) 2*256kB (R) 1*512kB (R) 0*1024kB 1*2048kB (R) 1*4096kB (R) = 44120kB
[124667.595913] 8692 total pagecache pages
[124667.595913] 8560 pages in swap cache
[124667.595914] Swap cache stats: add 3382991, delete 3374431, find 408262/563627
[124667.595915] Free swap  = 0kB
[124667.595915] Total swap = 4094972kB
[124667.804407] 33554431 pages RAM
[124667.804408] 538190 pages reserved
[124667.804408] 3182859 pages shared
[124667.804409] 32871072 pages non-shared
[124667.804410] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[124667.804419] [ 2377]     0  2377     5950        2      15       86             0 cgmanager
[124667.804420] [ 2402]     0  2402     5921        0      16       70             0 cgproxy
[124667.804421] [ 2444]     0  2444     6277        0      15      349             0 upstart-udev-br
[124667.804423] [ 2452]     0  2452     8789        2      22      214         -1000 systemd-udevd
[124667.804424] [ 4920]   101  4920    10992        2      24      107             0 dbus-daemon
[124667.804426] [ 4934]   100  4934    63963       50      26      915             0 rsyslogd
[124667.804427] [ 5040]     0  5040     3853        0      11       85             0 upstart-socket-
[124667.804428] [ 5043]     0  5043     4165        0      12      388             0 upstart-file-br
[124667.804430] [ 5319]     0  5319     4075        1      12       38             0 getty
[124667.804431] [ 5323]     0  5323     4075        1      12       38             0 getty
[124667.804433] [ 5329]     0  5329     4075        1      12       38             0 getty
[124667.804434] [ 5330]     0  5330     4075        1      12       38             0 getty
[124667.804435] [ 5334]     0  5334     4075        1      13       38             0 getty
[124667.804436] [ 5336]     0  5336     4786        3      12       40             0 atd
[124667.804438] [ 5337]     0  5337     6505       22      16       45             0 cron
[124667.804439] [ 5354]     0  5354    14892        2      31      171         -1000 sshd
[124667.804440] [ 5369]     0  5369     4863       41      14       33             0 irqbalance
[124667.804442] [ 5423]     0  5423     3181        1      11       32             0 uuidd
[124667.804443] [ 5461]   103  5461   245573       12      75     9681             0 named
[124667.804445] [ 5518]     0  5518     1090        0       7       36             0 acpid
[124667.804446] [ 5520]     0  5520     3329        4      10       33             0 mdadm
[124667.804447] [ 5580]     0  5580     4075        1      12       38             0 getty
[124667.804449] [ 5665]     0  5665     9188        1      22       76             0 systemd-logind
[124667.804452] [19500]     0 19500 34918819 32580012   66786  1010817             0 asd
[124667.804453] Out of memory: Kill process 19500 (asd) score 958 or sacrifice child
[124667.804520] Killed process 19500 (asd) total-vm:139675276kB, anon-rss:130320048kB, file-rss:0kB
[138841.869533] systemd-logind[5665]: New session c6 of user root.

I wonder how that was happens …?


#7

hi,

This means that the machine ran out of memory. How much is the total memory installed on the machine? How much memory is configured for aerospike? Is there any other process running on the machine?

You should allow for some memory usage by OS and other processes while allowing some overhead memory free at the time of allocating memory for aerospike.

http://www.aerospike.com/docs/operations/troubleshoot/node/

http://www.aerospike.com/docs/operations/plan/capacity/#memory-required


#8

Thank you, I solved this now.