@pgupta
Please find answers inline.
how many nodes?
8
Per node RAM?
96 GB
Is 90 million the master record count?
yes
We only set high-water-disk-pct to 80, so stop-writes should still be at its default.
Look through the logs and see if you can spot any reason why you OOM’d
From dmesg:
[Mon Jun 27 22:11:12 2022] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
[Mon Jun 27 22:11:12 2022] [10910] 0 10910 1201522 26742 152 8 0 0 java
[Mon Jun 27 22:11:12 2022] [15621] 0 15621 163807 69233 302 6 0 0 amc
[Mon Jun 27 22:11:12 2022] [27021] 999 27021 11988 2351 29 3 0 0 telemetry.py
[Mon Jun 27 22:11:12 2022] [27024] 0 27024 31509559 24094610 60202 124 0 0 asd
[Mon Jun 27 22:11:12 2022] Out of memory: Kill process 27024 (asd) score 946 or sacrifice child
[Mon Jun 27 22:11:12 2022] Killed process 27024 (asd) total-vm:126038236kB, anon-rss:96378440kB, file-rss:0kB, shmem-rss:0kB
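For anyone reading the table above: the total_vm and rss columns are page counts, not kB. Below is a quick sanity check of the asd row against the "Killed process" line. It assumes 4 KiB pages (my assumption, not something the log states, but it is the value that makes the two lines agree).

```python
# Convert the OOM-killer table columns (in pages) to kB and GiB.
# Assumption: 4 KiB page size, the usual default on x86-64 Linux.
PAGE_SIZE_KB = 4


def pages_to_kb(pages: int) -> int:
    """Convert a page count from the OOM table into kB."""
    return pages * PAGE_SIZE_KB


# Values copied from the asd row (pid 27024) in the dmesg output.
asd_total_vm_pages = 31509559
asd_rss_pages = 24094610

asd_total_vm_kb = pages_to_kb(asd_total_vm_pages)
asd_rss_kb = pages_to_kb(asd_rss_pages)
asd_rss_gib = asd_rss_kb / (1024 ** 2)

print(f"asd total-vm: {asd_total_vm_kb} kB")
print(f"asd rss:      {asd_rss_kb} kB ({asd_rss_gib:.2f} GiB)")
```

Both results match the total-vm and anon-rss figures on the "Killed process" line, so asd alone was holding roughly 92 GiB resident when it was killed.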
Plus, incoming client writes would have hit stop-writes at 90% of 90 GB before they could get you OOM’d, but stop-writes does not block replica, defrag, or migration writes.
It never reached 90%. I am attaching the logs from just before the process died. The other 3 GB namespace is empty.
Jun 27 2022 16:45:52 GMT: INFO (info): (ticker.c:635) {ourdb} batch-sub: tsvc (0,0) proxy (36,0,12) read (2992510013,0,0,3020503)
Jun 27 2022 16:45:52 GMT: INFO (info): (ticker.c:662) {ourdb} scan: basic (90214,1620,0) aggr (0,0,0) udf-bg (0,0,0)
Jun 27 2022 16:45:52 GMT: INFO (info): (ticker.c:686) {ourdb} query: basic (55282178096,12165) aggr (0,0) udf-bg (0,0)
Jun 27 2022 16:45:52 GMT: INFO (info): (ticker.c:750) {ourdb} retransmits: migration 12208 client-read 0 client-write (0,60) client-delete (0,0) client-udf (0,0) batch-sub 0 udf-sub (0,0)
Jun 27 2022 16:45:52 GMT: INFO (info): (ticker.c:785) {ourdb} special-errors: key-busy 63 record-too-big 17918
Jun 27 2022 16:45:52 GMT: INFO (info): (hist.c:240) histogram dump: {ourdb}-read (40242626222 total) msec
Jun 27 2022 16:45:52 GMT: INFO (info): (hist.c:240) histogram dump: {ourdb}-write (8614964169 total) msec
Jun 27 2022 16:45:52 GMT: INFO (info): (hist.c:240) histogram dump: {ourdb}-query (55282190261 total) msec
Jun 27 2022 16:45:52 GMT: INFO (info): (hist.c:240) histogram dump: {ourdb}-query-rec-count (10195924739 total) count
Jun 27 2022 16:45:52 GMT: INFO (drv_ssd): (drv_ssd.c:2185) {ourdb} /dev/vdb: used-bytes 60448233088 free-wblocks 2008439 write-q 0 write (548457506,18.5) defrag-q 0 defrag-read (548368576,2.7) defrag-write (265090565,1.3)
Jun 27 2022 16:46:02 GMT: INFO (info): (ticker.c:389) {ourdb} objects: all 34125984 master 13310128 prole 20815860 non-replica 0
Jun 27 2022 16:46:02 GMT: INFO (info): (ticker.c:444) {ourdb} migrations: remaining (256,215,512) active (1,1,0) complete-pct 5.42
Jun 27 2022 16:46:02 GMT: INFO (info): (ticker.c:465) {ourdb} memory-usage: total-bytes 63183673505 index-bytes 2184062976 sindex-bytes 2204445346 data-bytes 58795165183 used-pct 65.38
Jun 27 2022 16:46:02 GMT: INFO (info): (ticker.c:517) {ourdb} device-usage: used-bytes 60702210544 avail-pct 95
Jun 27 2022 16:46:02 GMT: INFO (info): (ticker.c:585) {ourdb} client: tsvc (0,3) proxy (555,0,32) read (39500829388,0,0,741801912) write (8614945735,19783,614) delete (12454423,0,1,17672660) udf (0,0,0) lang (0,0,0,0)
Jun 27 2022 16:46:02 GMT: INFO (info): (ticker.c:635) {ourdb} batch-sub: tsvc (0,0) proxy (36,0,12) read (2992510767,0,0,3020503)
Jun 27 2022 16:46:02 GMT: INFO (info): (ticker.c:662) {ourdb} scan: basic (90214,1620,0) aggr (0,0,0) udf-bg (0,0,0)
Jun 27 2022 16:46:02 GMT: INFO (info): (ticker.c:686) {ourdb} query: basic (55282186649,12165) aggr (0,0) udf-bg (0,0)
Jun 27 2022 16:46:02 GMT: INFO (info): (ticker.c:750) {ourdb} retransmits: migration 12208 client-read 0 client-write (0,60) client-delete (0,0) client-udf (0,0) batch-sub 0 udf-sub (0,0)
Jun 27 2022 16:46:02 GMT: INFO (info): (ticker.c:785) {ourdb} special-errors: key-busy 63 record-too-big 17918
Jun 27 2022 16:46:02 GMT: INFO (info): (hist.c:240) histogram dump: {ourdb}-read (40242631300 total) msec
Jun 27 2022 16:46:02 GMT: INFO (info): (hist.c:240) histogram dump: {ourdb}-write (8614965518 total) msec
Jun 27 2022 16:46:02 GMT: INFO (info): (hist.c:240) histogram dump: {ourdb}-query (55282198814 total) msec
Jun 27 2022 16:46:02 GMT: INFO (info): (hist.c:240) histogram dump: {ourdb}-query-rec-count (10195926482 total) count
Jun 27 2022 16:46:12 GMT: INFO (info): (ticker.c:389) {ourdb} objects: all 34264857 master 13310151 prole 20954707 non-replica 0
Jun 27 2022 16:46:12 GMT: INFO (info): (ticker.c:444) {ourdb} migrations: remaining (250,208,500) active (1,1,0) complete-pct 8.03
Jun 27 2022 16:46:12 GMT: INFO (info): (ticker.c:465) {ourdb} memory-usage: total-bytes 63439306974 index-bytes 2192950848 sindex-bytes 2209511876 data-bytes 59036844250 used-pct 65.65
Jun 27 2022 16:46:12 GMT: INFO (info): (ticker.c:517) {ourdb} device-usage: used-bytes 60951649392 avail-pct 95
Jun 27 2022 16:46:12 GMT: INFO (info): (ticker.c:585) {ourdb} client: tsvc (0,3) proxy (555,0,32) read (39500831663,0,0,741802051) write (8614946557,19783,614) delete (12454424,0,1,17672660) udf (0,0,0) lang (0,0,0,0)
Jun 27 2022 16:46:12 GMT: INFO (info): (ticker.c:635) {ourdb} batch-sub: tsvc (0,0) proxy (36,0,12) read (2992511463,0,0,3020503)
Jun 27 2022 16:46:12 GMT: INFO (info): (ticker.c:662) {ourdb} scan: basic (90214,1620,0) aggr (0,0,0) udf-bg (0,0,0)
Jun 27 2022 16:46:12 GMT: INFO (info): (ticker.c:686) {ourdb} query: basic (55282192374,12165) aggr (0,0) udf-bg (0,0)
Jun 27 2022 16:46:12 GMT: INFO (info): (ticker.c:750) {ourdb} retransmits: migration 12208 client-read 0 client-write (0,60) client-delete (0,0) client-udf (0,0) batch-sub 0 udf-sub (0,0)
Jun 27 2022 16:46:12 GMT: INFO (info): (ticker.c:785) {ourdb} special-errors: key-busy 63 record-too-big 17918
Jun 27 2022 16:46:12 GMT: INFO (info): (hist.c:240) histogram dump: {ourdb}-read (40242633714 total) msec
Jun 27 2022 16:46:12 GMT: INFO (info): (hist.c:240) histogram dump: {ourdb}-write (8614966340 total) msec
Jun 27 2022 16:46:12 GMT: INFO (info): (hist.c:240) histogram dump: {ourdb}-query (55282204539 total) msec
Jun 27 2022 16:46:12 GMT: INFO (info): (hist.c:240) histogram dump: {ourdb}-query-rec-count (10195927460 total) count
Jun 27 2022 16:46:12 GMT: INFO (drv_ssd): (drv_ssd.c:2185) {ourdb} /dev/vdb: used-bytes 60958933328 free-wblocks 2007948 write-q 0 write (548458161,32.8) defrag-q 0 defrag-read (548368742,8.3) defrag-write (265090646,4.1)
Jun 27 2022 16:46:22 GMT: INFO (info): (ticker.c:389) {ourdb} objects: all 34386265 master 13310337 prole 21075928 non-replica 0
Jun 27 2022 16:46:22 GMT: INFO (info): (ticker.c:444) {ourdb} migrations: remaining (246,204,492) active (0,0,0) complete-pct 9.64
Jun 27 2022 16:46:22 GMT: INFO (info): (ticker.c:465) {ourdb} memory-usage: total-bytes 63669232448 index-bytes 2200720960 sindex-bytes 2213696484 data-bytes 59254815004 used-pct 65.89
Jun 27 2022 16:46:22 GMT: INFO (info): (ticker.c:517) {ourdb} device-usage: used-bytes 61176408480 avail-pct 95
Jun 27 2022 16:46:22 GMT: INFO (info): (ticker.c:585) {ourdb} client: tsvc (0,3) proxy (555,0,32) read (39500840040,0,0,741802370) write (8614948785,19783,614) delete (12454426,0,1,17672660) udf (0,0,0) lang (0,0,0,0)
Jun 27 2022 16:46:22 GMT: INFO (info): (ticker.c:635) {ourdb} batch-sub: tsvc (0,0) proxy (36,0,12) read (2992512736,0,0,3020503)
Jun 27 2022 16:46:22 GMT: INFO (info): (ticker.c:662) {ourdb} scan: basic (90214,1620,0) aggr (0,0,0) udf-bg (0,0,0)
Jun 27 2022 16:46:22 GMT: INFO (info): (ticker.c:686) {ourdb} query: basic (55282207925,12165) aggr (0,0) udf-bg (0,0)
Jun 27 2022 16:46:22 GMT: INFO (info): (ticker.c:750) {ourdb} retransmits: migration 12208 client-read 0 client-write (0,60) client-delete (0,0) client-udf (0,0) batch-sub 0 udf-sub (0,0)
Jun 27 2022 16:46:22 GMT: INFO (info): (ticker.c:785) {ourdb} special-errors: key-busy 63 record-too-big 17918
Jun 27 2022 16:46:22 GMT: INFO (info): (hist.c:240) histogram dump: {ourdb}-read (40242642410 total) msec
Jun 27 2022 16:46:22 GMT: INFO (info): (hist.c:240) histogram dump: {ourdb}-write (8614968568 total) msec
Jun 27 2022 16:46:22 GMT: INFO (info): (hist.c:240) histogram dump: {ourdb}-query (55282220090 total) msec
Jun 27 2022 16:46:22 GMT: INFO (info): (hist.c:240) histogram dump: {ourdb}-query-rec-count (10195929923 total) count
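To make the memory-usage lines above easier to compare, here is a small sketch that parses them, checks that index + sindex + data adds up to total-bytes, and backs out the configured memory size from used-pct. The regex and the helper name `parse_memory_usage` are my own for illustration, and the "implied size" is only an estimate derived from the rounded percentage.

```python
import re

# The three memory-usage ticker lines, copied verbatim from the log above.
LOG_LINES = [
    "Jun 27 2022 16:46:02 GMT: INFO (info): (ticker.c:465) {ourdb} memory-usage: total-bytes 63183673505 index-bytes 2184062976 sindex-bytes 2204445346 data-bytes 58795165183 used-pct 65.38",
    "Jun 27 2022 16:46:12 GMT: INFO (info): (ticker.c:465) {ourdb} memory-usage: total-bytes 63439306974 index-bytes 2192950848 sindex-bytes 2209511876 data-bytes 59036844250 used-pct 65.65",
    "Jun 27 2022 16:46:22 GMT: INFO (info): (ticker.c:465) {ourdb} memory-usage: total-bytes 63669232448 index-bytes 2200720960 sindex-bytes 2213696484 data-bytes 59254815004 used-pct 65.89",
]

PATTERN = re.compile(
    r"memory-usage: total-bytes (\d+) index-bytes (\d+) "
    r"sindex-bytes (\d+) data-bytes (\d+) used-pct ([\d.]+)"
)


def parse_memory_usage(line):
    """Return the memory-usage fields of one ticker line, or None if absent."""
    match = PATTERN.search(line)
    if match is None:
        return None
    total, index, sindex, data = (int(g) for g in match.groups()[:4])
    return {
        "total": total,
        "index": index,
        "sindex": sindex,
        "data": data,
        "used_pct": float(match.group(5)),
    }


samples = [s for s in map(parse_memory_usage, LOG_LINES) if s is not None]

for s in samples:
    parts_sum = s["index"] + s["sindex"] + s["data"]
    # Estimate of the configured memory size, from total / used fraction.
    implied_gib = s["total"] / (s["used_pct"] / 100) / 1024 ** 3
    print(
        f"total={s['total']} parts_match={parts_sum == s['total']} "
        f"implied_size={implied_gib:.2f} GiB"
    )

# Growth in accounted namespace memory between the first and last sample.
growth_bytes = samples[-1]["total"] - samples[0]["total"]
print(f"growth over the sampled window: {growth_bytes} bytes")
```

On these three lines the components sum exactly to total-bytes, the implied memory size comes out at about 90 GiB, and accounted memory grows by 485558943 bytes in 20 seconds while migrations are running. The accounted total (about 63.7 GB) is well below the anon-rss of 96378440 kB that the OOM killer reported for asd, although the two readings may not be from the same moment.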