OOM killed. How to file a bug properly


#1

Hi, I’m facing a memory leak with the latest Aerospike (3.11.0.1). The last log lines about memory are:

Jan 12 2017 07:18:45 GMT: INFO (info): (ticker.c:428) {primary} memory-usage: total-bytes 2775205050 index-bytes 1113077632 sindex-bytes 1804204 data-bytes 1660323214 used-pct 11.24
Jan 12 2017 07:18:45 GMT: INFO (info): (ticker.c:458) {primary} device-usage: used-bytes 4769849856 avail-pct 94
Jan 12 2017 07:18:45 GMT: INFO (info): (ticker.c:551) {primary} client: tsvc (0,7139) proxy (3345,0,0) read (1232971977,0,0,313365859) write (2808217,15,41) delete (736,0,0,50223) udf (0,0,0) lang (0,0,0,0)
Jan 12 2017 07:18:45 GMT: INFO (info): (ticker.c:632) {primary} query: basic (13292712,1319) aggr (0,0) udf-bg (0,0)

and then in dmesg:

[Thu Jan 12 07:18:46 2017 <73245.073014>] asd invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
[Thu Jan 12 07:18:46 2017 <    0.000005>] asd cpuset=/ mems_allowed=0
[Thu Jan 12 07:18:46 2017 <    0.000004>] CPU: 3 PID: 10187 Comm: asd Tainted: G           OX 3.13.0-107-generic #154-Ubuntu
[Thu Jan 12 07:18:46 2017 <    0.000001>] Hardware name: Xen HVM domU, BIOS 4.2.amazon 11/11/2016
[Thu Jan 12 07:18:46 2017 <    0.000002>]  0000000000000000 ffff8807857db970 ffffffff8172d229 ffff880784463000
[Thu Jan 12 07:18:46 2017 <    0.000004>]  0000000000000000 ffff8807857db9f8 ffffffff817277c8 0000000000000000
[Thu Jan 12 07:18:46 2017 <    0.000003>]  0000000000000000 0000000000000012 0000000000000001 ffffffff81c3fa60
[Thu Jan 12 07:18:46 2017 <    0.000004>] Call Trace:
[Thu Jan 12 07:18:46 2017 <    0.000008>]  [<ffffffff8172d229>] dump_stack+0x64/0x82
[Thu Jan 12 07:18:46 2017 <    0.000003>]  [<ffffffff817277c8>] dump_header+0x7f/0x1f1
[Thu Jan 12 07:18:46 2017 <    0.000005>]  [<ffffffff81156d81>] oom_kill_process+0x201/0x360
[Thu Jan 12 07:18:46 2017 <    0.000005>]  [<ffffffff812dde55>] ? security_capable_noaudit+0x15/0x20
[Thu Jan 12 07:18:46 2017 <    0.000003>]  [<ffffffff81157511>] out_of_memory+0x471/0x4b0
[Thu Jan 12 07:18:46 2017 <    0.000004>]  [<ffffffff8115d82c>] __alloc_pages_nodemask+0xa6c/0xb90
[Thu Jan 12 07:18:46 2017 <    0.000003>]  [<ffffffff8119bf93>] alloc_pages_current+0xa3/0x160
[Thu Jan 12 07:18:46 2017 <    0.000003>]  [<ffffffff81153837>] __page_cache_alloc+0x97/0xc0
[Thu Jan 12 07:18:46 2017 <    0.000002>]  [<ffffffff811552a5>] filemap_fault+0x185/0x410
[Thu Jan 12 07:18:46 2017 <    0.000003>]  [<ffffffff8117a4bf>] __do_fault+0x6f/0x530
[Thu Jan 12 07:18:46 2017 <    0.000003>]  [<ffffffff8117e440>] handle_mm_fault+0x5b0/0xf10
[Thu Jan 12 07:18:46 2017 <    0.000002>]  [<ffffffff811531f0>] ? sleep_on_page+0x20/0x20
[Thu Jan 12 07:18:46 2017 <    0.000003>]  [<ffffffff817393c4>] __do_page_fault+0x184/0x560
[Thu Jan 12 07:18:46 2017 <    0.000003>]  [<ffffffff81616bc7>] ? SYSC_recvfrom+0x127/0x160
[Thu Jan 12 07:18:46 2017 <    0.000004>]  [<ffffffff8120be90>] ? SyS_epoll_ctl+0xf0/0x9f0
[Thu Jan 12 07:18:46 2017 <    0.000002>]  [<ffffffff817397ba>] do_page_fault+0x1a/0x70
[Thu Jan 12 07:18:46 2017 <    0.000003>]  [<ffffffff81735ae8>] page_fault+0x28/0x30
[Thu Jan 12 07:18:46 2017 <    0.000001>] Mem-Info:
[Thu Jan 12 07:18:46 2017 <    0.000002>] Node 0 DMA per-cpu:
[Thu Jan 12 07:18:46 2017 <    0.000002>] CPU    0: hi:    0, btch:   1 usd:   0
[Thu Jan 12 07:18:46 2017 <    0.000001>] CPU    1: hi:    0, btch:   1 usd:   0
[Thu Jan 12 07:18:46 2017 <    0.000001>] CPU    2: hi:    0, btch:   1 usd:   0
[Thu Jan 12 07:18:46 2017 <    0.000001>] CPU    3: hi:    0, btch:   1 usd:   0
[Thu Jan 12 07:18:46 2017 <    0.000000>] Node 0 DMA32 per-cpu:
[Thu Jan 12 07:18:46 2017 <    0.000002>] CPU    0: hi:  186, btch:  31 usd:  30
[Thu Jan 12 07:18:46 2017 <    0.000001>] CPU    1: hi:  186, btch:  31 usd:  97
[Thu Jan 12 07:18:46 2017 <    0.000001>] CPU    2: hi:  186, btch:  31 usd:  19
[Thu Jan 12 07:18:46 2017 <    0.000001>] CPU    3: hi:  186, btch:  31 usd:   8
[Thu Jan 12 07:18:46 2017 <    0.000001>] Node 0 Normal per-cpu:
[Thu Jan 12 07:18:46 2017 <    0.000001>] CPU    0: hi:  186, btch:  31 usd:   0
[Thu Jan 12 07:18:46 2017 <    0.000001>] CPU    1: hi:  186, btch:  31 usd:   0
[Thu Jan 12 07:18:46 2017 <    0.000001>] CPU    2: hi:  186, btch:  31 usd:   0
[Thu Jan 12 07:18:46 2017 <    0.000001>] CPU    3: hi:  186, btch:  31 usd:   0
[Thu Jan 12 07:18:46 2017 <    0.000003>] active_anon:7670523 inactive_anon:74 isolated_anon:0
[Thu Jan 12 07:18:46 2017 <    0.000000>]  active_file:113 inactive_file:0 isolated_file:0
[Thu Jan 12 07:18:46 2017 <    0.000000>]  unevictable:1165 dirty:5 writeback:0 unstable:0
[Thu Jan 12 07:18:46 2017 <    0.000000>]  free:40535 slab_reclaimable:8114 slab_unreclaimable:17388
[Thu Jan 12 07:18:46 2017 <    0.000000>]  mapped:870 shmem:2255 pagetables:16505 bounce:0
[Thu Jan 12 07:18:46 2017 <    0.000000>]  free_cma:0
[Thu Jan 12 07:18:46 2017 <    0.000002>] Node 0 DMA free:15904kB min:32kB low:40kB high:48kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[Thu Jan 12 07:18:46 2017 <    0.000003>] lowmem_reserve[]: 0 3744 30661 30661
[Thu Jan 12 07:18:46 2017 <    0.000003>] Node 0 DMA32 free:115872kB min:8248kB low:10308kB high:12372kB active_anon:3681588kB inactive_anon:8kB active_file:160kB inactive_file:0kB unevictable:276kB isolated(anon):0kB isolated(file):0kB present:3915776kB managed:3836704kB mlocked:276kB dirty:0kB writeback:0kB mapped:204kB shmem:1148kB slab_reclaimable:3300kB slab_unreclaimable:7128kB kernel_stack:272kB pagetables:7048kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:1266 all_unreclaimable? yes
[Thu Jan 12 07:18:46 2017 <    0.000003>] lowmem_reserve[]: 0 0 26917 26917
[Thu Jan 12 07:18:46 2017 <    0.000002>] Node 0 Normal free:30364kB min:59300kB low:74124kB high:88948kB active_anon:27000504kB inactive_anon:288kB active_file:292kB inactive_file:0kB unevictable:4384kB isolated(anon):0kB isolated(file):0kB present:28067840kB managed:27563592kB mlocked:4384kB dirty:20kB writeback:0kB mapped:3276kB shmem:7872kB slab_reclaimable:29156kB slab_unreclaimable:62424kB kernel_stack:1824kB pagetables:58972kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:528 all_unreclaimable? yes
[Thu Jan 12 07:18:46 2017 <    0.000004>] lowmem_reserve[]: 0 0 0 0
[Thu Jan 12 07:18:46 2017 <    0.000002>] Node 0 DMA: 0*4kB 0*8kB 0*16kB 1*32kB (U) 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (R) 3*4096kB (M) = 15904kB
[Thu Jan 12 07:18:46 2017 <    0.000008>] Node 0 DMA32: 1588*4kB (UEM) 911*8kB (UEM) 405*16kB (UEM) 232*32kB (UEM) 153*64kB (UEM) 96*128kB (UEM) 67*256kB (UEM) 30*512kB (UEM) 29*1024kB (UM) 0*2048kB 1*4096kB (M) = 115928kB
[Thu Jan 12 07:18:46 2017 <    0.000009>] Node 0 Normal: 761*4kB (UEM) 553*8kB (UEM) 309*16kB (UEM) 139*32kB (UEM) 57*64kB (EM) 23*128kB (EM) 28*256kB (UM) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 30620kB
[Thu Jan 12 07:18:46 2017 <    0.000009>] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[Thu Jan 12 07:18:46 2017 <    0.000001>] 3317 total pagecache pages
[Thu Jan 12 07:18:46 2017 <    0.000001>] 0 pages in swap cache
[Thu Jan 12 07:18:46 2017 <    0.000001>] Swap cache stats: add 0, delete 0, find 0/0
[Thu Jan 12 07:18:46 2017 <    0.000001>] Free swap  = 0kB
[Thu Jan 12 07:18:46 2017 <    0.000001>] Total swap = 0kB
[Thu Jan 12 07:18:46 2017 <    0.000001>] 7999901 pages RAM
[Thu Jan 12 07:18:46 2017 <    0.000000>] 0 pages HighMem/MovableOnly
[Thu Jan 12 07:18:46 2017 <    0.000001>] 126062 pages reserved
[Thu Jan 12 07:18:46 2017 <    0.000001>] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[Thu Jan 12 07:18:46 2017 <    0.000004>] [  502]     0   502     5000      214      15        0             0 upstart-udev-br
[Thu Jan 12 07:18:46 2017 <    0.000002>] [  505]     0   505    12392      288      27        0         -1000 systemd-udevd
[Thu Jan 12 07:18:46 2017 <    0.000002>] [  644]     0   644     3814      100      12        0             0 upstart-socket-
[Thu Jan 12 07:18:46 2017 <    0.000002>] [  676]     0   676     2554      645       8        0             0 dhclient
[Thu Jan 12 07:18:46 2017 <    0.000001>] [  820]   102   820     9802      203      24        0             0 dbus-daemon
[Thu Jan 12 07:18:46 2017 <    0.000002>] [  842]     0   842    10861      271      26        0             0 systemd-logind
[Thu Jan 12 07:18:46 2017 <    0.000002>] [  845]   101   845    65114     4967      36        0             0 rsyslogd
[Thu Jan 12 07:18:46 2017 <    0.000001>] [  868]     0   868     3818      112      12        0             0 upstart-file-br
[Thu Jan 12 07:18:46 2017 <    0.000002>] [  947]     0   947     3633      204      12        0             0 getty
[Thu Jan 12 07:18:46 2017 <    0.000002>] [  950]     0   950     3633      206      12        0             0 getty
[Thu Jan 12 07:18:46 2017 <    0.000001>] [  954]     0   954     3633      205      12        0             0 getty
[Thu Jan 12 07:18:46 2017 <    0.000002>] [  956]     0   956     3633      203      12        0             0 getty
[Thu Jan 12 07:18:46 2017 <    0.000001>] [  958]     0   958     3633      203      12        0             0 getty
[Thu Jan 12 07:18:46 2017 <    0.000002>] [ 1002]     0  1002     4783       41      14        0             0 atd
[Thu Jan 12 07:18:46 2017 <    0.000001>] [ 1003]     0  1003     5912      219      17        0             0 cron
[Thu Jan 12 07:18:46 2017 <    0.000002>] [ 1013]     0  1013    15344      355      34        0         -1000 sshd
[Thu Jan 12 07:18:46 2017 <    0.000002>] [ 1038]     0  1038     4821      164      14        0             0 irqbalance
[Thu Jan 12 07:18:46 2017 <    0.000001>] [ 1049]     0  1049     1091      155       8        0             0 acpid
[Thu Jan 12 07:18:46 2017 <    0.000002>] [ 1137]     0  1137     3633      205      12        0             0 getty
[Thu Jan 12 07:18:46 2017 <    0.000001>] [ 1138]     0  1138     3195      189      12        0             0 getty
[Thu Jan 12 07:18:46 2017 <    0.000002>] [ 1276]  1000  1276     7166     1366      17        0             0 tmux
[Thu Jan 12 07:18:46 2017 <    0.000002>] [ 1277]  1000  1277     5291      685      15        0             0 bash
[Thu Jan 12 07:18:46 2017 <    0.000001>] [ 1296]     0  1296    16997      311      36        0             0 sudo
[Thu Jan 12 07:18:46 2017 <    0.000002>] [ 1297]     0  1297     5310      739      14        0             0 bash
[Thu Jan 12 07:18:46 2017 <    0.000002>] [10024]     0 10024  8848258  7649467   15938        0             0 asd
[Thu Jan 12 07:18:46 2017 <    0.000001>] [15399]     0 15399    15353     2612      34        0             0 gunicorn: maste
[Thu Jan 12 07:18:46 2017 <    0.000002>] [15405]     0 15405    19634     5416      42        0             0 gunicorn: worke
[Thu Jan 12 07:18:46 2017 <    0.000001>] [15408]     0 15408     1110       33       6        0             0 amc
[Thu Jan 12 07:18:46 2017 <    0.000002>] [15409]     0 15409     1110       57       6        0             0 amc
[Thu Jan 12 07:18:46 2017 <    0.000002>] [15415]     0 15415     1110      147       7        0             0 sh
[Thu Jan 12 07:18:46 2017 <    0.000001>] [19805]     0 19805     4235     1167      13        0             0 atop
[Thu Jan 12 07:18:46 2017 <    0.000002>] [17378]   999 17378    13695     1932      31        0             0 python
[Thu Jan 12 07:18:46 2017 <    0.000001>] [28982]     0 28982     1478      138       7        0             0 sleep
[Thu Jan 12 07:18:46 2017 <    0.000002>] Out of memory: Kill process 10024 (asd) score 946 or sacrifice child
[Thu Jan 12 07:18:46 2017 <    0.004401>] Killed process 10024 (asd) total-vm:35393032kB, anon-rss:30597868kB, file-rss:0kB

The config is the following:

service {
    user root
    group root
    paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
    paxos-protocol v4
    pidfile /var/run/aerospike/asd.pid
    service-threads 2
    transaction-queues 4
    transaction-threads-per-queue 4
    proto-fd-max 90000
    paxos-max-cluster-size 16
    node-id-interface eth0
}

cluster {
    mode dynamic
    self-group-id 1
}

logging {
    file /var/log/aerospike/aerospike.log {
        context any info
    }
}

network {
    service {
        address 172.31.34.199
        port 3000
        access-address 172.31.34.199
    }

    heartbeat {
        mode mesh
        port 3002
#        mesh-seed-address-port 172.31.37.249 3002
        interval 150
        timeout 40
    }

    fabric {
        address 172.31.34.199
        port 3001
    }

    info {
        address 172.31.34.199
        port 3003
    }
}

namespace primary {
    replication-factor 2
    memory-size 23G
    default-ttl 90d
    stop-writes-pct 90
    high-water-memory-pct 85
    high-water-disk-pct 80
    write-commit-level-override master
    storage-engine device {
        device /dev/xvdf
        defrag-lwm-pct 50
        scheduler-mode noop
        write-block-size 1M
        data-in-memory true
    }
}

It’s an r3.xlarge instance with 30 GB of RAM. How do I properly capture the memory leak in order to help you fix it? It doesn’t happen without the “data-in-memory true” option.


#2

What is top showing, and how does it evolve? There should also be a line in the logs showing the heap memory allocation; tracking how that evolves over time may help reveal any correlation with specific traffic patterns.
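Something like this could pull those heap values out of the log over time for plotting. This is just a rough sketch: the regex is based on the ticker lines quoted in this thread, and the log path in the usage comment is simply what the logging config above points at; the `heap_samples` helper name is illustrative.

```python
import re
from datetime import datetime

# Matches the "system-memory" ticker line, e.g.:
# Jan 12 2017 07:18:45 GMT: INFO (info): (ticker.c:249)    system-memory:
#   free-kbytes 766864 free-pct 2 heap-kbytes (32910590,32964952,33386496) ...
HEAP_RE = re.compile(
    r"^(?P<ts>\w{3} \d{2} \d{4} \d{2}:\d{2}:\d{2}) GMT:.*system-memory:.*"
    r"heap-kbytes \((?P<allocated>\d+),(?P<active>\d+),(?P<mapped>\d+)\)"
)

def heap_samples(lines):
    """Yield (timestamp, allocated_kB, active_kB, mapped_kB) per ticker line."""
    for line in lines:
        m = HEAP_RE.search(line)
        if m:
            ts = datetime.strptime(m.group("ts"), "%b %d %Y %H:%M:%S")
            yield (ts, int(m.group("allocated")), int(m.group("active")),
                   int(m.group("mapped")))

# Usage (path taken from the logging config in #1):
# with open("/var/log/aerospike/aerospike.log") as f:
#     for ts, alloc, active, mapped in heap_samples(f):
#         print(ts.isoformat(), alloc, active, mapped)
```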


#3

I have an atop log (with 10-minute resolution). The memory usage just grows over ~30 minutes until the process is OOM killed. The last heap-related lines are:

Jan 12 2017 07:18:45 GMT: INFO (info): (ticker.c:169) NODE-ID bb90001ac1f22c7 CLUSTER-SIZE 1
Jan 12 2017 07:18:45 GMT: INFO (info): (ticker.c:249)    system-memory: free-kbytes 766864 free-pct 2 heap-kbytes (32910590,32964952,33386496) heap-efficiency-pct 98.6
Jan 12 2017 07:18:45 GMT: INFO (info): (ticker.c:263)    in-progress: tsvc-q 0 info-q 0 nsup-delete-q 0 rw-hash 0 proxy-hash 0 tree-gc-q 0
Jan 12 2017 07:18:45 GMT: INFO (info): (ticker.c:285)    fds: proto (19528,9708039,9688511) heartbeat (0,2,2) fabric (16,124,108)
Jan 12 2017 07:18:45 GMT: INFO (info): (ticker.c:294)    heartbeat-received: self 0 foreign 60023
Jan 12 2017 07:18:45 GMT: INFO (info): (ticker.c:348) {primary} objects: all 17391838 master 17391838 prole 0
Jan 12 2017 07:18:45 GMT: INFO (info): (ticker.c:409) {primary} migrations: complete
Jan 12 2017 07:18:45 GMT: INFO (info): (ticker.c:428) {primary} memory-usage: total-bytes 2775205050 index-bytes 1113077632 sindex-bytes 1804204 data-bytes 1660323214 used-pct 11.24
Jan 12 2017 07:18:45 GMT: INFO (info): (ticker.c:458) {primary} device-usage: used-bytes 4769849856 avail-pct 94
Jan 12 2017 07:18:45 GMT: INFO (info): (ticker.c:551) {primary} client: tsvc (0,7139) proxy (3345,0,0) read (1232971977,0,0,313365859) write (2808217,15,41) delete (736,0,0,50223) udf (0,0,0) lang (0,0,0,0)
Jan 12 2017 07:18:45 GMT: INFO (info): (ticker.c:632) {primary} query: basic (13292712,1319) aggr (0,0) udf-bg (0,0)
Jan 12 2017 07:18:45 GMT: INFO (info): (hist.c:145) histogram dump: {primary}-read (1546337836 total) msec

Some earlier lines:

Jan 12 2017 07:15:15 GMT: INFO (info): (ticker.c:169) NODE-ID bb90001ac1f22c7 CLUSTER-SIZE 1
Jan 12 2017 07:15:15 GMT: INFO (info): (ticker.c:249)    system-memory: free-kbytes 2651508 free-pct 8 heap-kbytes (29302939,29356692,29779456) heap-efficiency-pct 98.4
Jan 12 2017 07:15:15 GMT: INFO (info): (ticker.c:263)    in-progress: tsvc-q 0 info-q 0 nsup-delete-q 0 rw-hash 0 proxy-hash 0 tree-gc-q 0
Jan 12 2017 07:15:15 GMT: INFO (info): (ticker.c:285)    fds: proto (18900,9605824,9586924) heartbeat (0,2,2) fabric (16,124,108)
Jan 12 2017 07:15:15 GMT: INFO (info): (ticker.c:294)    heartbeat-received: self 0 foreign 60023
Jan 12 2017 07:15:15 GMT: INFO (info): (ticker.c:348) {primary} objects: all 17391483 master 17391483 prole 0
Jan 12 2017 07:15:15 GMT: INFO (info): (ticker.c:409) {primary} migrations: complete
Jan 12 2017 07:15:15 GMT: INFO (info): (ticker.c:428) {primary} memory-usage: total-bytes 2770041037 index-bytes 1113054912 sindex-bytes 1804204 data-bytes 1655181921 used-pct 11.22
Jan 12 2017 07:15:15 GMT: INFO (info): (ticker.c:458) {primary} device-usage: used-bytes 4764649472 avail-pct 94
Jan 12 2017 07:15:15 GMT: INFO (info): (ticker.c:551) {primary} client: tsvc (0,7139) proxy (3345,0,0) read (1229754833,0,0,313133435) write (2803717,15,41) delete (736,0,0,50223) udf (0,0,0) lang (0,0,0,0)
Jan 12 2017 07:15:15 GMT: INFO (info): (ticker.c:632) {primary} query: basic (13260484,1319) aggr (0,0) udf-bg (0,0)

As I said, it was growing. I can upload the entire file.
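For what it’s worth, comparing the two snapshots directly gives a growth rate; this is just arithmetic on the heap-kbytes and free-kbytes values quoted above:

```python
# heap-kbytes "allocated" values from the two ticker lines above,
# 210 seconds apart (07:15:15 -> 07:18:45)
heap_kb_t0 = 29302939
heap_kb_t1 = 32910590
interval_s = 210

growth_kb_per_s = (heap_kb_t1 - heap_kb_t0) / interval_s
print(f"heap growth: {growth_kb_per_s / 1024:.1f} MB/s")  # roughly 17 MB/s

# At that rate the remaining free memory (free-kbytes 766864 at 07:18:45)
# would be gone in well under a minute; the OOM killer in fact fired at
# 07:18:46, before free memory reached zero.
free_kb = 766864
print(f"time to exhaustion at that rate: {free_kb / growth_kb_per_s:.0f} s")
```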


#4

It looks like you are doing around 140 secondary index queries per second; that would be the first suspect.

Could you specify what type of queries you are running against what type of secondary index?
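From the two ticker snapshots in #3, the query rate works out roughly like this (assuming, as the ticker format suggests, that the first field of `basic (...)` is a cumulative completed-query counter):

```python
# "query: basic (N,...)" counts from the two ticker snapshots quoted in #3,
# taken 210 seconds apart (07:15:15 -> 07:18:45)
basic_t0 = 13260484
basic_t1 = 13292712
interval_s = 210

qps = (basic_t1 - basic_t0) / interval_s
print(f"~{qps:.0f} secondary index queries/s")
```

which lands in the same ballpark as the ~140/s estimate above.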


#5

I’ll try to find out with our developer; AFAIK we actively use geo queries. I’ll get back to you as soon as I have something. It would be superb if there were a ready-to-use way to capture queries with a libpcap-powered tool, or at least with tcpdump and a truncated snap length, the way we can for HTTP headers, for instance.


#6

Thanks. Geo queries would definitely be a strong candidate for causing this. We don’t have such a tool currently; it would indeed be neat. Yes, please do share more details on the queries, and we can also discuss how to get a sample data set on our side to try to reproduce the issue. Thanks again.


#7

Here you can find the code that creates and reads AS items the way we use them: a generator writes the data, and the reader then reads it back. In our case this leads to heap exhaustion about once a week. I found that it happens the same way with data-in-memory and without it, but with data-in-memory we hit it more frequently. You should probably try with less memory available on the test instance.


#8

Thank you! Passing this along to our developers.