High memory usage on Kubernetes GKE using helm chart

#1

Hello,

I have installed Aerospike Community 4.5.0.5 using the Helm chart available in the stable repository https://github.com/helm/charts/tree/master/stable/aerospike (chart v0.2.3).

I have 3 dedicated n1-highmem-2 nodes (running 6 pod replicas) for Aerospike, with Kubernetes (GKE) version 1.12.5-gke.5 on Container-Optimized OS, and I use the following parameters:

replicaCount: 6

resources:
    requests:
        cpu: 740m
        memory: 5G

namespace ssd {
    replication-factor 2
    memory-size 4G
    default-ttl 0
    high-water-memory-pct 80
    stop-writes-pct 90
    storage-engine device {
        file /opt/aerospike/data/ssd.dat
        filesize 200G
    }
}

To be safe, I keep other deployments on these nodes minimal (using tolerations). On the Kubernetes memory monitoring graphs, memory usage seems to increase by itself (there has been no increase in the number of records since the beginning of the curve).

The memory usage as seen by Aerospike seems OK:

$ asadm -e "info"
Seed:        [('127.0.0.1', 3000, None)]
Config_file: /root/.aerospike/astools.conf, /etc/aerospike/astools.conf
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Network Information (2019-03-25 10:33:17 UTC)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                               Node               Node                 Ip       Build   Cluster   Migrations        Cluster     Cluster         Principal   Client     Uptime
                                                                  .                 Id                  .           .      Size            .            Key   Integrity                 .    Conns          .
aero-aerospike-0.aero-aerospike-mesh.default.svc.cluster.local:3000   BB9068A340A580A    10.52.138.6:3000   C-4.5.0.5         6      0.000     8F4213AA394F   True        BB9078B340A580A      584   67:45:00
aero-aerospike-1.aero-aerospike-mesh.default.svc.cluster.local:3000   BB9068B340A580A    10.52.139.6:3000   C-4.5.0.5         6      0.000     8F4213AA394F   True        BB9078B340A580A      580   67:44:09
aero-aerospike-2.aero-aerospike-mesh.default.svc.cluster.local:3000   BB9058C340A580A    10.52.140.5:3000   C-4.5.0.5         6      0.000     8F4213AA394F   True        BB9078B340A580A      585   67:43:18
aero-aerospike-3.aero-aerospike-mesh.default.svc.cluster.local:3000   *BB9078B340A580A   10.52.139.7:3000   C-4.5.0.5         6      0.000     8F4213AA394F   True        BB9078B340A580A      582   67:42:31
aero-aerospike-4.aero-aerospike-mesh.default.svc.cluster.local:3000   BB9078A340A580A    10.52.138.7:3000   C-4.5.0.5         6      0.000     8F4213AA394F   True        BB9078B340A580A      584   67:41:53
aero-aerospike-5.aero-aerospike-mesh.default.svc.cluster.local:3000   BB9068C340A580A    10.52.140.6:3000   C-4.5.0.5         6      0.000     8F4213AA394F   True        BB9078B340A580A      587   67:41:17
Number of rows: 6

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Namespace Usage Information (2019-03-25 10:33:17 UTC)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Namespace                                                                  Node       Total   Expirations,Evictions     Stop         Disk    Disk     HWM   Avail%         Mem     Mem    HWM      Stop
        .                                                                     .     Records                       .   Writes         Used   Used%   Disk%        .        Used   Used%   Mem%   Writes%
ssd         aero-aerospike-0.aero-aerospike-mesh.default.svc.cluster.local:3000    30.999 M   (0.000,  0.000)         false     31.156 GB   16      50      80        1.848 GB   47      80     90
ssd         aero-aerospike-1.aero-aerospike-mesh.default.svc.cluster.local:3000    31.994 M   (0.000,  0.000)         false     32.153 GB   17      50      79        1.907 GB   48      80     90
ssd         aero-aerospike-2.aero-aerospike-mesh.default.svc.cluster.local:3000    32.264 M   (0.000,  0.000)         false     32.427 GB   17      50      79        1.923 GB   49      80     90
ssd         aero-aerospike-3.aero-aerospike-mesh.default.svc.cluster.local:3000    31.966 M   (0.000,  0.000)         false     32.126 GB   17      50      79        1.905 GB   48      80     90
ssd         aero-aerospike-4.aero-aerospike-mesh.default.svc.cluster.local:3000    31.822 M   (0.000,  0.000)         false     31.981 GB   16      50      79        1.897 GB   48      80     90
ssd         aero-aerospike-5.aero-aerospike-mesh.default.svc.cluster.local:3000    33.645 M   (0.000,  0.000)         false     33.817 GB   17      50      78        2.005 GB   51      80     90
ssd                                                                               192.690 M   (0.000,  0.000)                  193.660 GB                            11.485 GB
Number of rows: 7

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Namespace Object Information (2019-03-25 10:33:17 UTC)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Namespace                                                                  Node       Total     Repl                         Objects                   Tombstones             Pending   Rack
        .                                                                     .     Records   Factor      (Master,Prole,Non-Replica)   (Master,Prole,Non-Replica)            Migrates     ID
        .                                                                     .           .        .                               .                            .             (tx,rx)      .
ssd         aero-aerospike-0.aero-aerospike-mesh.default.svc.cluster.local:3000    30.999 M   2        (15.592 M, 15.407 M, 0.000)     (0.000,  0.000,  0.000)      (0.000,  0.000)     0
ssd         aero-aerospike-1.aero-aerospike-mesh.default.svc.cluster.local:3000    31.994 M   2        (15.482 M, 16.511 M, 0.000)     (0.000,  0.000,  0.000)      (0.000,  0.000)     0
ssd         aero-aerospike-2.aero-aerospike-mesh.default.svc.cluster.local:3000    32.264 M   2        (16.320 M, 15.944 M, 0.000)     (0.000,  0.000,  0.000)      (0.000,  0.000)     0
ssd         aero-aerospike-3.aero-aerospike-mesh.default.svc.cluster.local:3000    31.966 M   2        (15.756 M, 16.210 M, 0.000)     (0.000,  0.000,  0.000)      (0.000,  0.000)     0
ssd         aero-aerospike-4.aero-aerospike-mesh.default.svc.cluster.local:3000    31.822 M   2        (15.825 M, 15.997 M, 0.000)     (0.000,  0.000,  0.000)      (0.000,  0.000)     0
ssd         aero-aerospike-5.aero-aerospike-mesh.default.svc.cluster.local:3000    33.645 M   2        (17.369 M, 16.276 M, 0.000)     (0.000,  0.000,  0.000)      (0.000,  0.000)     0
ssd                                                                               192.690 M            (96.345 M, 96.345 M, 0.000)     (0.000,  0.000,  0.000)      (0.000,  0.000)
Number of rows: 7

Result of distribution command

$ asadm -e "show distribution"
Seed:        [('127.0.0.1', 3000, None)]
Config_file: /root/.aerospike/astools.conf, /etc/aerospike/astools.conf
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ssd - TTL Distribution in Seconds (2019-03-25 10:39:52 UTC)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                        Percentage of records having ttl less than or equal to value measured in Seconds
                                                               Node   10%   20%   30%   40%   50%   60%   70%   80%   90%   100%
aero-aerospike-0.aero-aerospike-mesh.default.svc.cluster.local:3000     0     0     0     0     0     0     0     0     0      0
aero-aerospike-1.aero-aerospike-mesh.default.svc.cluster.local:3000     0     0     0     0     0     0     0     0     0      0
aero-aerospike-2.aero-aerospike-mesh.default.svc.cluster.local:3000     0     0     0     0     0     0     0     0     0      0
aero-aerospike-3.aero-aerospike-mesh.default.svc.cluster.local:3000     0     0     0     0     0     0     0     0     0      0
aero-aerospike-4.aero-aerospike-mesh.default.svc.cluster.local:3000     0     0     0     0     0     0     0     0     0      0
aero-aerospike-5.aero-aerospike-mesh.default.svc.cluster.local:3000     0     0     0     0     0     0     0     0     0      0
Number of rows: 6

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ssd - Object Size Distribution in bytes (2019-03-25 10:39:52 UTC)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                             Percentage of records having objsz less than or equal to value measured in bytes
                                                               Node    10%    20%    30%    40%    50%    60%    70%    80%    90%    100%
aero-aerospike-0.aero-aerospike-mesh.default.svc.cluster.local:3000   1023   1023   1023   1023   2047   2047   2047   2047   2047   88063
aero-aerospike-1.aero-aerospike-mesh.default.svc.cluster.local:3000   1023   1023   1023   1023   2047   2047   2047   2047   2047   89087
aero-aerospike-2.aero-aerospike-mesh.default.svc.cluster.local:3000   1023   1023   1023   1023   2047   2047   2047   2047   2047   79871
aero-aerospike-3.aero-aerospike-mesh.default.svc.cluster.local:3000   1023   1023   1023   1023   2047   2047   2047   2047   2047   57343
aero-aerospike-4.aero-aerospike-mesh.default.svc.cluster.local:3000   1023   1023   1023   1023   2047   2047   2047   2047   2047   89087
aero-aerospike-5.aero-aerospike-mesh.default.svc.cluster.local:3000   1023   1023   1023   1023   2047   2047   2047   2047   2047   60415
Number of rows: 6

It looks like a lot of memory is allocated, and I am worried about pod eviction, as one pod already exceeds its requested value.

Do you recommend requesting more memory per pod? What is the recommended memory margin? Does Aerospike read /sys/fs/cgroup/memory/memory.limit_in_bytes (where the memory limit is set), and would you therefore suggest setting a memory limit in the Kubernetes spec (which can be dangerous)?
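For reference, the chart values above set only a request; adding a limit would look like the following sketch (assuming the chart passes resources.limits through to the pod spec; the 6G figure is only an illustration). Note that a pod exceeding its memory limit is OOM-killed, so any limit needs headroom above the namespace's memory-size (4G) for heap overhead and page cache.

```yaml
resources:
  requests:
    cpu: 740m
    memory: 5G
  limits:
    # Hypothetical limit: memory-size (4G) + heap/page-cache headroom.
    memory: 6G
```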

Thanks for your help!

#2

How long was it since the last write when these charts started climbing? What does the memory-heap line say in aerospike.log?

#3

Before the charts started climbing, we used a batch job to insert data. We now have a streaming job that updates existing keys (between 100 and 1000/s), so we have constant writes but no significant rise in the number of records.

This is an extract of the logs:

Mar 28 2019 09:44:54 GMT: INFO (info): (ticker.c:160) NODE-ID bb9068c340a580a CLUSTER-SIZE 6
Mar 28 2019 09:44:54 GMT: INFO (info): (ticker.c:238)    cluster-clock: skew-ms 0
Mar 28 2019 09:44:54 GMT: INFO (info): (ticker.c:265)    system-memory: free-kbytes 7828996 free-pct 58 heap-kbytes (3631196,3640200,4105216) heap-efficiency-pct 88.5
Mar 28 2019 09:44:54 GMT: INFO (info): (ticker.c:278)    in-progress: tsvc-q 0 info-q 0 nsup-delete-q 0 rw-hash 0 proxy-hash 0 tree-gc-q 0
Mar 28 2019 09:44:54 GMT: INFO (info): (ticker.c:299)    fds: proto (529,114887,114358) heartbeat (5,5,0) fabric (120,120,0)
Mar 28 2019 09:44:54 GMT: INFO (info): (ticker.c:307)    heartbeat-received: self 0 foreign 16635052
Mar 28 2019 09:44:54 GMT: INFO (info): (ticker.c:337)    fabric-bytes-per-second: bulk (0,0) ctrl (0,0) meta (0,0) rw (481,216)
Mar 28 2019 09:44:54 GMT: INFO (info): (ticker.c:376)    batch-index: batches (19203769,0,0)
Mar 28 2019 09:44:54 GMT: INFO (info): (hist.c:240) histogram dump: batch-index (19203769 total) msec
Mar 28 2019 09:44:54 GMT: INFO (info): (hist.c:257)  (00: 0017472846) (01: 0001462540) (02: 0000229776) (03: 0000030990)
Mar 28 2019 09:44:54 GMT: INFO (info): (hist.c:257)  (04: 0000006216) (05: 0000000989) (06: 0000000166) (07: 0000000112)
Mar 28 2019 09:44:54 GMT: INFO (info): (hist.c:266)  (08: 0000000093) (09: 0000000032) (10: 0000000009)
Mar 28 2019 09:44:54 GMT: INFO (info): (ticker.c:389) {ssd} objects: all 33787000 master 17441874 prole 16345126 non-replica 0
Mar 28 2019 09:44:54 GMT: INFO (info): (ticker.c:447) {ssd} migrations: complete
Mar 28 2019 09:44:54 GMT: INFO (info): (ticker.c:474) {ssd} memory-usage: total-bytes 2162368000 index-bytes 2162368000 sindex-bytes 0 used-pct 50.35
Mar 28 2019 09:44:54 GMT: INFO (info): (ticker.c:535) {ssd} device-usage: used-bytes 36427845024 avail-pct 77 cache-read-pct 0.00
Mar 28 2019 09:44:54 GMT: INFO (info): (ticker.c:585) {ssd} client: tsvc (0,0) proxy (0,0,0) read (0,0,0,0) write (85945228,0,0) delete (652767,0,0,1849983) udf (0,0,0) lang (0,0,0,0)
Mar 28 2019 09:44:54 GMT: INFO (info): (ticker.c:635) {ssd} batch-sub: tsvc (0,0) proxy (32324,0,0) read (56883427,0,0,880485)
Mar 28 2019 09:44:54 GMT: INFO (info): (hist.c:240) histogram dump: {ssd}-write (85945228 total) msec
Mar 28 2019 09:44:54 GMT: INFO (info): (hist.c:257)  (00: 0080615443) (01: 0003528366) (02: 0001282688) (03: 0000440825)
Mar 28 2019 09:44:54 GMT: INFO (info): (hist.c:257)  (04: 0000074106) (05: 0000003389) (06: 0000000240) (07: 0000000136)
Mar 28 2019 09:44:54 GMT: INFO (info): (hist.c:266)  (08: 0000000034) (09: 0000000001)
Mar 28 2019 09:44:57 GMT: INFO (drv_ssd): (drv_ssd.c:2185) {ssd} /opt/aerospike/data/ssd.dat: used-bytes 36427845024 free-wblocks 158290 write-q 0 write (293410,0.2) defrag-q 0 defrag-read (246883,0.2) defrag-write (122684,0.1)
Mar 28 2019 09:44:58 GMT: WARNING (hardware): (hardware.c:2258) failed to resolve mounted device /dev/sdc: 2 (No such file or directory)
Mar 28 2019 09:45:03 GMT: WARNING (hardware): (hardware.c:2258) failed to resolve mounted device /dev/sdc: 2 (No such file or directory)
Mar 28 2019 09:45:04 GMT: INFO (info): (ticker.c:160) NODE-ID bb9068c340a580a CLUSTER-SIZE 6
Mar 28 2019 09:45:04 GMT: INFO (info): (ticker.c:238)    cluster-clock: skew-ms 0
Mar 28 2019 09:45:04 GMT: INFO (info): (ticker.c:265)    system-memory: free-kbytes 7829024 free-pct 58 heap-kbytes (3631410,3640680,4105216) heap-efficiency-pct 88.5
Mar 28 2019 09:45:04 GMT: INFO (info): (ticker.c:278)    in-progress: tsvc-q 0 info-q 0 nsup-delete-q 0 rw-hash 0 proxy-hash 0 tree-gc-q 0
Mar 28 2019 09:45:04 GMT: INFO (info): (ticker.c:299)    fds: proto (531,114890,114359) heartbeat (5,5,0) fabric (120,120,0)
Mar 28 2019 09:45:04 GMT: INFO (info): (ticker.c:307)    heartbeat-received: self 0 foreign 16635385
Mar 28 2019 09:45:04 GMT: INFO (info): (ticker.c:337)    fabric-bytes-per-second: bulk (0,0) ctrl (0,0) meta (0,0) rw (7656,6127)
Mar 28 2019 09:45:04 GMT: INFO (info): (ticker.c:376)    batch-index: batches (19204276,0,0)
Mar 28 2019 09:45:04 GMT: INFO (info): (hist.c:240) histogram dump: batch-index (19204276 total) msec
Mar 28 2019 09:45:04 GMT: INFO (info): (hist.c:257)  (00: 0017473319) (01: 0001462569) (02: 0000229779) (03: 0000030992)
Mar 28 2019 09:45:04 GMT: INFO (info): (hist.c:257)  (04: 0000006216) (05: 0000000989) (06: 0000000166) (07: 0000000112)
Mar 28 2019 09:45:04 GMT: INFO (info): (hist.c:266)  (08: 0000000093) (09: 0000000032) (10: 0000000009)
Mar 28 2019 09:45:04 GMT: INFO (info): (ticker.c:389) {ssd} objects: all 33786984 master 17441867 prole 16345117 non-replica 0
Mar 28 2019 09:45:04 GMT: INFO (info): (ticker.c:447) {ssd} migrations: complete
Mar 28 2019 09:45:04 GMT: INFO (info): (ticker.c:474) {ssd} memory-usage: total-bytes 2162366976 index-bytes 2162366976 sindex-bytes 0 used-pct 50.35
Mar 28 2019 09:45:04 GMT: INFO (info): (ticker.c:535) {ssd} device-usage: used-bytes 36427822288 avail-pct 77 cache-read-pct 0.00
Mar 28 2019 09:45:04 GMT: INFO (info): (ticker.c:585) {ssd} client: tsvc (0,0) proxy (0,0,0) read (0,0,0,0) write (85945366,0,0) delete (652774,0,0,1849987) udf (0,0,0) lang (0,0,0,0)
Mar 28 2019 09:45:04 GMT: INFO (info): (ticker.c:635) {ssd} batch-sub: tsvc (0,0) proxy (32324,0,0) read (56884974,0,0,880498)
Mar 28 2019 09:45:04 GMT: INFO (info): (hist.c:240) histogram dump: {ssd}-write (85945366 total) msec
Mar 28 2019 09:45:04 GMT: INFO (info): (hist.c:257)  (00: 0080615556) (01: 0003528380) (02: 0001282696) (03: 0000440827)
Mar 28 2019 09:45:04 GMT: INFO (info): (hist.c:257)  (04: 0000074107) (05: 0000003389) (06: 0000000240) (07: 0000000136)
Mar 28 2019 09:45:04 GMT: INFO (info): (hist.c:266)  (08: 0000000034) (09: 0000000001)
Mar 28 2019 09:45:08 GMT: WARNING (hardware): (hardware.c:2258) failed to resolve mounted device /dev/sdc: 2 (No such file or directory)
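(As a cross-check of the memory-usage ticker line above: Aerospike's primary index costs 64 bytes per record, and the node's object count reproduces the reported index-bytes exactly.)

```shell
# 64 bytes of primary-index metadata per record (Aerospike sizing rule).
RECORDS=33787000   # "objects: all" from the ticker line above
echo $((RECORDS * 64))   # 2162368000 bytes, matching index-bytes
```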
#4

I’m a bit confused by this… Why is it looking for sdc when you only have a file specified? Are there multiple namespaces?

I suspect it might have something to do with using a file… The kernel might be trying to cache things here. While the memory usage is high, what does it drop down to if you run echo 3 > /proc/sys/vm/drop_caches?
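If writing to drop_caches isn't possible (for example, /proc/sys mounted read-only inside a container), a read-only way to watch page-cache size is to sample /proc/meminfo instead — a sketch:

```shell
# Watch kernel page-cache usage without writing to /proc:
# "Cached" and "Buffers" together approximate the page cache.
grep -E '^(Cached|Buffers):' /proc/meminfo
```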

https://www.aerospike.com/docs/reference/configuration/index.html#direct-files might be a solution, but I'm not certain.

#5

I have a single namespace (called ssd). The warnings seem to have appeared since the Helm chart update that upgraded Aerospike from 3.14.1.2 to 4.5.0.5 (https://github.com/helm/charts/commit/dbc29ce3fe1093298443532d342a1ecde2475b44#diff-46d35d6383b4ab84a77ca2078336b90e). But I had the same memory issue when I used version 3.

This is the complete config file:

# Aerospike configuration
# default config file
service {
    user root
    group root
    paxos-protocol v5
    paxos-single-replica-limit 1
    pidfile /var/run/aerospike/asd.pid
    service-threads 4
    transaction-queues 4
    transaction-threads-per-queue 4
    proto-fd-max 15000
}

logging {
    console {
        context any info
    }
}

network {
    service {
        address any
        port 3000
    }
    heartbeat {
        address any
        interval 150

        mesh-seed-address-port aero-aerospike-0.aero-aerospike-mesh 3002
        mesh-seed-address-port aero-aerospike-1.aero-aerospike-mesh 3002
        mesh-seed-address-port aero-aerospike-2.aero-aerospike-mesh 3002
        mesh-seed-address-port aero-aerospike-3.aero-aerospike-mesh 3002
        mesh-seed-address-port aero-aerospike-4.aero-aerospike-mesh 3002
        mesh-seed-address-port aero-aerospike-5.aero-aerospike-mesh 3002
        mode mesh
        port 3002
        timeout 20
        protocol v3
    }
    fabric {
        port 3001
    }
    info {
        port 3003
    }
}

namespace ssd {
    replication-factor 2 # For multiple nodes, keep 2 copies of the data
    memory-size 4G # 4GB of memory to be used for index and data
    default-ttl 0 # Writes from clients that do not provide a TTL default to 0, i.e., never expire
    high-water-memory-pct 80 # Evict non-zero-TTL data if memory usage exceeds 80% of 4GB
    stop-writes-pct 90 # Stop writes if memory usage exceeds 90% of 4GB
    storage-engine device {
      file /opt/aerospike/data/ssd.dat
      filesize 200G
    }
}

And this is the result of the command to free the cache memory (run from inside the container):

root@aero-aerospike-5:/# echo 3 > /proc/sys/vm/drop_caches
bash: /proc/sys/vm/drop_caches: Read-only file system

I will try the direct-files and read-page-cache configs; are there any cons?

#6

Aerospike determines the underlying device so that it can monitor the device’s health stats as well as set the scheduler if a scheduler was selected in the Aerospike config.

In this case, I suspect the container doesn't have permission to access the raw device. For this reason, other places have downgraded this from a warning to a detail message - I suspect this warning should also become a detail. @tlo?

#7

Depends on the workload. In some cases read-page-cache might improve performance, but usually only if you are reading the same objects over and over. Leaving it disabled gives you more control over memory utilization. Most configurations use raw devices and don't use read-page-cache at all, so the most common setup doesn't use it. Let us know if it helps!

#8

Oh, yes, that warning looks odd. It’s probably unrelated to the memory issue, but we should still fix it.

Here’s what seems to be going on:

Aerospike tries to find the underlying physical device for /opt/aerospike/data/ssd.dat in order to set its scheduler. This process goes from the file’s path to the mount point of the file’s file system, to the block device of the mounted partition, to the underlying logical device (e.g., LVM or DM), to the physical device.

Here, something seems to go wrong in the step that goes from the file path to the mount point. /proc/mounts seems to indicate that /opt/aerospike/data/ssd.dat sits on a file system on /dev/sdc. But when Aerospike tries to look at /dev/sdc, it seems that /dev/sdc doesn’t actually exist under /dev.

Could this be a container-related issue, i.e., that inside the container /proc/mounts indicates the correct mount point but /dev doesn’t contain the corresponding block device?

If you are interested in looking into this (I certainly am, even though it is probably unrelated to the memory issue), I’d like to see the output of these two commands inside the container, please:

  • cat /proc/mounts
  • ls -l /dev
#9

Careful, though, with read-page-cache: when using this option, the kernel would again spend memory on caching read data in the page cache. As a first test, I'd rather use direct-files without read-page-cache. This completely bypasses the page cache and gives you a baseline measurement of memory usage without page-cache use.

Question: What exactly does the memory consumption reported by the reporting tool include?

From the htop output (run outside the containers, I guess) it seems that the asd processes take up a bit over 4,000M of memory. This is in line with the “system-memory” line in the log file you provided:

system-memory: [...] heap-kbytes (3631410,3640680,4105216) [...]

The last of the three numbers, which is the amount of memory asd got from the Linux kernel, roughly corresponds to the ~4,000M of memory indicated in htop.
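(Quick conversion: heap-kbytes values are in KiB, so the third figure works out to roughly the 4,000M shown in htop.)

```shell
# Third heap-kbytes figure = memory asd obtained from the kernel, per the
# explanation above; convert KiB to MiB.
echo $((4105216 / 1024))   # 4009 MiB, i.e. the ~4,000M seen in htop
```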

So, my guess would be that these ~4,000M were lower in previous versions of Aerospike. It is conceivable that newer versions of Aerospike have a slightly higher baseline memory use.

I would thus share Albot’s suspicion that the page cache is included in the memory usage reported by your dashboard and that the page cache is what now pushes you over 5 GiB. Possibly, with an older version of Aerospike the page cache didn’t push you over the 5 GiB, because Aerospike’s base line memory use was a bit lower.

Let’s see what direct-files does for you.

(What’s odd, though: How does page cache use factor into a container’s memory use? The page cache is managed by the Linux kernel, so it’s… shared by all containers? Or is there a separate page cache per container? Maybe we’d have to check the Linux kernel code for that. Maybe the page cache is namespaced, just like process IDs, for example, where each container has its own process IDs.)

#10
  • cat /proc/mounts
overlay / overlay rw,relatime,lowerdir=/var/lib/docker/overlay2/l/K2UPOYT3YRIUMYWTSUN4I4MHXI:/var/lib/docker/overlay2/l/7ARKPNYSF3ZFDLDNYF57RFYK3I:/var/lib/docker/overlay2/l/AN4NVMZW226HTYNTBGW77SRQCZ:/var/lib/docker/overlay2/l/LFD23XSMNJ66MYOXG7KEOXNUP7:/var/lib/docker/overlay2/l/4AE5Q5KKR5NXZHGU2RT53NQEQP,upperdir=/var/lib/docker/overlay2/d7754fdf0031e6eeba4458b06441057c3ed3a4c212e37c68307b09a5f38a5f24/diff,workdir=/var/lib/docker/overlay2/d7754fdf0031e6eeba4458b06441057c3ed3a4c212e37c68307b09a5f38a5f24/work 0 0
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
tmpfs /dev tmpfs rw,nosuid,mode=755 0 0
devpts /dev/pts devpts rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=666 0 0
sysfs /sys sysfs ro,nosuid,nodev,noexec,relatime 0 0
tmpfs /sys/fs/cgroup tmpfs ro,nosuid,nodev,noexec,relatime,mode=755 0 0
cgroup /sys/fs/cgroup/systemd cgroup ro,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd 0 0
cgroup /sys/fs/cgroup/devices cgroup ro,nosuid,nodev,noexec,relatime,devices 0 0
cgroup /sys/fs/cgroup/net_cls,net_prio cgroup ro,nosuid,nodev,noexec,relatime,net_cls,net_prio 0 0
cgroup /sys/fs/cgroup/cpu,cpuacct cgroup ro,nosuid,nodev,noexec,relatime,cpu,cpuacct 0 0
cgroup /sys/fs/cgroup/blkio cgroup ro,nosuid,nodev,noexec,relatime,blkio 0 0
cgroup /sys/fs/cgroup/cpuset cgroup ro,nosuid,nodev,noexec,relatime,cpuset 0 0
cgroup /sys/fs/cgroup/freezer cgroup ro,nosuid,nodev,noexec,relatime,freezer 0 0
cgroup /sys/fs/cgroup/perf_event cgroup ro,nosuid,nodev,noexec,relatime,perf_event 0 0
cgroup /sys/fs/cgroup/hugetlb cgroup ro,nosuid,nodev,noexec,relatime,hugetlb 0 0
cgroup /sys/fs/cgroup/memory cgroup ro,nosuid,nodev,noexec,relatime,memory 0 0
cgroup /sys/fs/cgroup/pids cgroup ro,nosuid,nodev,noexec,relatime,pids 0 0
cgroup /sys/fs/cgroup/rdma cgroup ro,nosuid,nodev,noexec,relatime,rdma 0 0
mqueue /dev/mqueue mqueue rw,nosuid,nodev,noexec,relatime 0 0
/dev/sda1 /etc/aerospike ext4 ro,relatime,commit=30,data=ordered 0 0
/dev/sda1 /dev/termination-log ext4 rw,relatime,commit=30,data=ordered 0 0
/dev/sda1 /etc/resolv.conf ext4 rw,nosuid,nodev,relatime,commit=30,data=ordered 0 0
/dev/sda1 /etc/hostname ext4 rw,nosuid,nodev,relatime,commit=30,data=ordered 0 0
/dev/sda1 /etc/hosts ext4 rw,relatime,commit=30,data=ordered 0 0
shm /dev/shm tmpfs rw,nosuid,nodev,noexec,relatime,size=65536k 0 0
/dev/sdc /opt/aerospike/data ext4 rw,relatime,data=ordered 0 0
tmpfs /run/secrets/kubernetes.io/serviceaccount tmpfs ro,relatime 0 0
proc /proc/bus proc ro,nosuid,nodev,noexec,relatime 0 0
proc /proc/fs proc ro,nosuid,nodev,noexec,relatime 0 0
proc /proc/irq proc ro,nosuid,nodev,noexec,relatime 0 0
proc /proc/sys proc ro,nosuid,nodev,noexec,relatime 0 0
proc /proc/sysrq-trigger proc ro,nosuid,nodev,noexec,relatime 0 0
tmpfs /proc/kcore tmpfs rw,nosuid,mode=755 0 0
tmpfs /proc/timer_list tmpfs rw,nosuid,mode=755 0 0
tmpfs /sys/firmware tmpfs ro,relatime 0 0
  • ls -l /dev
total 0
lrwxrwxrwx 1 root root   11 Mar 22 14:52 core -> /proc/kcore
lrwxrwxrwx 1 root root   13 Mar 22 14:52 fd -> /proc/self/fd
crw-rw-rw- 1 root root 1, 7 Mar 22 14:52 full
drwxrwxrwt 2 root root   40 Mar 22 14:51 mqueue
crw-rw-rw- 1 root root 1, 3 Mar 22 14:52 null
lrwxrwxrwx 1 root root    8 Mar 22 14:52 ptmx -> pts/ptmx
drwxr-xr-x 2 root root    0 Mar 22 14:52 pts
crw-rw-rw- 1 root root 1, 8 Mar 22 14:52 random
drwxrwxrwt 2 root root   40 Mar 22 14:51 shm
lrwxrwxrwx 1 root root   15 Mar 22 14:52 stderr -> /proc/self/fd/2
lrwxrwxrwx 1 root root   15 Mar 22 14:52 stdin -> /proc/self/fd/0
lrwxrwxrwx 1 root root   15 Mar 22 14:52 stdout -> /proc/self/fd/1
-rw-rw-rw- 1 root root    0 Mar 22 14:52 termination-log
crw-rw-rw- 1 root root 5, 0 Mar 22 14:52 tty
crw-rw-rw- 1 root root 1, 9 Mar 22 14:52 urandom
crw-rw-rw- 1 root root 1, 5 Mar 22 14:52 zero

I'm not sure the memory issue is related to the Aerospike version; I had the same issue with version 3 (and with Ubuntu nodes instead of Container-Optimized OS nodes). Actually, raw devices are not supported on Kubernetes GKE (version 1.12), but I hope 1.13 will be released soon so that I can change this config.

Yes, the htop output was run from the node. I have run the command from the container too, and the result is the same (so the container sees all of the node's memory).

Memory is high but looks stable; my priority is performance, so that might be fine as long as it does not grow further. With this percentage of memory usage, though, it's difficult to scale the cluster using Kubernetes metrics.

Note that I didn't set a memory limit in Kubernetes, because if memory exceeds that limit, the container gets restarted. Without this limit, I guess the container is not aware of how much memory it has. The question is (I'm really not an expert on all these topics): if I set the limit, will the "page cache" you are talking about be aware of it? I read an article about memory limits on Kubernetes that may help:

Thanks for the help; I will try direct-files and let you know!

#11

Thanks for the contents of /proc/mounts and /dev. Yes, it does look like /dev/sdc holds /opt/aerospike/data, but is missing from /dev inside the container. Hence the warning.

I guess we should downgrade the warning to a debug message, as suggested by @kporter, for the benefit of container users.

I took a look at the cgroups documentation of the Linux kernel, which is the mechanism on which Docker builds when limiting container resources.

It does seem like the kernel does page cache accounting on a per-cgroup basis, i.e., even though all running containers share the kernel’s page cache, cached pages count against the limits of the container that caused them to be cached.

As far as I know you cannot specify a size for the page cache. The kernel basically says “Hey, I’ll just use the unused memory in the system. It’s just a cache, after all. So when a process needs more memory, I can always evict parts of the page cache to free up some memory.”

But how much unused memory does the kernel think there is? I.e., how big can the page cache grow?

Without containers, it's pretty straightforward: the total amount of memory minus the amount of allocated memory. But what is the total amount of memory in a container?

My guess would be that this is what’s specified via the Kubernetes memory limit. So, if we set the Kubernetes memory limit to 5 GiB and Aerospike uses ~4 GiB, then I’d expect the container to not use more than the remaining 5 - 4 = 1 GiB of memory for the page cache. Thus, we’d effectively have limited the page cache size to 1 GiB.

But that’s just a guess. The above cgroups documentation doesn’t seem to say anything about this.

In any case, you can certainly keep the page cache utilization from growing by using direct-files. Then you probably won’t need to set a memory limit with Kubernetes.
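Applied to the config posted earlier, that change would be a single added line in the storage-engine stanza (a sketch; the rest of the namespace configuration stays unchanged):

```
    storage-engine device {
        file /opt/aerospike/data/ssd.dat
        filesize 200G
        direct-files true
    }
```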

#12

Using "direct-files" solves the issue :slight_smile: Thanks a lot! Out of curiosity, why is this parameter not enabled by default? Perhaps for backward compatibility with version 3? There is no difference in performance for me.


#13

Thanks for reporting back! I’m glad that this resolved your problem.

Correct, the direct-files option is disabled by default for backwards compatibility. Historically, Aerospike has always accessed files (as opposed to block devices) by way of the page cache. When we introduced the direct-files option to bypass the page cache for files, we decided to keep it disabled as the default, because it has performance implications.

closed #14

This topic was automatically closed 6 days after the last reply. New replies are no longer allowed.