Memory leak and data loss on Aerospike

Hello community experts!

I set up a cluster with 6 nodes; each node has 128 GB of memory.

Aerospike version: 4.8.0.5, client: 4.4.9

Config file:

# This stanza must come first.
service {
  user root
  group root
  paxos-single-replica-limit 1
  proto-fd-max 20000
  log-local-time true
}

logging {
  # Log file must be an absolute path.
  file /data/log/aerospike-server/port3000.log {
    context any info
  }
}

network {
  service {
    address any
    port 3000
  }
  heartbeat {
    address vlanxxx
    mode mesh
    port 3002
    mesh-seed-address-port 192.168.1.10 3002
    mesh-seed-address-port 192.168.1.11 3002
    mesh-seed-address-port 192.168.1.12 3002
    mesh-seed-address-port 192.168.1.13 3002
    mesh-seed-address-port 192.168.1.14 3002
    mesh-seed-address-port 192.168.1.15 3002
    interval 150
    timeout 10
  }

  fabric {
    address vlanxxx
    port 3001
  }

  info {
    port 3003
  }
}

namespace mem_storage {
  replication-factor 2
  memory-size 12G
  default-ttl 30 # 30 days, use 0 to never expire/evict.
  storage-engine memory
}

Symptom: I use the map type to read from / write to Aerospike. After a period of time, network bandwidth in the cluster becomes uneven: node 192.168.1.10 has outbound bandwidth of ~990 Mbit/s, node 192.168.1.15 has inbound bandwidth of ~900 Mbit/s, and the other nodes are at ~20 Mbit/s in/out.

Log on node 192.168.1.10:

GMT: WARNING (fabric): (fabric.c:2031) fabric_connection_process_readable(0x7fa20fe0d088) invalid msg_size 135901932 remote 0x1771b2128ce94202

Memory used by the aerospike service is ~2 GB, but memory usage on node 192.168.1.10 kept increasing until it caused an OOM. I restarted the aerospike service on node 192.168.1.10 and memory became stable, but then partition migration between node 192.168.1.10 and node 192.168.1.15 failed.

Log on node 192.168.1.15:

WARNING (migrate): (migrate.c: 1019) missing acks from node 1771b2128ce94202.

I force-restarted the service on node 192.168.1.15, resulting in partial loss of data.

Java client code:

public int saveItemsToMap(List<UserActivityGroup> activityGroups, Long user_id) {
    try {
        MapPolicy mapPolicy = new MapPolicy(MapOrder.UNORDERED, MapWriteFlags.DEFAULT);
        Key key = new Key(namespace, set_name, user_id);

        // Build the map entries to merge into this user's record.
        Map<Value, Value> map = new HashMap<>();
        for (UserActivityGroup activityGroup : activityGroups) {
            map.put(
                    Value.get(createMapKey(activityGroup)),
                    Value.get(createMapValue(activityGroup))
            );
        }

        // Async variant (not currently used):
        // client.operate(eventLoops.next(), null, writePolicy, key,
        //         MapOperation.putItems(mapPolicy, bin_name, map));

        // putItems merges the entries into the map bin and returns the resulting map size.
        Record record = client.operate(writePolicy, key, MapOperation.putItems(mapPolicy, bin_name, map));
        return record.getInt(bin_name);
    } catch (Exception e) {
        logger.error(e.getMessage(), e);
    }
    return 0;
}

Thank you very much!

How many different user_ids is the calling code passing to this function? What is the size of the UserActivityGroup list?

Hi @pgupta, it's around 700-1000 user ids for the first run, and 20-50 for each run after that. Runs are separated by 1 minute after the previous run finishes. The UserActivityGroup list is 250 items for each push. The largest UserActivityGroup list for a single user is about 500k items, but that applies to only a small fraction of users; most are in the range of a few hundred to a few thousand items.

First, what is confusing me is the uneven load on the nodes. If you have 6 nodes and 1000 user ids - your key is based on the user id, so records should be evenly distributed across all nodes. So why are only 2 nodes taking all the data? Is the cluster properly formed, or are you somehow only updating one or two user ids, so that all the load is going to only 2 nodes?

Please run

$ asadm -e info

and see if you are in fact successfully forming a 6-node cluster; otherwise, check the config files on each node.

Secondly, I don’t get your map Key:Value data - both are built from the same activityGroup … it’s like creating a map with a “swimming”:“swimming” k-v pair … but that’s beside the point.

What you are doing is inserting map K-V pairs into an existing in-memory record, which keeps growing in size and at some point exceeds a size that seems to be OOMing the node and then has trouble being exchanged on the fabric - which is what you are seeing when you bring one node down and restart it. Since your namespace is pure in-memory, when you stop the second node you lose both copies of the data, hence the data loss.

Based on your key size, value size, and the number of K-V pairs you are inserting into a record, do a record size estimate purely from the data standpoint. Aerospike should ideally be used for records not exceeding 1 MB, although 8 MB is also allowed; that limit is enforced when using persistent storage on disk, and since you are using pure in-memory storage you are able to exceed it. And since you are continually growing the record for a given user id by incrementally adding more K:V pairs, it may be growing well beyond a size that causes issues on the fabric.
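As a rough illustration of that growth (a minimal sketch - the namespace, set, bin name and payload below are made up, and it assumes an existing AerospikeClient named client with the usual com.aerospike.client imports):

    // Repeated putItems calls against the SAME key keep growing ONE record in memory.
    Key key = new Key("mem_storage", "test_set", 12345L);
    MapPolicy mapPolicy = new MapPolicy(MapOrder.UNORDERED, MapWriteFlags.DEFAULT);

    for (int batch = 0; batch < 100; batch++) {
        Map<Value, Value> items = new HashMap<>();
        for (int i = 0; i < 250; i++) {          // ~250 items per push, as described above
            long groupId = batch * 250L + i;
            items.put(Value.get(groupId), Value.get("payload-" + groupId));
        }
        // putItems merges into the existing map bin and returns the map size after the write,
        // so the printed count shows the same record growing batch after batch.
        Record r = client.operate(null, key, MapOperation.putItems(mapPolicy, "activity", items));
        System.out.println("entries after batch " + batch + ": " + r.getLong("activity"));
    }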

You may have initially loaded about 1000 records and then just kept inserting K:V pairs into a few of them, hitting evictions on the old ones once memory usage exceeds 60% of the 12 GB allocated to this namespace (about 7.2 GB). All the previous records may have been evicted, leaving you with a growing select few - records that stay below the eviction threshold (based on the evict-tenths-percent parameter) and keep growing, eventually OOMing the system.

Having said all this, this does not look like a realistic, properly sized data model. I hope that, with all these things to look for, you can troubleshoot and pinpoint what is throwing your cluster into a bad state.


Hi @pgupta, this is the result after running the command:

            Node               Node                 Ip       Build   Cluster   Migrations        Cluster     Cluster         Principal   Client      Uptime   
               .                 Id                  .           .      Size            .            Key   Integrity                 .    Conns           .   
node1:3000   BB99AD8B46B1FAC    node1:3000   C-4.8.0.5         6      0.000     99A087A7BC2D   True        BB9F9D7B46B1FAC      161   146:40:22   
node2:3000   *BB9F9D7B46B1FAC   node2:3000   C-4.8.0.5         6      0.000     99A087A7BC2D   True        BB9F9D7B46B1FAC      184   147:50:54   
node3:3000   BB9D6D9B46B1FAC    node3:3000   C-4.8.0.5         6      0.000     99A087A7BC2D   True        BB9F9D7B46B1FAC      161   147:09:52   
node4:3000   BB92CDAB46B1FAC    node4:3000   C-4.8.0.5         6      0.000     99A087A7BC2D   True        BB9F9D7B46B1FAC      158   166:01:09   
node5:3000   BB9A7DAB46B1FAC    node5:3000   C-4.8.0.5         6      0.000     99A087A7BC2D   True        BB9F9D7B46B1FAC      173   166:14:08   
node6:3000   BB9AFDAB46B1FAC    node6:3000   C-4.8.0.5         6      0.000     99A087A7BC2D   True        BB9F9D7B46B1FAC      184   147:46:31   
Number of rows: 6

    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Namespace Usage Information (2020-02-24 07:06:04 UTC)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      Namespace               Node      Total   Expirations,Evictions     Stop       Disk    Disk     HWM   Avail%         Mem     Mem    HWM      Stop          PI           PI      PI     PI   
              .                  .    Records                       .   Writes       Used   Used%   Disk%        .        Used   Used%   Mem%   Writes%        Type         Used   Used%   HWM%   
    mem_storage   node1:3000    5.824 M   (4.462 M, 0.000)        false         N/E   N/E     50      N/E       2.151 GB   18      60     90        undefined   355.485 MB   0       N/E    
    mem_storage   node2:3000    5.847 M   (4.701 M, 0.000)        false         N/E   N/E     50      N/E       2.160 GB   19      60     90        undefined   356.877 MB   0       N/E    
    mem_storage   node3:3000    6.071 M   (5.221 M, 0.000)        false         N/E   N/E     50      N/E       2.242 GB   19      60     90        undefined   370.563 MB   0       N/E    
    mem_storage   node4:3000    5.511 M   (5.357 M, 0.000)        false         N/E   N/E     50      N/E       2.034 GB   17      60     90        undefined   336.342 MB   0       N/E    
    mem_storage   node5:3000    5.899 M   (5.843 M, 0.000)        false         N/E   N/E     50      N/E       2.178 GB   19      60     90        undefined   360.039 MB   0       N/E    
    mem_storage   node6:3000    5.876 M   (4.720 M, 0.000)        false         N/E   N/E     50      N/E       2.171 GB   19      60     90        undefined   358.665 MB   0       N/E    
    mem_storage                      35.029 M   (30.304 M, 0.000)                0.000 B                             12.937 GB                                          0.000 B                   
    Number of rows: 7

    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Namespace Object Information (2020-02-24 07:06:04 UTC)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      Namespace               Node      Total     Repl                         Objects                   Tombstones             Pending   Rack   
              .                  .    Records   Factor      (Master,Prole,Non-Replica)   (Master,Prole,Non-Replica)            Migrates     ID   
              .                  .          .        .                               .                            .             (tx,rx)      .   
    mem_storage   node1:3000    5.824 M   2        (2.862 M, 2.962 M, 0.000)       (0.000,  0.000,  0.000)      (0.000,  0.000)     0      
    mem_storage   node2:3000    5.847 M   2        (3.074 M, 2.773 M, 0.000)       (0.000,  0.000,  0.000)      (0.000,  0.000)     0      
    mem_storage   node3:3000    6.071 M   2        (2.948 M, 3.123 M, 0.000)       (0.000,  0.000,  0.000)      (0.000,  0.000)     0      
    mem_storage   node4:3000    5.511 M   2        (2.673 M, 2.838 M, 0.000)       (0.000,  0.000,  0.000)      (0.000,  0.000)     0      
    mem_storage   node5:3000    5.899 M   2        (3.004 M, 2.894 M, 0.000)       (0.000,  0.000,  0.000)      (0.000,  0.000)     0      
    mem_storage   node6:3000    5.876 M   2        (2.952 M, 2.924 M, 0.000)       (0.000,  0.000,  0.000)      (0.000,  0.000)     0      
    mem_storage                      35.029 M            (17.514 M, 17.514 M, 0.000)     (0.000,  0.000,  0.000)      (0.000,  0.000)            
    Number of rows: 7


And this is how I create the map key and value:
    private Long createMapKey(UserActivityGroup activityGroup) {
        try {
            // Outer map key: the activity group id.
            return activityGroup.group_id;
        } catch (Exception e) {
            logger.error(e.getMessage(), e);
        }
        return null;
    }

    private Map<Long, String> createMapValue(UserActivityGroup activityGroup) {
        try {
            // Outer map value: a single-entry inner map keyed by the negated start_time,
            // with the activity group serialized to JSON as the inner value.
            Map<Long, String> map = new HashMap<>();
            Long key = -activityGroup.start_time;
            map.put(key, new ObjectMapper().writeValueAsString(activityGroup));
            return map;
        } catch (Exception e) {
            logger.error(e.getMessage(), e);
        }
        return null;
    }


My value is another map whose key is the (negated) start_time timestamp, so when I get values from the map I can retrieve them ordered by start_time using the rank of the Map data type.
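For reading back in that order, this is roughly the rank-based read I mean (a minimal sketch; the count of 10 is just an example):

    // getByRankRange returns map entries ordered by value rank; with -start_time embedded
    // in the value, the lowest ranks should correspond to the newest activities.
    Record record = client.operate(writePolicy, key,
            MapOperation.getByRankRange(bin_name, 0, 10, MapReturnType.VALUE));
    List<?> newestTen = record.getList(bin_name);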

Hi @pgupta, I checked in AMC: memory used by Aerospike is ~18%, but the system still OOMs. The server runs only the aerospike service.

I can’t spot anything troubling with Aerospike. Perhaps someone else will.

You should compute the size of the largest record you are inserting with a 500K list size - depending on the size of the key data (group_id) and the value as a string - and see what that estimate is. If id + value is 256+ bytes, 500K+ inserts into the same record mean the in-memory record could exceed 128 MB, which could explain the fabric message error. If your node actually has 128 GB of physical memory, I am not able to explain why you are OOMing.
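A back-of-envelope version of that estimate (the per-entry sizes below are assumptions - plug in your real group_id and JSON sizes):

    // Rough record size estimate for the largest user, purely from the data standpoint.
    long entries    = 500_000L;  // largest UserActivityGroup list you mentioned
    long keyBytes   = 8;         // group_id as a long
    long valueBytes = 256;       // JSON string for one UserActivityGroup (assumed)
    long estimatedBytes = entries * (keyBytes + valueBytes);
    System.out.printf("~%.0f MB for one record%n", estimatedBytes / (1024.0 * 1024.0));
    // ~126 MB - the same order of magnitude as the "invalid msg_size 135901932" fabric warning.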

@pgupta: Thanks for your support.