Server Memory Error (Error Code 8)

error

#1

Hi,

I’m new to Aerospike and is in the process of setting up my first application data. I’m storing a text key, a java Map object and a text field in my aerospike set. I’m using 2 nodes to write data. After writing around 2 million records, I started getting this exception.

com.aerospike.client.AerospikeException: Error Code 8: Server memory error
    at com.aerospike.client.command.WriteCommand.parseResult(WriteCommand.java:72)
    at com.aerospike.client.command.SyncCommand.execute(SyncCommand.java:56)

Here’s my server configuration

OS : CentOS 6.6
Memory : 30 GB
SSD volume : 200GB

Here’s my configuration file

service {
user root
group root
paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
pidfile /var/run/aerospike/asd.pid
service-threads 4
transaction-queues 8
transaction-threads-per-queue 8
proto-fd-max 15000
}

logging { # Log file must be an absolute path. file /mnt/ebs2/aerospike/log/aerospike.log { context any info } }

network { service { address any port 3000 }

    heartbeat {
            mode mesh
            port 3002 # Heartbeat port for this node.
            address 10.X.X.X

            # List one or more other nodes, one ip-address & port per line:
            mesh-seed-address-port 10.X.X.X 3002

            interval 150
            timeout 10
    }

    fabric {
            port 3001
    }

    info {
            port 3003
    }

}

namespace caas { replication-factor 2 memory-size 10G default-ttl 0 # 30 days, use 0 to never expire/evict.

    storage-engine memory

    # To use file storage backing, comment out the line above and use the
    # following lines instead.
    storage-engine device {
            file /mnt/ebs2/aerospike/data/caas.dat
            filesize 50G
   }

}

Not sure what exactly is causing the error ? Is it due to insufficient memory size or the physical space allocated (50G) ? I’ve looked into AMC (Attached screenshot) , but not sure I can interpret it right as far as the error is concerned.

Here’s the output from asadm

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Service Information~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 Node   Build   Cluster      Cluster     Cluster    Free   Free   Migrates   Principal   Objects      Uptime
    .       .      Size   Visibility   Integrity   Disk%   Mem%          .           .         .           .

ip-10-254 3.6.0 2 True True 17 97 (0,0) ip-10-254 4.226 M 240:25:27 ip-10-255 3.6.0 2 True True 17 97 (0,0) ip-10-254 4.226 M 240:19:20 Number of rows: 2

~Network Information~ Node Node Fqdn Ip Client Current HB HB . Id . . Conns Time Self Foreign ip-10-254 *BB9EA97FE0A0022 ip-10-x-x-x.us-west-2.compute.internal:3000 10.x.x.x:3000 46 183156430 0 11476738 ip-10-255 BB9BC0CFF0A0022 ip-10-x-x-x.us-west-2.compute.internal:3000 10.x.x.x:3000 46 183156365 0 11476736 Number of rows: 2

Namespace Information~ Node Namespace Avail% Evictions Master Replica Repl Stop Disk Disk HWM Mem Mem HWM Stop . . . . Objects Objects Factor Writes Used Used% Disk% Used Used% Mem% Writes% ip-10-254 caas 4 0 2.074 M 2.152 M 2 true 41.301 GB 83 50 281.415 MB 3 60 90 ip-10-255 caas 4 0 2.152 M 2.074 M 2 true 41.301 GB 83 50 281.371 MB 3 60 90 Number of rows: 2

Here’s the log entry :

Oct 21 2015 20:47:18 GMT: INFO (info): (hist.c::163)  (08: 0000000030) (09: 0000000016)
Oct 21 2015 20:47:18 GMT: INFO (info): (hist.c::137) histogram dump: proxy (729283 total) msec
Oct 21 2015 20:47:18 GMT: INFO (info): (hist.c::154)  (00: 0000298297) (01: 0000346324) (02: 0000075576) (03: 0000006737)
Oct 21 2015 20:47:18 GMT: INFO (info): (hist.c::154)  (04: 0000001671) (05: 0000000470) (06: 0000000170) (07: 0000000013)
Oct 21 2015 20:47:18 GMT: INFO (info): (hist.c::163)  (08: 0000000007) (09: 0000000009) (10: 0000000009)
Oct 21 2015 20:47:18 GMT: INFO (info): (hist.c::137) histogram dump: writes_reply (599670 total) msec
Oct 21 2015 20:47:18 GMT: INFO (info): (hist.c::154)  (00: 0000369022) (01: 0000151163) (02: 0000073779) (03: 0000004079)
Oct 21 2015 20:47:18 GMT: INFO (info): (hist.c::154)  (04: 0000001103) (05: 0000000294) (06: 0000000168) (07: 0000000045)
Oct 21 2015 20:47:18 GMT: INFO (info): (hist.c::163)  (08: 0000000016) (09: 0000000001)
Oct 21 2015 20:47:18 GMT: INFO (info): (hist.c::137) histogram dump: udf (0 total) msec
Oct 21 2015 20:47:18 GMT: INFO (info): (hist.c::137) histogram dump: query (0 total) msec
Oct 21 2015 20:47:18 GMT: INFO (info): (hist.c::137) histogram dump: query_rec_count (0 total) count
Oct 21 2015 20:47:28 GMT: INFO (info): (thr_info.c::4836)  system memory: free 14244288kb ( 93 percent free )
Oct 21 2015 20:47:28 GMT: INFO (info): (thr_info.c::4844)  migrates in progress ( 0 , 0 ) ::: ClusterSize 2 ::: objects 4226476 ::: sub_objects 0
Oct 21 2015 20:47:28 GMT: INFO (info): (thr_info.c::4852)  rec refs 4226476 ::: rec locks 0 ::: trees 0 ::: wr reqs 0 ::: mig tx 0 ::: mig rx 0
Oct 21 2015 20:47:28 GMT: INFO (info): (thr_info.c::4857)  replica errs :: null 0 non-null 0 ::: sync copy errs :: master 0
Oct 21 2015 20:47:28 GMT: INFO (info): (thr_info.c::4867)    trans_in_progress: wr 0 prox 0 wait 0 ::: q 0 ::: iq 0 ::: dq 0 : fds - proto (42, 1072954, 1072912) : hb (2, 2, 0) : fab (30, 30, 0)
Oct 21 2015 20:47:28 GMT: INFO (info): (thr_info.c::4869)    heartbeat_received: self 0 : foreign 11477848
Oct 21 2015 20:47:28 GMT: INFO (info): (thr_info.c::4870)    heartbeat_stats: bt 0 bf 10044365 nt 0 ni 0 nn 0 nnir 0 nal 0 sf1 0 sf2 0 sf3 0 sf4 0 sf5 0 sf6 0 mrf 0 eh 0 efd 0 efa 0 um 0 mcf 0 rc 0
Oct 21 2015 20:47:28 GMT: INFO (info): (thr_info.c::4882)    tree_counts: nsup 0 scan 0 dup 0 wprocess 0 migrx 0 migtx 0 ssdr 0 ssdw 0 rw 0
Oct 21 2015 20:47:28 GMT: INFO (info): (thr_info.c::4923) namespace caas: disk inuse: 44346164736 memory inuse: 270494464 (bytes) sindex memory inuse: 24544580 (bytes) avail pct 4 cache-read pct 0.00
Oct 21 2015 20:47:28 GMT: INFO (info): (thr_info.c::4943)    partitions: actual 2080 sync 2016 desync 0 zombie 0 wait 0 absent 0
Oct 21 2015 20:47:28 GMT: INFO (info): (hist.c::137) histogram dump: reads (27 total) msec
Oct 21 2015 20:47:28 GMT: INFO (info): (hist.c::163)  (00: 0000000025) (01: 0000000002)
Oct 21 2015 20:47:28 GMT: INFO (info): (hist.c::137) histogram dump: writes_master (599699 total) msec
Oct 21 2015 20:47:28 GMT: INFO (info): (hist.c::154)  (00: 0000368505) (01: 0000151331) (02: 0000074115) (03: 0000004092)
Oct 21 2015 20:47:28 GMT: INFO (info): (hist.c::154)  (04: 0000001103) (05: 0000000294) (06: 0000000168) (07: 0000000045)
Oct 21 2015 20:47:28 GMT: INFO (info): (hist.c::163)  (08: 0000000030) (09: 0000000016)
Oct 21 2015 20:47:28 GMT: INFO (info): (hist.c::137) histogram dump: proxy (729283 total) msec
Oct 21 2015 20:47:28 GMT: INFO (info): (hist.c::154)  (00: 0000298297) (01: 0000346324) (02: 0000075576) (03: 0000006737)
Oct 21 2015 20:47:28 GMT: INFO (info): (hist.c::154)  (04: 0000001671) (05: 0000000470) (06: 0000000170) (07: 0000000013)
Oct 21 2015 20:47:28 GMT: INFO (info): (hist.c::163)  (08: 0000000007) (09: 0000000009) (10: 0000000009)
Oct 21 2015 20:47:28 GMT: INFO (info): (hist.c::137) histogram dump: writes_reply (599670 total) msec
Oct 21 2015 20:47:28 GMT: INFO (info): (hist.c::154)  (00: 0000369022) (01: 0000151163) (02: 0000073779) (03: 0000004079)
Oct 21 2015 20:47:28 GMT: INFO (info): (hist.c::154)  (04: 0000001103) (05: 0000000294) (06: 0000000168) (07: 0000000045)
Oct 21 2015 20:47:28 GMT: INFO (info): (hist.c::163)  (08: 0000000016) (09: 0000000001)
Oct 21 2015 20:47:28 GMT: INFO (info): (hist.c::137) histogram dump: udf (0 total) msec
Oct 21 2015 20:47:28 GMT: INFO (info): (hist.c::137) histogram dump: query (0 total) msec
Oct 21 2015 20:47:28 GMT: INFO (info): (hist.c::137) histogram dump: query_rec_count (0 total) count
Oct 21 2015 20:47:35 GMT: INFO (drv_ssd): (drv_ssd.c::2406) device /mnt/ebs2/aerospike/data/caas.dat: used 44346164736, contig-free 2551M (2551 wblocks), swb-free 3, n-w 0, w-q 0 w-tot 51494 (0.0/s), defrag-q 0 defrag-tot 28965 (0.0/s)
Oct 21 2015 20:47:35 GMT: INFO (nsup): (thr_nsup.c::1319) {caas} nsup start
Oct 21 2015 20:47:35 GMT: INFO (namespace): (namespace.c::448) {caas} hwm_breached true (disk), stop_writes true (disk avail pct), memory sz:295039044 (270494464 + 0) hwm:6442450944 sw:9663676416, disk sz:44346164736 hwm:26843545600
Oct 21 2015 20:47:36 GMT: INFO (nsup): (thr_nsup.c::1171) {caas} evicting using ttl histogram
Oct 21 2015 20:47:36 GMT: WARNING (nsup): (thr_nsup.c::1177) {caas} can’t evict - no records eligible
Oct 21 2015 20:47:36 GMT: INFO (nsup): (thr_nsup.c::1504) {caas} evict ttls 0,0,0.000
Oct 21 2015 20:47:36 GMT: INFO (nsup): (thr_nsup.c::1237) {caas} Records: 2152317, 2152317 0-vt, 0(0) expired, 0(0) evicted, 0(0) set deletes, 0(0) set evicted. Evict ttls: 0,0,0.000. Waits: 0,0,0. Total time: 1148 ms
Oct 21 2015 20:47:38 GMT: INFO (info): (thr_info.c::4836)  system memory: free 14244216kb ( 93 percent free )
Oct 21 2015 20:47:38 GMT: INFO (info): (thr_info.c::4844)  migrates in progress ( 0 , 0 ) ::: ClusterSize 2 ::: objects 4226476 ::: sub_objects 0
Oct 21 2015 20:47:38 GMT: INFO (info): (thr_info.c::4852)  rec refs 4226476 ::: rec locks 0 ::: trees 0 ::: wr reqs 0 ::: mig tx 0 ::: mig rx 0
Oct 21 2015 20:47:38 GMT: INFO (info): (thr_info.c::4857)  replica errs :: null 0 non-null 0 ::: sync copy errs :: master 0
Oct 21 2015 20:47:38 GMT: INFO (info): (thr_info.c::4867)    trans_in_progress: wr 0 prox 0 wait 0 ::: q 0 ::: iq 0 ::: dq 0 : fds - proto (44, 1072961, 1072917) : hb (2, 2, 0) : fab (30, 30, 0)
Oct 21 2015 20:47:38 GMT: INFO (info): (thr_info.c::4869)    heartbeat_received: self 0 : foreign 11477982
Oct 21 2015 20:47:38 GMT: INFO (info): (thr_info.c::4870)    heartbeat_stats: bt 0 bf 10044493 nt 0 ni 0 nn 0 nnir 0 nal 0 sf1 0 sf2 0 sf3 0 sf4 0 sf5 0 sf6 0 mrf 0 eh 0 efd 0 efa 0 um 0 mcf 0 rc 0
Oct 21 2015 20:47:38 GMT: INFO (info): (thr_info.c::4882)    tree_counts: nsup 0 scan 0 dup 0 wprocess 0 migrx 0 migtx 0 ssdr 0 ssdw 0 rw 0
Oct 21 2015 20:47:38 GMT: INFO (info): (thr_info.c::4923) namespace caas: disk inuse: 44346164736 memory inuse: 270494464 (bytes) sindex memory inuse: 24544580 (bytes) avail pct 4 cache-read pct 0.00
Oct 21 2015 20:47:38 GMT: INFO (info): (thr_info.c::4943)    partitions: actual 2080 sync 2016 desync 0 zombie 0 wait 0 absent 0
Oct 21 2015 20:47:38 GMT: INFO (info): (hist.c::137) histogram dump: reads (27 total) msec

Any pointer will be highly appreciated.

Thanks,

Shamik


#2

Try remove the storage engine memory from your namespace caas. One namespace can only be stored either completly in memory (with optional, async persistence to a file on HDD/SSD) or on a ssd. Which do you prefer? Example recipes for both can be found at http://www.aerospike.com/docs/operations/configure/namespace/storage/

Cheers, Manuel


#3

Thanks @ManuelSchmidt

I should have commented out the following line in namespace

storage-engine memory

Couple of follow-up question if you don’t mind answering.

  1. memory-size (10G) in my case will be used for primary and secondary index. What if the index size grows beyond the allocated value ? What’s the best practice around this ?

  2. I saw hwm_breached in log

INFO (namespace): (namespace.c::448) {caas} hwm_breached true (disk), stop_writes true (disk avail pct), memory sz:295039044 (270494464 + 0) hwm:6442450944 sw:9663676416, disk sz:44346164736 hwm:26843545600

I’ve read that to address this , we need to do the following

  1. Speed up your current eviction rate by reducing the memory/disk high-water-mark.
  2. Slow down your migration speed , if migrations are happening.
  3. Increase your defragmentation priority.
  4. Speed up your defragmentation frequency.
  5. Increase the stop_writes rate on your namespace.

Where and how do I control these ? Are these part of configuration file ? Is there an example ?

Again, appreciate your help.


#4

Regarding 1.: is not supposed to happen. A node needs sufficient memory to store indices for his partition of the namespace. Using 10 nodes without replication will lead to 10% of the records on every node and each will need enough memory to index those 10%.

Regarding 2: make sure to carefully read about those settings as they might behave differently than expected. E.g. an eviction is a permanent delete of a record (they don’t even need to be timed out) for the sake of keeping the system online. Some of the configurations you mentioned are labeled ‘deprecated’, I’m not sure what the future-proof way to do it looks like. Never ran into such issues.

I’m sure some AS staff can help you out better with the second question.


#5

@ManuelSchmidt , thanks again. This was quite informative.

I’m hoping someone from AS provides information on this.


#6

I would encourage you to attend the following class: AS201 - Aerospike Administration and Operations. The class will cover the core mechanisms of Aerospike and the settings that are critical to setup and operate Aerospike clusters successfully.

Based on your logs, you seem to have not only breached the HWM on disk but also reached stop writes due to avail pct being below 5%. So you need to size and tune your cluster properly, mainly evictions rate and defragmentation speed, based on your workload and use case requirements of course.