Cluster configuration


#1

I have 3 nodes and 1 namespace with replication disabled. One node has less RAM than the other two, and because of this it stopped accepting writes. Instead of the data being written to one of the other two available nodes, it was lost. Is this the intended behavior?

3 hosts in cluster: 10.254.0.222:3000,10.254.0.225:3000,10.254.0.228:3000
-v "namespace/links"

10.254.0.222:3000 returned :
type=device objects=2194052931 master-objects=2194052931 prole-objects=0 expired-objects=838932030 evicted-objects=0 set-deleted-objects=0 set-evicted-objects=0 used-bytes-memory=140419387584 data-used-bytes-memory=0 index-used-bytes-memory=140419387584 sindex-used-bytes-memory=0 free-pct-memory=31 max-void-time=154413271 non-expirable-objects=0 current-time=151865394 stop-writes=false hwm-breached=false available-bin-names=32766 ldt_reads=0 ldt_read_success=0 ldt_deletes=0 ldt_delete_success=0 ldt_writes=0 ldt_write_success=0 ldt_updates=0 ldt_errors=0 used-bytes-disk=561677550592 free-pct-disk=47 available_pct=29 cache-read-pct=0 sets-enable-xdr=true memory-size=206158430208 high-water-disk-pct=90 high-water-memory-pct=90 evict-tenths-pct=5 stop-writes-pct=90 cold-start-evict-ttl=4294967295 repl-factor=1 default-ttl=2592000 max-ttl=0 conflict-resolution-policy=generation allow_versions=false single-bin=false enable-xdr=false disallow-null-setname=false total-bytes-memory=206158430208 total-bytes-disk=1073741824000 defrag-lwm-pct=50 defrag-sleep=1000 defrag-startup-minimum=10 write-smoothing-period=0 max-write-cache=67108864 min-avail-pct=5 post-write-queue=256 data-in-memory=false dev=/dev/sdb dev=/dev/sdc filesize=17179869184 writethreads=1 writecache=67108864 obj-size-hist-max=100

10.254.0.225:3000 returned :
type=device objects=2204198148 master-objects=2204198148 prole-objects=0 expired-objects=843113614 evicted-objects=10949766 set-deleted-objects=0 set-evicted-objects=0 used-bytes-memory=141068681472 data-used-bytes-memory=0 index-used-bytes-memory=141068681472 sindex-used-bytes-memory=0 free-pct-memory=28 max-void-time=154413161 non-expirable-objects=0 current-time=151865325 stop-writes=false hwm-breached=false available-bin-names=32766 ldt_reads=0 ldt_read_success=0 ldt_deletes=0 ldt_delete_success=0 ldt_writes=0 ldt_write_success=0 ldt_updates=0 ldt_errors=0 used-bytes-disk=564274725888 free-pct-disk=47 available_pct=5 cache-read-pct=0 sets-enable-xdr=true memory-size=198642237440 high-water-disk-pct=95 high-water-memory-pct=95 evict-tenths-pct=5 stop-writes-pct=90 cold-start-evict-ttl=4294967295 repl-factor=1 default-ttl=2592000 max-ttl=0 conflict-resolution-policy=generation allow_versions=false single-bin=false enable-xdr=false disallow-null-setname=false total-bytes-memory=198642237440 total-bytes-disk=1073741824000 defrag-lwm-pct=90 defrag-sleep=1000 defrag-startup-minimum=10 write-smoothing-period=0 max-write-cache=67108864 min-avail-pct=5 post-write-queue=256 data-in-memory=false dev=/dev/sdb dev=/dev/sdc dev=/dev/sdd filesize=17179869184 writethreads=1 writecache=67108864 obj-size-hist-max=100

10.254.0.228:3000 returned :
type=device objects=2093795097 master-objects=2093795097 prole-objects=0 expired-objects=800500445 evicted-objects=0 set-deleted-objects=0 set-evicted-objects=0 used-bytes-memory=134002886208 data-used-bytes-memory=0 index-used-bytes-memory=134002886208 sindex-used-bytes-memory=0 free-pct-memory=50 max-void-time=154413143 non-expirable-objects=0 current-time=151865259 stop-writes=false hwm-breached=false available-bin-names=32766 ldt_reads=0 ldt_read_success=0 ldt_deletes=0 ldt_delete_success=0 ldt_writes=0 ldt_write_success=0 ldt_updates=0 ldt_errors=0 used-bytes-disk=536011544832 free-pct-disk=50 available_pct=31 cache-read-pct=0 sets-enable-xdr=true memory-size=268435456000 high-water-disk-pct=90 high-water-memory-pct=90 evict-tenths-pct=5 stop-writes-pct=90 cold-start-evict-ttl=4294967295 repl-factor=1 default-ttl=2592000 max-ttl=0 conflict-resolution-policy=generation allow_versions=false single-bin=false enable-xdr=false disallow-null-setname=false total-bytes-memory=268435456000 total-bytes-disk=1073741824000 defrag-lwm-pct=50 defrag-sleep=1000 defrag-startup-minimum=10 flush-max-ms=1000 fsync-max-sec=0 write-smoothing-period=0 max-write-cache=67108864 min-avail-pct=5 post-write-queue=256 data-in-memory=false dev=/dev/sdb dev=/dev/sdc filesize=17179869184 writethreads=1 writecache=67108864 obj-size-hist-max=100


#2

Looking at your config, available_pct on node 10.254.0.225:3000 is 5%, which is very low. On top of that, I noticed you're not evicting data, because your high-water-memory-pct is set to 95 while your stop-writes-pct is at the default of 90.

The purpose of high-water-memory-pct is to start evicting data once memory utilization exceeds the specified percentage, which helps maintain your available_pct. With high-water-memory-pct at 95 and stop-writes-pct at the default of 90, stop-writes triggers before the high-water mark is ever reached, so no eviction takes place. We recommend lowering high-water-memory-pct back down to 60, so that the node starts evicting data, recovers its available_pct, and gets your cluster going again. It might also help to speed up defragmentation by setting defrag-lwm-pct to 60 and defrag-period to 1.
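To make the threshold interaction concrete, here is a minimal Python sketch that parses a stats line like the ones posted in #1 and flags the two conditions discussed here. The function names and the shortened sample string are illustrative only and are not part of any Aerospike tool. The comparison of available_pct against min-avail-pct is my assumption about how that limit relates to stop-writes.

```python
def parse_stats(line):
    """Turn 'k1=v1 k2=v2 ...' into a dict (repeated keys keep the last value)."""
    stats = {}
    for token in line.split():
        key, sep, value = token.partition("=")
        if sep:  # skip tokens that are not key=value pairs
            stats[key] = value
    return stats


def threshold_warnings(stats):
    """Return human-readable warnings for risky threshold settings."""
    warnings = []
    hwm = int(stats["high-water-memory-pct"])
    stop = int(stats["stop-writes-pct"])
    if hwm >= stop:
        warnings.append(
            f"high-water-memory-pct ({hwm}) >= stop-writes-pct ({stop}): "
            "writes stop before eviction can start"
        )
    # Assumption: a node at or below min-avail-pct is at risk of stop-writes.
    avail = int(stats["available_pct"])
    min_avail = int(stats["min-avail-pct"])
    if avail <= min_avail:
        warnings.append(
            f"available_pct ({avail}) is at or below min-avail-pct ({min_avail})"
        )
    return warnings


# Shortened sample using the values reported by 10.254.0.225:3000.
sample = "available_pct=5 high-water-memory-pct=95 stop-writes-pct=90 min-avail-pct=5"
for warning in threshold_warnings(parse_stats(sample)):
    print(warning)
```

Run against the values from the other two nodes (high-water-memory-pct=90, available_pct around 30), only the first warning would appear, since 90 is still not below the stop-writes threshold.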

Also, a note on the way Aerospike handles the data. In Aerospike Database, having no replicated data is referred to as replication factor = 1, meaning that there’s a single copy of the database.

Since there are 4096 partitions in total and three nodes in the cluster, each node holds about 1/3 of the data, roughly 1365 partitions, assigned at random.

Each node is the data master for 1/3 of the data partitions – a node is a data master if it is the primary source for reads and writes to that data.

So when a node reaches stop-writes, it cannot write to the partitions it owns. Furthermore, since the other two nodes do not hold those particular partitions, the writes cannot go to them either.
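The ownership model above can be sketched in a few lines of Python. This is a simplified illustration, not Aerospike's actual algorithm: the SHA-1 hash, the round-robin assignment, and the names partition_id, assign_masters and route_write are all stand-ins I made up for the example. It shows only that with replication factor 1 every key maps to exactly one node, so a write to a stopped node has nowhere else to go.

```python
import hashlib

N_PARTITIONS = 4096


def partition_id(key: bytes) -> int:
    # Stand-in hash for illustration; the real digest function may differ.
    digest = hashlib.sha1(key).digest()
    return int.from_bytes(digest[:2], "little") % N_PARTITIONS


def assign_masters(nodes):
    # Hypothetical round-robin assignment; the real cluster assigns at random.
    return {pid: nodes[pid % len(nodes)] for pid in range(N_PARTITIONS)}


def route_write(key: bytes, masters, stopped):
    """Return the owning node, or fail if that node is in stop-writes."""
    owner = masters[partition_id(key)]
    if owner in stopped:
        # With replication factor 1 no other node holds this partition.
        raise RuntimeError(f"write rejected: {owner} is in stop-writes")
    return owner


nodes = ["10.254.0.222", "10.254.0.225", "10.254.0.228"]
masters = assign_masters(nodes)

counts = {node: 0 for node in nodes}
for owner in masters.values():
    counts[owner] += 1
print(counts)  # {'10.254.0.222': 1366, '10.254.0.225': 1365, '10.254.0.228': 1365}

try:
    print(route_write(b"some-key", masters, stopped={"10.254.0.225"}))
except RuntimeError as err:
    print(err)
```

Which of the two outcomes the last call produces depends on where the stand-in hash places the key, which mirrors the real situation: about a third of the keys are rejected while the rest are written normally.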

To answer your question of whether this is intended behavior: yes. Also remember that Aerospike did not lose your data, since the data was never successfully written into the database.

More info on Data Distribution
More info on Stop Writes and Data management
Aerospike available percent defined

Hope this helps you out.

Jerry


#3

As Jerry mentioned, Aerospike hashes record keys into 4096 partitions which are then distributed across all the nodes, and, for a typical replication factor 2 config, each partition will be assigned a master node and a replica node.

When writing or updating a record, for consistency reasons, the record in the master partition has to be processed first (or, if writing a new record, the master partition assigned to the record’s key hash will be accessed).

If the node holding that master partition is not accepting any writes anymore, the write/update will fail and an error will be sent back to the client so that it can be retried later.
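On the client side, "retried later" could look something like the following Python sketch. It deliberately does not use the real Aerospike client: WriteRejected, write_with_retry and flaky_write are hypothetical names for illustration, and the exponential backoff is just one reasonable choice, not a recommendation from the server documentation.

```python
import time


class WriteRejected(Exception):
    """Hypothetical error standing in for a 'node is not accepting writes' reply."""


def write_with_retry(write_fn, attempts=5, base_delay=0.1, sleep=time.sleep):
    """Call write_fn until it succeeds, backing off exponentially between tries."""
    for attempt in range(attempts):
        try:
            return write_fn()
        except WriteRejected:
            if attempt == attempts - 1:
                raise  # give up and surface the error to the caller
            sleep(base_delay * 2 ** attempt)


# Demo: a fake write that is rejected twice and then succeeds.
calls = {"count": 0}


def flaky_write():
    calls["count"] += 1
    if calls["count"] < 3:
        raise WriteRejected("node in stop-writes")
    return "ok"


delays = []
result = write_with_retry(flaky_write, sleep=delays.append)  # record delays instead of sleeping
print(result, calls["count"], delays)
```

Retrying only helps if the node leaves stop-writes in the meantime (for example after eviction or defragmentation frees space), so the application still needs to decide what to do when the final attempt fails.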

Records assigned to the other 2 nodes will still be processed fine (even if the replica partition is on the node that is not accepting writes, as there is some extra room for such writes).

This is different from losing a node altogether, where the partitions are redistributed across the remaining nodes (migrations) and writes still proceed (assuming the remaining nodes have enough storage/RAM to hold all of the data, of course).

The other solution in this case is to follow Jerry’s recommendation and lower the disk/memory high-water mark to force eviction before reaching the stop-writes level and free up some space.


#4

Hi, Thank you both for your helpful explanations. I will just throw more hardware at it as the amount of records I need to store has grown beyond my initial design.