Restore cluster trouble


#1

Hi guys, we use Aerospike in our projects and have hit a strange problem. We have a 3-node cluster, and after restarting nodes a few times it stops working. We set up a test to demonstrate the problem.

We built a test cluster: 3 nodes, replication factor 2.

Here is our namespace config

namespace test {
    replication-factor 2
    memory-size 100M
    high-water-memory-pct 90
    high-water-disk-pct 90
    stop-writes-pct 95
    single-bin true
    default-ttl 0
    storage-engine device {
        cold-start-empty true
        file /tmp/test.dat
        write-block-size 1M
    }
}

We wrote 100 MB of test data, after which we see the following:

Available pct is about 66% and disk usage is about 34%.

All good so far :)

Then we stopped one node. After migration, available pct = 49% and disk usage = 50%.

We returned the node to the cluster, and after migration disk usage went back to roughly its previous value, about 32%, but available pct on the old nodes stayed at 49%.

We stopped the node one more time:

available pct = 31%

After repeating this one more time, we end up in this situation:

The cluster crashed; clients get AerospikeException: Error Code 8: Server memory error

So how can we reclaim available pct?
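
As a rough sanity check on the numbers above: the jump in disk usage from about 34% to about 50% with one node down is what you would expect if the data is spread evenly. With replication factor 2, each of 3 nodes holds 2/3 of the data set, and each of 2 nodes holds all of it. A minimal Python sketch of that arithmetic, assuming a perfectly even partition distribution (the function name is made up for illustration, it is not an Aerospike API):

```python
def per_node_data_mb(total_mb: float, replication_factor: int, nodes: int) -> float:
    """Approximate data held by each node, assuming an even spread.

    Hypothetical helper for illustration only.
    """
    # A cluster cannot keep more copies than it has nodes.
    copies = min(replication_factor, nodes)
    return total_mb * copies / nodes


three_nodes = per_node_data_mb(100, replication_factor=2, nodes=3)
two_nodes = per_node_data_mb(100, replication_factor=2, nodes=2)

print(f"3 nodes: {three_nodes:.1f} MB per node")        # 66.7 MB
print(f"2 nodes: {two_nodes:.1f} MB per node")          # 100.0 MB
print(f"growth factor: {two_nodes / three_nodes:.2f}")  # 1.50
```

A growth factor of about 1.5 is in line with the observed 34% to 50%, so the disk usage itself looks normal. The part that does not recover is available pct.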


#2

It appears you cross-posted this on Stack Overflow.

Have you had a chance to follow Ben’s instructions?

If your defrag-q is empty (you can check by grepping the logs), then the issue is likely that your namespace is smaller than your post-write-queue. Blocks on the post-write-queue are not eligible for defragmentation, so you would see avail-pct trending down with no defragmentation to reclaim the space. By default the post-write-queue is 256 blocks, which with your 1M write-block-size equates to 256 MB. If your namespace is smaller than that, avail-pct will continue to drop until you hit stop-writes. You can reduce the size of the post-write-queue dynamically (i.e. no restart needed) using the following command; here I suggest 8 blocks:

asinfo -v 'set-config:context=namespace;id=;post-write-queue=8'

If you are happy with this value you should amend your aerospike.conf to include it so that it persists after a node restart.
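
To see why the default is too large here, it can help to work the sizes out. The following is a minimal Python sketch of the arithmetic only. The function names are made up for illustration, and the 200 MiB device size is an assumed figure, since the file size is not given in your config.

```python
MIB = 1024 * 1024


def post_write_queue_bytes(blocks: int, write_block_size: int) -> int:
    """Space covered by the post-write-queue: blocks times write-block-size."""
    return blocks * write_block_size


def queue_exceeds_device(device_bytes: int, blocks: int, write_block_size: int) -> bool:
    """True if the queue could cover every block on the device,
    which would leave nothing eligible for defragmentation."""
    return post_write_queue_bytes(blocks, write_block_size) >= device_bytes


# Assumption: 200 MiB storage file (hypothetical, not stated in the thread).
device = 200 * MIB
write_block = 1 * MIB  # write-block-size 1M from the posted config

for blocks in (256, 8):  # the default, then the suggested value
    size_mib = post_write_queue_bytes(blocks, write_block) // MIB
    blocked = queue_exceeds_device(device, blocks, write_block)
    print(f"post-write-queue={blocks}: {size_mib} MiB, covers whole device: {blocked}")
```

With 256 blocks the queue is as large as or larger than a device of that size, while with 8 blocks it is only 8 MiB and the rest of the device can be defragmented.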


#3

It works, thanks a lot.