Aerospike losing documents when node goes down

aws

#1

I’ve been doing dome tests using aerospike and I noticed a behavior different than what is sold.

I have a cluster of 4 nodes running on AWS in the same AZ, the instances are t2micro (1cpu, 1gb RAM, 25gb SSD) using the aws linux with the AMI aerospike

aerospike.conf:

heartbeat {
        mode mesh
        port 3002                        
        mesh-seed-address-port XXX.XX.XXX.164 3002
        mesh-seed-address-port XXX.XX.XXX.167 3002
        mesh-seed-address-port XXX.XX.XXX.165 3002
        #internal aws IPs
...
namespace teste2 {
        replication-factor 2
        memory-size 650M
            default-ttl 365d                                                                                                                    
    	storage-engine device {
                    file /opt/aerospike/data/bar.dat
                    filesize 22G
                        data-in-memory false                                                                     
        }
}

What I did was a test to see if I would loose documents when a node goes down. For that I wrote a little code on python:

from __future__ import print_function
import aerospike
import pandas as pd
import numpy as np
import time
import sys
config = {
  'hosts': [ ('XX.XX.XX.XX', 3000),('XX.XX.XX.XX',3000),
             ('XX.XX.XX.XX',3000), ('XX.XX.XX.XX',3000)]
} # external aws ips
client = aerospike.client(config).connect()
for i in range(1,10000):
  key = ('teste2', 'setTest3', ''.join(('p',str(i))))
  try:
    client.put(key, {'id11': i})
    print(i)
  except Exception as e:
    print("error: {0}".format(e), file=sys.stderr)
  time.sleep(1)

I used this code just for inserting a sequence of integers that I could check after that. I ran that code and after a few seconds I stopped the aerospike service at one node for 10 seconds, using sudo service aerospike stop and sudo service aerospike colstart to restart.

I waited for a few seconds until the nodes did all the migration and executed the following python script:

query = client.query('teste2', 'setTest3')
query.select('id11')
te = []
def save_result((key, metadata, record)):
    te.append(record)
query.foreach(save_result)
d = pd.DataFrame(te)
d2 = d.sort(columns='id11')
te2 = np.array(d2.id11)
for i in range(0,len(te2)):
  if i > 0:
    if (te2[i] !=  (te2[i-1]+1) ):
      print('no %d'% int(te2[i-1]+1))
print(te2)

And got as response:

no 3
no 6
no 8
no 11
no 13
no 17
no 20
no 22
no 24
no 26
no 30
no 34
no 39
no 41
no 48
no 53
[ 1  2  5  7 10 12 16 19 21 23 25 27 28 29 33 35 36 37 38 40 43 44 45 46 47 51 52 54]

Is my cluster configured wrong or this is normal?

ps: I tried to include as many things I could, if you please suggest more information to include I will appreciate.


#2

We see that you posted the same issue on Stackoverflow here and since @kporter has already begun answering it there, let’s continue on that thread.

We are always happy to help, but please refrain from double-posting on the forum and on Stackoverflow … especially on the same day. :confused:


#3

Yeah, okay, my bad. Actually the question didn’t begin to be answered.

I still don’t know why this happens and I was close to switch to aerospike, but with this problem we can’t migrate until it is solved.

Thanks anyway