SSD-based data not flowing

I mistakenly cleared the /opt/aerospike/data/ directory. What should be done so that data starts flowing to /opt/aerospike/data/ again? Below is the relevant configuration:

namespace test {
    replication-factor 2
    memory-size 40G
    storage-engine device {
        file /opt/aerospike/data/test.dat
        write-block-size 128K
        post-write-queue 1024
        filesize 300G
    }

    stop-writes-pct 90
    high-water-memory-pct 80
    high-water-disk-pct 80
}

You should just be able to mkdir -p /opt/aerospike/data/, restart aerospike, and data should start going there again. Did you have a more specific problem that I’m missing?
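A minimal sketch of that step (`restore_data_dir` is a helper name made up for this example; the daemon-user and service-name assumptions may differ on your system):

```shell
# restore_data_dir: recreate a wiped Aerospike data directory so that asd can
# recreate its storage file (test.dat in the config above) on the next start.
restore_data_dir() {
  local dir="${1:-/opt/aerospike/data}"
  mkdir -p "$dir"
  # Make sure the daemon user can write here; ignore failures when the
  # 'aerospike' user does not exist on this box (e.g. asd runs as root).
  chown aerospike:aerospike "$dir" 2>/dev/null || true
}

# Typical use, then restart the daemon, e.g.:
#   restore_data_dir && sudo service aerospike restart
```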

Thanks for the information. I just wanted to understand the side effects, if any, of deleting the directory on a production system.

Is the cluster larger than one node, and did this happen on only one node? If the cluster is larger than one node and only one node had the directory deleted, then you should be fine with @Albot’s advice; otherwise data may have been irreversibly lost (if there is any hope of recovery, you need to ensure no further disk activity occurs on the device, for example by remounting it read-only). Let us know which situation applies and we can discuss further steps if applicable.


The cluster has 3 nodes and this happened on all 3 nodes. The data is recoverable. One weird thing: although the data is deleted, df -h still shows 168 GB used on the /opt mount.

Filesystem                                              Size  Used Avail Use% Mounted on
rootfs                                                  9.9G  1.5G  8.0G  16% /
udev                                                     10M     0   10M   0% /dev
tmpfs                                                   5.3G  228K  5.3G   1% /run
/dev/disk/by-uuid/b1c0fa79-870e-4709-8953-de55a59634b6  9.9G  1.5G  8.0G  16% /
tmpfs                                                   5.0M     0  5.0M   0% /run/lock
tmpfs                                                    11G     0   11G   0% /run/shm
/dev/vdb                                                315G  168G  132G  56% /opt

That’s probably because the file handle is still open, so the space can’t really be reclaimed. If you stop the daemon, it should clear. Not sure how Aerospike will deal with that. If it doesn’t clear, just run lsof /opt and see which file handles are hanging around.

You may want to take a backup of the data. If multiple nodes were to shut down before resolving the current situation, you could end up in an unrecoverable state.
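For the backup, asbackup (from the Aerospike tools package) can dump a namespace to a directory. A sketch that composes the command for review before running it (the wrapper name and destination path are chosen for this example):

```shell
# backup_namespace: compose and echo the asbackup invocation so it can be
# reviewed before it is actually run (asbackup ships with Aerospike tools).
backup_namespace() {
  local ns="$1" dest="$2"
  echo asbackup --namespace "$ns" --directory "$dest"
}

# Review the printed command, then run it on one node, e.g.:
#   backup_namespace test /opt/backup/test-$(date +%F)
```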

To resolve this issue, recreate the directory and restart one node at a time, waiting for migrations to complete after each restart.
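To know when it is safe to move on to the next node, the migration stats can be polled with asinfo. A sketch, assuming the `migrate_partitions_remaining` statistic name (older server versions expose `migrate_progress_send`/`migrate_progress_recv` instead):

```shell
# migrations_remaining: pull the migrate_partitions_remaining value out of an
# Aerospike statistics string (semicolon-separated key=value pairs, as
# returned by `asinfo -v statistics`).
migrations_remaining() {
  tr ';' '\n' | awk -F= '$1 == "migrate_partitions_remaining" { print $2 }'
}

# Poll until this reports 0, then restart the next node:
#   asinfo -v statistics | migrations_remaining
```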


Ran lsof /opt and I could see the files there. Is there any way to recover those files?

I found an article online that suggests it’s possible, but you would still need to restart the daemon to replace the file. Why not allow migrations to occur?
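For reference, the usual trick those articles describe is copying the still-open file out of /proc. A sketch (the PID and file-descriptor number come from the lsof output; the helper name is made up for this example, and a copy of a file that asd is still actively writing may be inconsistent):

```shell
# recover_open_file: copy a deleted-but-still-open file out of /proc.
# pid and fd are the PID and FD columns reported by `lsof /opt`.
recover_open_file() {
  local pid="$1" fd="$2" dest="$3"
  cp "/proc/$pid/fd/$fd" "$dest"
}

# e.g. if lsof shows asd (pid 1234) holding the deleted test.dat on fd 52:
#   recover_open_file 1234 52 /root/test.dat.recovered
```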


You mean to say that restarting would recover the deleted files held by the process?

Since you are using replication factor=2, that means that every record is stored twice. There is a master record, and a replica record. If you restart this node, it will come up empty - and then the other 2 nodes will re-replicate data over to it. Once migrations are finished, you can do this to the next node - and so on, as described by @kporter. And he pointed out it might be good to grab a backup.


That’s an interesting idea, but I wouldn’t recommend using that particular method since it creates a copy of a file that is actively being written to. You will probably end up with a few corrupted records.

Instead, maybe you could look into creating a new hardlink to the inode. That should prevent the file from disappearing when Aerospike shuts down.

Definitely sounds like a recipe for corruption haha. Seems fun to try, or at least know about though.

I’m unable to find any tool to create a hardlink to an inode without first unmounting the underlying device. I found this conversation discussing this topic:

I recommend taking advantage of the fact that your data is replicated and doing the restart/wait-for-migrations procedure I mentioned before.

Also, I’m curious to know how this mishap occurred. Did you mistakenly run a preproduction or QA deployment script in production? If so, you wouldn’t be the first; consider adding an environment variable that identifies the environment, and have scripts check that variable and fail if they are running in the wrong environment.
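The environment-variable guard could look something like this (DEPLOY_ENV and require_env are names made up for this sketch, not anything standard):

```shell
# require_env: refuse to proceed unless DEPLOY_ENV matches the environment
# the script was written for, so a QA cleanup script aborts in production.
require_env() {
  local expected="$1"
  if [ "${DEPLOY_ENV:-}" != "$expected" ]; then
    echo "refusing to run: DEPLOY_ENV='${DEPLOY_ENV:-unset}', expected '$expected'" >&2
    return 1
  fi
}

# At the top of a QA cleanup script:
#   require_env qa || exit 1
#   rm -rf /opt/aerospike/data/*
```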

We added one more node to the existing 3-node cluster. Once migrations were complete on all 4 nodes, we restarted the nodes one by one, waiting for migrations to complete each time. The cluster is stable now. Thanks for your prompt reply and support.

This topic was automatically closed 6 days after the last reply. New replies are no longer allowed.