SSD based data not flowing

Anonymous · June 1, 2019, 2:21am

I mistakenly cleared the /opt/aerospike/data/ directory. What should I do so that data starts flowing to /opt/aerospike/data/ again? My configuration is below.

namespace test {
    replication-factor 2
    memory-size 40G
    storage-engine device {
        file /opt/aerospike/data/test.dat
        write-block-size 128K
        post-write-queue 1024
        filesize 300G
    }

    stop-writes-pct 90
    high-water-memory-pct 80
    high-water-disk-pct 80
}

Albot · June 1, 2019, 2:44am

You should just be able to mkdir -p /opt/aerospike/data/, restart aerospike, and data should start going there again. Did you have a more specific problem that I’m missing?

Anonymous · June 2, 2019, 5:24am

Thanks for the information. I just wanted to understand the side effects, if any, of deleting the directory on a production system.

kporter · June 2, 2019, 8:47am

Is the cluster larger than one node, and did this happen on only one node? If the cluster is larger than one node and only one node had the directory deleted, then you should be fine with @Albot’s advice. Otherwise, data may have been irreversibly lost (if there is any hope of recovery, you need to ensure no further disk activity occurs on the device, for example by remounting the device ‘read only’). Let us know which situation applies and we can discuss further steps if applicable.

Anonymous · June 2, 2019, 9:23am

The cluster has 3 nodes and this has happened on all 3 nodes. The data is recoverable. One weird thing is that although the data is deleted, df -h still shows 168 GB used on /opt:

Filesystem                                              Size  Used Avail Use% Mounted on
rootfs                                                  9.9G  1.5G  8.0G  16% /
udev                                                     10M     0   10M   0% /dev
tmpfs                                                   5.3G  228K  5.3G   1% /run
/dev/disk/by-uuid/b1c0fa79-870e-4709-8953-de55a59634b6  9.9G  1.5G  8.0G  16% /
tmpfs                                                   5.0M     0  5.0M   0% /run/lock
tmpfs                                                    11G     0   11G   0% /run/shm
/dev/vdb                                                315G  168G  132G  56% /opt

Albot · June 2, 2019, 7:47pm

That’s probably because the file handle is still open, so the filesystem can’t actually free the space. If you stop the daemon, it should clear. Not sure how Aerospike will deal with that. If it doesn’t clear, just run lsof /opt and see what file handles are hanging around.
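
The behavior described here can be reproduced outside Aerospike. The following is a minimal sketch in Python, assuming a Linux or other POSIX filesystem. It uses a throwaway temporary file (the function name and the 4096-byte size are made up for illustration) to show that deleting a file which a process still holds open removes only the directory entry. The data stays readable through the open descriptor, and the space is released only when that descriptor is closed.

```python
import os
import tempfile


def unlink_while_open(size_bytes=4096):
    """Delete a file while it is still open and report what the open handle still sees."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"x" * size_bytes)
        os.unlink(path)              # same effect as running rm on the path
        info = os.fstat(fd)          # the inode is still reachable via the descriptor
        os.lseek(fd, 0, os.SEEK_SET)
        data = os.read(fd, size_bytes)
        return {
            "path_exists": os.path.exists(path),
            "link_count": info.st_nlink,
            "size": info.st_size,
            "readable_bytes": len(data),
        }
    finally:
        # Closing the last descriptor is what finally lets the filesystem free the blocks.
        os.close(fd)


if __name__ == "__main__":
    print(unlink_while_open())
```

This is why df can keep reporting used space after the directory has been cleared: the running daemon plays the role of the open descriptor in the sketch.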

kporter · June 2, 2019, 11:23pm

You may want to take a backup of the data. If multiple nodes were to shut down before the current situation is resolved, you may end up in an unrecoverable situation.

To resolve this issue, recreate the directory and restart one node at a time, waiting for migrations after each restart.
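
The one-node-at-a-time procedure can be sketched as a small loop. This is only an illustration in Python, not an Aerospike tool: restart_node and migrations_complete are hypothetical callables that the operator would have to supply (for example, wrapping the service manager and whatever migration statistics the cluster exposes), and the polling interval is an arbitrary assumption.

```python
import time


def rolling_restart(nodes, restart_node, migrations_complete,
                    poll_seconds=10, sleep=time.sleep):
    """Restart nodes strictly one at a time, waiting for migrations after each.

    restart_node(node) and migrations_complete() are hypothetical hooks
    supplied by the caller; nothing here talks to Aerospike directly.
    """
    restarted = []
    for node in nodes:
        restart_node(node)
        # Do not move on to the next node until the cluster reports that
        # migrations have finished; otherwise both copies of a record
        # could be lost when replication-factor is 2.
        while not migrations_complete():
            sleep(poll_seconds)
        restarted.append(node)
    return restarted
```

In practice the migrations_complete hook should also confirm that the restarted node has rejoined the cluster, since a check made before migrations begin could report completion too early.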

Anonymous · June 3, 2019, 12:11am

I ran lsof /opt and could see the files there. Is there any way to recover those files?

Albot · June 3, 2019, 12:24am

I found an article online that suggests it’s possible, https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-c00833030 , but you would still need to restart the daemon to replace the file. Why not allow migrations to occur?

Anonymous · June 3, 2019, 12:40am

Do you mean that restarting would recover the deleted files held by the process?

Albot · June 3, 2019, 3:08am

Since you are using replication factor=2, that means that every record is stored twice. There is a master record, and a replica record. If you restart this node, it will come up empty - and then the other 2 nodes will re-replicate data over to it. Once migrations are finished, you can do this to the next node - and so on, as described by @kporter. And he pointed out it might be good to grab a backup.

kporter · June 3, 2019, 4:34am

That’s an interesting idea, but I wouldn’t recommend using that particular method since it creates a copy of a file that is actively being written to. You will probably end up with a few corrupted records.

Instead, maybe you could look into creating a new hardlink to the inode. That should prevent the file from disappearing when Aerospike shuts down.

Albot · June 3, 2019, 4:35am

Definitely sounds like a recipe for corruption haha. Seems fun to try, or at least know about though.

kporter · June 3, 2019, 5:06am

I’m unable to find any tool to create a hardlink to an inode without first unmounting the underlying device. I found this conversation discussing the topic on Unix & Linux Stack Exchange: “How to create a hard link to an inode (ext4)?”

I recommend taking advantage of the fact that your data is replicated and following the restart/wait for migrations procedure I mentioned before.

Also, I’m curious to know how this mishap occurred. Did you mistakenly run a preproduction or QA deployment script in production? If so, you wouldn’t be the first. Consider adding an environment variable to identify the environment, and have scripts check that variable and fail if they are run in the wrong environment.
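
A guard of that kind might look like the following. This is a minimal sketch in Python under stated assumptions: the variable name DEPLOY_ENV, the function name, and the exception type are all hypothetical, chosen only for illustration.

```python
import os


class WrongEnvironmentError(RuntimeError):
    """Raised when a script is run in an environment it was not written for."""


def require_environment(expected, env=None, var_name="DEPLOY_ENV"):
    """Fail fast unless the environment marker matches what the script expects.

    DEPLOY_ENV is a made-up variable name; use whatever marker your hosts set.
    """
    env = os.environ if env is None else env
    actual = env.get(var_name)
    if actual != expected:
        # A missing marker is treated as a mismatch, so an unlabeled
        # host is never assumed to be safe.
        raise WrongEnvironmentError(
            f"{var_name} is {actual!r}, but this script expects {expected!r}"
        )
    return actual
```

A QA cleanup script would call require_environment("qa") as its first step, before touching any data directory, so that running it on a production host stops with an error instead of deleting files.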

Anonymous · June 3, 2019, 3:19pm

We added one more node to the existing cluster (which had 3 nodes). Once migrations were complete on all 4 nodes, we restarted the nodes one by one, waiting for migrations to complete after each restart. The cluster is stable now. Thanks for your prompt reply and support.

system · June 9, 2019, 3:19pm

This topic was automatically closed 6 days after the last reply. New replies are no longer allowed.
