But from the documentation it is not clear how to migrate to this setup from a previous bcache setup. Should we just use the same EBS volume from the bcache setup as the shadow device and let Aerospike find the data on it, or do we need to do something else to move the data?
Another question: I assume that at start time Aerospike always copies data from the shadow device to the primary device. Is that true?
Yes, we haven’t created such a document yet. I suspect it would be something along the lines of:
Note that this procedure hasn’t been verified.
For each node:
1. Stop Aerospike.
2. dd if=/dev/BCACHE_DEVICE of=/dev/EBS_DEVICE
3. Remove the bcache configuration.
4. Zeroize the local SSD: dd if=/dev/zero of=/dev/LOCAL_SSD
5. Reconfigure Aerospike to use device /dev/LOCAL_SSD /dev/EBS_DEVICE (see the example config after this list).
6. Start Aerospike; the local device will be synced from the shadow device.
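As a rough illustration of the reconfigure step, the shadow device is declared by listing the local (primary) device and the EBS (shadow) device on the same device line inside the namespace’s storage-engine block. The namespace name and device paths below are placeholders, and the rest of the namespace settings would stay as they are in your existing config:

    namespace test {
        # keep your existing memory-size, replication-factor, etc.
        storage-engine device {
            # primary (local SSD) first, shadow (EBS) device second
            device /dev/LOCAL_SSD /dev/EBS_DEVICE
        }
    }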
On warm start (Enterprise only), it only copies data from the shadow device if the local device doesn’t contain an Aerospike header; otherwise it assumes the disks are in sync. On cold start it will always copy from the shadow device.
Curious to know the reasoning behind zeroing the local disk. By not doing this, wouldn’t it save the need to copy data back to the local disk from the EBS device on startup (basically the final step above, where the local device is synced from the shadow device)?
Also, what is the expected behaviour if the local disk is neither zeroed nor copied to the EBS device? Will it start as a clean/new node, with data eventually migrated from the other nodes (assuming replication factor >= 2)?
BTW, on CE it will always copy the data from the shadow device to the local SSD. I updated that part of my response.
Does it check for the presence of an Aerospike header on the shadow device before copying it to the local device on startup? Otherwise, if a new node is added (with both the local and shadow devices blank), would the data still get copied from the shadow device to the local device? If so, it would slow down the startup process significantly.
I ran into one more issue while adding the shadow disk. In the test setup, I have a cluster of 3 nodes (Aerospike Community Edition build 3.9.0.3), with each node having a local SSD as the primary device. The cluster has one namespace with approx. 28M records. Starting with the 1st node, I added the blank persistent SSD as the shadow disk to the node and to the Aerospike configuration, and restarted Aerospike. However, the startup got stuck for hours. There is no error as such in the logs. All resource usage (CPU, disk and network) was also almost nil. After a few hours, the startup completed and the node joined the cluster again. Afterwards, I repeated the same on the other nodes and the behavior was not consistent: some nodes were able to start in around a minute as expected, whereas others took a few hours. I also tried adding a fresh node with a new blank shadow disk to the existing cluster and its startup also took a few hours.
Some of the nodes may (in the past) have used more blocks of storage - this will make startup take a bit longer. Also, it may be that these nodes had to do eviction during the startup process, which can make startup take much longer.
For simple restarts like this, Aerospike Enterprise has ‘Fast Restart’, which reduces multi-hour cold starts to a few seconds.
I doubt that eviction or used blocks could be the reason, as it happened on a fresh node as well (blank local SSD + blank shadow device). I even enabled the logs in debug mode but didn’t find anything unusual. Is there any way to know what the Aerospike process is doing at any moment, something like a thread dump for JVMs?
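Not Aerospike-specific, but a rough equivalent of a JVM thread dump for a native daemon like asd is to snapshot the stacks of all of its threads with standard Linux tools. The commands below are only a sketch; they assume gdb is installed, that pgrep asd matches a single Aerospike process, and that they are run as root:

    # kernel-side stack of every asd thread
    for tid in /proc/$(pgrep asd)/task/*; do echo "== $tid"; cat $tid/stack; done

    # user-space backtraces of all threads (briefly pauses the process)
    gdb -p $(pgrep asd) -batch -ex 'thread apply all bt'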
I had also attached the logs in the last comment. It would be great if you could take a look.
Can’t determine anything from the logs. Paxos was starting, which occurs after the drives load. This is a really old version of Aerospike, and these code paths have changed quite a bit. The issue seems to have resolved itself for now; I’d be happy to look further if it occurs in our latest builds.