Drive failure... now what?

caffeinejolt · September 27, 2017, 11:19pm

Running: aerospike-server-community-3.12.0-1.el6.x86_64

I have a three-node cluster, and one node had a drive failure (SSD). The data on it is completely lost, but that's not a big deal since it was just a cache. Now I need to put another drive in and get that asd instance back into the cluster. Please keep in mind that I am not an experienced Aerospike user/admin; I know just enough to get what I need working. My config is below, and I am using multicast for heartbeat. Based on what I have read, all I need to do is prep the new SSD, replace the failed drive, fire asd back up, and presto, everything should be cool.

It can’t be that easy. What am I missing? Any advice on checks to run after getting things back up and running?

# Aerospike database configuration file.

service {
	user aerospike
	group aerospike
	paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
	pidfile /var/run/aerospike/asd.pid
	service-threads 8
	transaction-queues 8
	transaction-threads-per-queue 8
	proto-fd-max 15000
}

logging {
	# Log file must be an absolute path.
	file /var/log/aerospike/aerospike.log {
		context any info
	}
}

network {
	service {
		address any
		port 3000
	}

	heartbeat {
		mode multicast
		multicast-group 239.1.99.222
		port 9918

		# To use unicast-mesh heartbeats, remove the 3 lines above, and see
		# aerospike_mesh.conf for alternative.

		interval 150
		timeout 10
	}

	fabric {
		port 3001
	}

	info {
		port 3003
	}
}

namespace tmcache {
	replication-factor 1
	memory-size 24G # around 250 million records
	default-ttl 30d
	high-water-disk-pct 80
	high-water-memory-pct 85
	storage-engine device {
		device /dev/sdc
		scheduler-mode noop
		## WARNING: you can raise, but cannot lower without zeroing disk
		write-block-size 256K
	}
}

Albot · September 28, 2017, 1:39am

Should be that simple. Make sure you write 0's across the drive so you're not introducing junk data into the cluster, though: 'dd if=/dev/zero of=/dev/sdc bs=1M' or 'blkdiscard /dev/sdc'. Just make sure it's the same disk speed/size so there are no surprises.
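
The steps above, plus a basic check after the restart, could be sketched roughly as follows. This is a minimal sketch, not a tested procedure: the helper names (run, wipe_device, stat_value), the DRY_RUN switch, and the EXPECTED_NODES variable are made up for illustration, and it assumes the replacement SSD shows up as /dev/sdc again, that asd is started with the stock init script, and that the asinfo tool from the Aerospike tools package is installed. By default it only prints the destructive command instead of running it.

```shell
#!/bin/sh
# Sketch only: prints the destructive command unless DRY_RUN=0 is set.

DEVICE="${DEVICE:-/dev/sdc}"           # device path from the posted namespace config (assumed unchanged)
EXPECTED_NODES="${EXPECTED_NODES:-3}"  # three-node cluster, per the original post
DRY_RUN="${DRY_RUN:-1}"                # hypothetical safety switch, 1 = print only

# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Zero the whole device. dd stops with "No space left on device" once it
# reaches the end of the disk, which is expected here.
wipe_device() {
    run dd if=/dev/zero of="$1" bs=1M
}

# Extract one value from semicolon-separated key=value info output on stdin.
stat_value() {
    tr ';' '\n' | awk -F= -v key="$1" '$1 == key { print $2 }'
}

# Intended sequence on the repaired node (left commented out on purpose):
#   DRY_RUN=0 wipe_device "$DEVICE"
#   service aerospike start
#   size="$(asinfo -v statistics | stat_value cluster_size)"
#   [ "$size" = "$EXPECTED_NODES" ] && echo "cluster is back to $size nodes"
```

Double-check the device name (for example with lsblk) before setting DRY_RUN=0, since the wipe is irreversible. Watching /var/log/aerospike/aerospike.log on the restarted node is another simple way to see whether it rejoined the other two.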
