i3 instance - non-root - nvme0n1: Permission denied - asd unable to start

aws

#1

We’ve recently spun up some instances in AWS, specifically i3.4xlarge instances from the Aerospike AMI (ami-da6c3da0). Our aerospike.cfg uses a single namespace with device access to nvme0n1 and nvme1n1. Attempting to start asd results in the error below:

Feb 02 2018 20:35:44 GMT: WARNING (drv_ssd): (drv_ssd.c:3499) unable to open device /dev/nvme0n1: Permission denied

Additionally, we have followed the instructions for running as non-root: both adding the aerospike user to the disk group and applying the udev tweak result in the same error.

Is it possible we are doing it wrong or missing something specific?

service {
	user root
	group root
	paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
	pidfile /var/run/aerospike/asd.pid
	proto-fd-max 30000
	migrate-threads 8
	migrate-max-num-incoming 16
	nsup-period 60
}
logging {
	# Log file must be an absolute path.
	file /var/log/aerospike/aerospike.log {
		context any info
	}
}
network {
	service {
		address any
		port 3000
	}
	heartbeat {
		mode mesh
		port 3002 # Heartbeat port for this node.

		# List one or more other nodes, one ip-address & port per line:
		mesh-seed-address-port 172.20.0.41  3002
		mesh-seed-address-port 172.20.0.49  3002
		mesh-seed-address-port 172.20.0.90  3002
		mesh-seed-address-port 172.20.0.183 3002
		mesh-seed-address-port 172.20.0.253 3002
		mesh-seed-address-port 172.20.1.98  3002
		mesh-seed-address-port 172.20.1.198 3002
		mesh-seed-address-port 172.20.1.213 3002
		mesh-seed-address-port 172.20.1.217 3002

		interval 5000
		timeout 10
	}
	fabric {
		port 3001
	}
	info {
		port 3003
	}
}
namespace a {
	replication-factor 2
	memory-size 120G
	default-ttl 180d

	high-water-disk-pct 50
	high-water-memory-pct 90
	stop-writes-pct 95

	migrate-sleep 0
	disable-write-dup-res true
	evict-hist-buckets 10000000
	evict-tenths-pct 10

	storage-engine device {
		device /dev/nvme0n1
		device /dev/nvme1n1
		write-block-size 128K
	}
}
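Note that the service block above sets `user root` / `group root`, while the log in #7 shows `user aerospike` / `group aerospike`. The user named there is the one asd is configured to run as, so it may be worth confirming what each node's config actually says and which devices it lists. A minimal sketch, assuming the simple one-setting-per-line layout used above (it is not a full aerospike.cfg parser); `cfg_run_as` and `cfg_devices` are hypothetical helper names:

```shell
# Hypothetical helpers: a naive reader for the config layout above
# (one setting per line, "#" comments); not a full aerospike.cfg parser.

# Print the "user ..." and "group ..." lines of the top-level service block.
cfg_run_as() (
    awk '
        { sub(/#.*/, "") }
        depth == 0 && $1 == "service" && $2 == "{" { in_svc = 1 }
        in_svc && depth == 1 && ($1 == "user" || $1 == "group") { print $1, $2 }
        { depth += gsub(/[{]/, "{"); depth -= gsub(/[}]/, "}") }
        depth == 0 { in_svc = 0 }
    ' "$1"
)

# Print every storage device path, one per line.
cfg_devices() (
    awk '{ sub(/#.*/, "") } $1 == "device" { print $2 }' "$1"
)
```

For example, `cfg_run_as /etc/aerospike/aerospike.cfg` on the config above would print `user root` and `group root`. The nested `service` block inside `network` is skipped because only a top-level `service` block (brace depth 0) is matched.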

#2

Did you initialize the ephemeral drives before you started the daemon the first time?


#3

Yes and no: the documentation states that “Newly provisioned blank EBS volumes and all Ephemeral disks are already zeroed.” These hosts are all new instances with fresh ephemeral disks.


#5

Huh. Your mesh seed list shows other nodes in the cluster. Has this been an issue for those instances as well?


#6

Yes, all 9 throw the same error when we try to start the service. That is with both the group modification and the udev tweak in place.
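One way to confirm on each node that both changes actually took effect is to inspect the device node's owner, group and mode, and the aerospike user's group list. A minimal sketch, assuming GNU coreutils `stat` and a POSIX shell; `device_perms` and `user_in_group` are hypothetical helper names:

```shell
# Hypothetical helpers for checking that the non-root setup took effect.
# Assumes GNU coreutils `stat -c` and POSIX `id`.

# Print "owner:group mode" for a path, e.g. "root:disk 660".
device_perms() (
    stat -c '%U:%G %a' "$1"
)

# Succeed if user $1 belongs to group $2 (primary or supplementary).
user_in_group() (
    id -nG "$1" | tr ' ' '\n' | grep -qxF "$2"
)
```

For example, run `device_perms /dev/nvme0n1` and `user_in_group aerospike disk && echo member`. Joining the disk group can only help if the device node is group-owned by `disk` with the group read/write bits set, which the first helper shows directly.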


#7

I rolled a box in a test environment and went as far as to

chown aerospike:disk /dev/nvme0n1 && chown aerospike:disk /dev/nvme1n1

along with

usermod -a -G disk aerospike

and STILL got permission denied. Full log is below…

Feb 06 2018 15:46:57 GMT: INFO (as): (as.c:318) <><><><><><><><><><>  Aerospike Community Edition build 3.15.1.3  <><><><><><><><><><>
Feb 06 2018 15:46:57 GMT: INFO (config): (cfg.c:3531) # Aerospike database configuration file for deployments using mesh heartbeats.
Feb 06 2018 15:46:57 GMT: INFO (config): (cfg.c:3531) 
Feb 06 2018 15:46:57 GMT: INFO (config): (cfg.c:3531) service {
Feb 06 2018 15:46:57 GMT: INFO (config): (cfg.c:3531) 	user aerospike
Feb 06 2018 15:46:57 GMT: INFO (config): (cfg.c:3531) 	group aerospike
Feb 06 2018 15:46:57 GMT: INFO (config): (cfg.c:3531) 	paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
Feb 06 2018 15:46:57 GMT: INFO (config): (cfg.c:3531) 	pidfile /var/run/aerospike/asd.pid
Feb 06 2018 15:46:57 GMT: INFO (config): (cfg.c:3531) 	proto-fd-max 15000
Feb 06 2018 15:46:57 GMT: INFO (config): (cfg.c:3531) }
Feb 06 2018 15:46:57 GMT: INFO (config): (cfg.c:3531) 
Feb 06 2018 15:46:57 GMT: INFO (config): (cfg.c:3531) logging {
Feb 06 2018 15:46:57 GMT: INFO (config): (cfg.c:3531) 	# Log file must be an absolute path.
Feb 06 2018 15:46:57 GMT: INFO (config): (cfg.c:3531) 	file /var/log/aerospike/aerospike.log {
Feb 06 2018 15:46:57 GMT: INFO (config): (cfg.c:3531) 		context any info
Feb 06 2018 15:46:57 GMT: INFO (config): (cfg.c:3531) 	}
Feb 06 2018 15:46:57 GMT: INFO (config): (cfg.c:3531) }
Feb 06 2018 15:46:57 GMT: INFO (config): (cfg.c:3531) 
Feb 06 2018 15:46:57 GMT: INFO (config): (cfg.c:3531) network {
Feb 06 2018 15:46:57 GMT: INFO (config): (cfg.c:3531) 	service {
Feb 06 2018 15:46:57 GMT: INFO (config): (cfg.c:3531) 		address any
Feb 06 2018 15:46:57 GMT: INFO (config): (cfg.c:3531) 		port 3000
Feb 06 2018 15:46:57 GMT: INFO (config): (cfg.c:3531) 	}
Feb 06 2018 15:46:57 GMT: INFO (config): (cfg.c:3531) 
Feb 06 2018 15:46:57 GMT: INFO (config): (cfg.c:3531) 	heartbeat {
Feb 06 2018 15:46:57 GMT: INFO (config): (cfg.c:3531) 		mode mesh
Feb 06 2018 15:46:57 GMT: INFO (config): (cfg.c:3531) 		port 3002 # Heartbeat port for this node.
Feb 06 2018 15:46:57 GMT: INFO (config): (cfg.c:3531) 
Feb 06 2018 15:46:57 GMT: INFO (config): (cfg.c:3531) 		# List one or more other nodes, one ip-address & port per line:
Feb 06 2018 15:46:57 GMT: INFO (config): (cfg.c:3531) #		mesh-seed-address-port 10.10.10.11 3002
Feb 06 2018 15:46:57 GMT: INFO (config): (cfg.c:3531) #		mesh-seed-address-port 10.10.10.12 3002
Feb 06 2018 15:46:57 GMT: INFO (config): (cfg.c:3531) #		mesh-seed-address-port 10.10.10.13 3002
Feb 06 2018 15:46:57 GMT: INFO (config): (cfg.c:3531) #		mesh-seed-address-port 10.10.10.14 3002
Feb 06 2018 15:46:57 GMT: INFO (config): (cfg.c:3531) 
Feb 06 2018 15:46:57 GMT: INFO (config): (cfg.c:3531) 		interval 250
Feb 06 2018 15:46:57 GMT: INFO (config): (cfg.c:3531) 		timeout 10
Feb 06 2018 15:46:57 GMT: INFO (config): (cfg.c:3531) 	}
Feb 06 2018 15:46:57 GMT: INFO (config): (cfg.c:3531) 
Feb 06 2018 15:46:57 GMT: INFO (config): (cfg.c:3531) 	fabric {
Feb 06 2018 15:46:57 GMT: INFO (config): (cfg.c:3531) 		port 3001
Feb 06 2018 15:46:57 GMT: INFO (config): (cfg.c:3531) 	}
Feb 06 2018 15:46:57 GMT: INFO (config): (cfg.c:3531) 
Feb 06 2018 15:46:57 GMT: INFO (config): (cfg.c:3531) 	info {
Feb 06 2018 15:46:57 GMT: INFO (config): (cfg.c:3531) 		port 3003
Feb 06 2018 15:46:57 GMT: INFO (config): (cfg.c:3531) 	}
Feb 06 2018 15:46:57 GMT: INFO (config): (cfg.c:3531) }
Feb 06 2018 15:46:57 GMT: INFO (config): (cfg.c:3531) 
Feb 06 2018 15:46:57 GMT: INFO (config): (cfg.c:3531) namespace bmac {
Feb 06 2018 15:46:57 GMT: INFO (config): (cfg.c:3531) 	replication-factor 1
Feb 06 2018 15:46:57 GMT: INFO (config): (cfg.c:3531) 	memory-size 120G
Feb 06 2018 15:46:57 GMT: INFO (config): (cfg.c:3531) 	default-ttl 180d
Feb 06 2018 15:46:57 GMT: INFO (config): (cfg.c:3531) 
Feb 06 2018 15:46:57 GMT: INFO (config): (cfg.c:3531) 	high-water-disk-pct 50
Feb 06 2018 15:46:57 GMT: INFO (config): (cfg.c:3531) 	high-water-memory-pct 90
Feb 06 2018 15:46:57 GMT: INFO (config): (cfg.c:3531) 	stop-writes-pct 95
Feb 06 2018 15:46:57 GMT: INFO (config): (cfg.c:3531) 
Feb 06 2018 15:46:57 GMT: INFO (config): (cfg.c:3531) 	migrate-sleep 0
Feb 06 2018 15:46:57 GMT: INFO (config): (cfg.c:3531) 	disable-write-dup-res true
Feb 06 2018 15:46:57 GMT: INFO (config): (cfg.c:3531) 	evict-hist-buckets 10000000
Feb 06 2018 15:46:57 GMT: INFO (config): (cfg.c:3531) 	evict-tenths-pct 10
Feb 06 2018 15:46:57 GMT: INFO (config): (cfg.c:3531) 
Feb 06 2018 15:46:57 GMT: INFO (config): (cfg.c:3531) 	storage-engine device {
Feb 06 2018 15:46:57 GMT: INFO (config): (cfg.c:3531) 		device /dev/nvme0n1
Feb 06 2018 15:46:57 GMT: INFO (config): (cfg.c:3531) 		device /dev/nvme1n1
Feb 06 2018 15:46:57 GMT: INFO (config): (cfg.c:3531) 		write-block-size 128K
Feb 06 2018 15:46:57 GMT: INFO (config): (cfg.c:3531) 	}
Feb 06 2018 15:46:57 GMT: INFO (config): (cfg.c:3531) }
Feb 06 2018 15:46:57 GMT: INFO (config): (cfg.c:3551) system file descriptor limit: 100000, proto-fd-max: 15000
Feb 06 2018 15:46:57 GMT: INFO (hardware): (hardware.c:1785) detected 16 CPU(s), 8 core(s), 1 NUMA node(s)
Feb 06 2018 15:46:57 GMT: INFO (socket): (socket.c:2549) Node port 3001, node ID bb9de6bdab97e12
Feb 06 2018 15:46:57 GMT: INFO (config): (cfg.c:3592) node-id bb9de6bdab97e12
Feb 06 2018 15:46:57 GMT: INFO (namespace): (namespace_ce.c:96) {bmac} beginning cold start
Feb 06 2018 15:46:57 GMT: INFO (drv_ssd): (drv_ssd.c:3419) usable device size must be header size 1048576 + multiple of 131072, rounding down
Feb 06 2018 15:46:57 GMT: INFO (drv_ssd): (drv_ssd.c:3525) opened device /dev/nvme0n1: usable size 1899999920128, io-min-size 512
Feb 06 2018 15:46:57 GMT: INFO (drv_ssd): (drv_ssd.c:3419) usable device size must be header size 1048576 + multiple of 131072, rounding down
Feb 06 2018 15:46:57 GMT: INFO (drv_ssd): (drv_ssd.c:3525) opened device /dev/nvme1n1: usable size 1899999920128, io-min-size 512
Feb 06 2018 15:46:57 GMT: INFO (drv_ssd): (drv_ssd.c:1045) /dev/nvme0n1 has 14495849 wblocks of size 131072
Feb 06 2018 15:46:58 GMT: INFO (drv_ssd): (drv_ssd.c:1045) /dev/nvme1n1 has 14495849 wblocks of size 131072
Feb 06 2018 15:46:58 GMT: FAILED ASSERTION (drv_ssd): (drv_ssd.c:112) /dev/nvme0n1: DEVICE FAILED open: errno 13 (Permission denied)
Feb 06 2018 15:46:58 GMT: WARNING (as): (signal.c:210) SIGUSR1 received, aborting Aerospike Community Edition build 3.15.1.3 os el6
Feb 06 2018 15:46:58 GMT: INFO (as): (signal.c:214) call stack: found 12 frames
Feb 06 2018 15:46:58 GMT: INFO (as): (signal.c:214) call stack: frame 0: /usr/bin/asd(as_sig_handle_usr1+0x36) [0x46f8c6]
Feb 06 2018 15:46:58 GMT: INFO (as): (signal.c:214) call stack: frame 1: /lib64/libc.so.6(+0x35270) [0x7fb2b452b270]
Feb 06 2018 15:46:58 GMT: INFO (as): (signal.c:214) call stack: frame 2: /lib64/libpthread.so.0(raise+0x2b) [0x7fb2b578646b]
Feb 06 2018 15:46:58 GMT: INFO (as): (signal.c:214) call stack: frame 3: /usr/bin/asd(cf_fault_event+0x216) [0x4fc2fd]
Feb 06 2018 15:46:58 GMT: INFO (as): (signal.c:214) call stack: frame 4: /usr/bin/asd(ssd_fd_get+0xae) [0x4dc9a3]
Feb 06 2018 15:46:58 GMT: INFO (as): (signal.c:214) call stack: frame 5: /usr/bin/asd(ssd_read_header+0x52) [0x4dcce8]
Feb 06 2018 15:46:58 GMT: INFO (as): (signal.c:214) call stack: frame 6: /usr/bin/asd(ssd_load_records+0xa9) [0x4e1f7f]
Feb 06 2018 15:46:58 GMT: INFO (as): (signal.c:214) call stack: frame 7: /usr/bin/asd(as_storage_namespace_init_ssd+0x4d7) [0x4e352c]
Feb 06 2018 15:46:58 GMT: INFO (as): (signal.c:214) call stack: frame 8: /usr/bin/asd(as_storage_init+0x69) [0x4d90e4]
Feb 06 2018 15:46:58 GMT: INFO (as): (signal.c:214) call stack: frame 9: /usr/bin/asd(main+0x2ef) [0x43d1c1]
Feb 06 2018 15:46:58 GMT: INFO (as): (signal.c:214) call stack: frame 10: /lib64/libc.so.6(__libc_start_main+0xf5) [0x7fb2b4517c05]
Feb 06 2018 15:46:58 GMT: INFO (as): (signal.c:214) call stack: frame 11: /usr/bin/asd() [0x43c4c9]
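To pull just the failing device and errno out of a long log like the one above, a small filter helps; errno 13 is EACCES ("Permission denied") on Linux. A minimal sketch; the hypothetical `failed_opens` helper matches only the `DEVICE FAILED open: errno N` wording shown in this log, and other asd versions may word it differently:

```shell
# Hypothetical helper: print "<device> errno=<n>" for each asd
# "DEVICE FAILED open" line in the given log files (or stdin).
# The pattern mirrors only the FAILED ASSERTION line in the log above.
failed_opens() (
    sed -n 's|.* \(/dev/[^:]*\): DEVICE FAILED open: errno \([0-9][0-9]*\).*|\1 errno=\2|p' "$@"
)
```

For example, `failed_opens /var/log/aerospike/aerospike.log` prints one line per failed open, such as `/dev/nvme0n1 errno=13` for the log above.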

#8

Also tried executing asfixownership to see if something was stale in the image. Still no luck.


#9

Also tried initializing the disks with a zero write, just to cover all the bases. Still the same Permission denied error.


#10

  1. If you run as root, does Aerospike start? If so, it seems to be a permission issue; if not, let’s get it running as root first.
  2. Assuming a permission issue, try following the second method mentioned in “Configure SSD Resources used by Namespace” at https://www.aerospike.com/docs/operations/configure/non_root, which begins with “or add a udev rule to the Aerospike user…”

#11

It looks like you have already tried the udev tweak. Could you share your “/etc/udev/rules.d/99-aerospike.rules”?

Also are you sure these commands executed successfully?

udevadm control --reload-rules
udevadm trigger
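Both commands need root, and a failure is easy to miss when several commands are pasted at once. A minimal sketch of a wrapper that reports each command's exit status; `run_checked` is a hypothetical helper:

```shell
# Hypothetical helper: run a command, then report its exit status.
run_checked() (
    if "$@"; then
        rc=0
        echo "OK: $*"
    else
        rc=$?
        echo "FAILED (exit $rc): $*" >&2
    fi
    exit "$rc"
)
```

For example: `run_checked udevadm control --reload-rules && run_checked udevadm trigger`. The second command only runs if the first one succeeded.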

#12

A bit of progress. I did a test run on CentOS 7 with SELinux disabled and HALF reproduced the issue. Adding the aerospike user to the disk group did not work; however, using the udev rule below, the service started successfully.

KERNEL=="nvme[0-9]n1", OWNER="aerospike"
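The rule above can be written to the rules file named in #11 with a small helper. A minimal sketch; `install_aerospike_rule` is hypothetical, and its directory argument (defaulting to /etc/udev/rules.d) exists only so it can be tried against a scratch directory. After writing the file, reload and trigger udev as in #11:

```shell
# Hypothetical helper: write the udev rule from this thread to
# 99-aerospike.rules. The directory argument defaults to /etc/udev/rules.d
# and exists only so the helper can be tried on a scratch directory.
install_aerospike_rule() (
    rules_dir=${1:-/etc/udev/rules.d}
    rules_file=$rules_dir/99-aerospike.rules
    # Write the rule, then echo back what was written.
    printf '%s\n' 'KERNEL=="nvme[0-9]n1", OWNER="aerospike"' > "$rules_file" &&
        cat "$rules_file"
)
```

As root: `install_aerospike_rule && udevadm control --reload-rules && udevadm trigger`. The glob `nvme[0-9]n1` matches namespace 1 on controllers 0 through 9, which covers nvme0n1 and nvme1n1 here.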

I returned to my AWS instance and, using the same udev rule, still get the permission denied error.

CentOS kernel details:

[centos@ip-10-0-0-150 ~]$ cat /etc/system-release && uname -a
CentOS Linux release 7.4.1708 (Core)
Linux ip-10-0-0-150.velox.adtheorent.com 3.10.0-693.11.6.el7.x86_64 #1 SMP Thu Jan 4 01:06:37 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Aerospike AMI details:

[ec2-user@ip-10-0-0-92 ~]$ cat /etc/system-release && uname -a
Amazon Linux AMI release 2017.09
Linux ip-10-0-0-92 4.9.70-22.55.amzn1.x86_64 #1 SMP Wed Dec 20 23:36:28 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux


#13

We are able to run successfully as root; however, that is not a best practice.


#14

Yes, I made sure the udev rules update was executed. On Amazon Linux, even rebooting the host after adding the udev rules results in the same permission error.


#15

You didn’t mention whether you tried disabling SELinux on the AWS instance. Did you?


#16

What is incredibly interesting is that I just went to verify my 99-aerospike.rules file and it wasn’t there. After containing my anger, I created it again, verified its existence, reloaded the udev rules, triggered, and THEN it started…
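Since the rules file had silently gone missing here, it may be worth checking for it on every node before reloading udev. A minimal sketch; `check_rule_file` is a hypothetical helper that only checks that the file exists and contains an `OWNER="aerospike"` rule, defaulting to the path used in this thread:

```shell
# Hypothetical helper: confirm the rules file exists and still contains an
# OWNER="aerospike" rule. Defaults to the path used in this thread.
check_rule_file() (
    f=${1:-/etc/udev/rules.d/99-aerospike.rules}
    if [ ! -f "$f" ]; then
        echo "missing: $f" >&2
        exit 1
    elif ! grep -q 'OWNER="aerospike"' "$f"; then
        echo "no OWNER=\"aerospike\" rule in: $f" >&2
        exit 1
    fi
    echo "present: $f"
)
```

For example: `check_rule_file && udevadm control --reload-rules && udevadm trigger`, so udev is only reloaded when the rule is actually on disk.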


#17

Negative, the SELinux configuration was not modified on the Amazon Linux instance; SELinux is disabled there by default.

[root@ip-10-0-0-92 /]# sestatus
SELinux status: disabled
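To script the same check across nodes, the `sestatus` output can be parsed. A minimal sketch, assuming the `SELinux status: <value>` line format shown above; `selinux_disabled` is a hypothetical helper:

```shell
# Hypothetical helper: succeed if the given `sestatus` output reports
# "SELinux status: disabled".
selinux_disabled() (
    printf '%s\n' "$1" | awk -F: '
        $1 == "SELinux status" { v = $2; gsub(/[ \t]/, "", v); found = (v == "disabled") }
        END { exit !found }'
)
```

For example: `selinux_disabled "$(sestatus)" && echo disabled`. Where available, `getenforce`, which prints `Enforcing`, `Permissive` or `Disabled`, is a simpler alternative.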