Prevent clustering between installations


#1

How do you prevent one Aerospike installation from communicating, clustering, and interfering with another? I had a working Aerospike installation on our corporate network running a YCSB test. It failed as soon as Aerospike was installed on another system on the same network. The two formed a cluster, and the statistics showed cluster-size set to 2. Setting “paxos-max-cluster-size 1” in the .conf didn’t help. Both were installed and configured as memory-only, which I assumed would keep each one a standalone system. I can’t make either work while both are turned on. I have to shut one off and reboot the other to regain function.


#2

Nodes discover each other via heartbeats. Running servers on different heartbeat ports will prevent them from merging.
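Here is a minimal sketch of what that might look like in aerospike.conf for a 3.x-era multicast setup. Only the heartbeat port moves off its default of 9918. The service (3000), fabric (3001) and info (3003) ports stay as shipped. The value 9920 is an arbitrary example. The `multicast-group` key name follows recent sample configs (some older releases used `address` in this block instead), so check it against the config that came with your version.

```conf
network {
    service {
        address any
        port 3000          # client/tool connections (aql, asinfo); unchanged
    }

    heartbeat {
        mode multicast
        multicast-group 239.1.99.222
        port 9920          # example value; the default is 9918

        interval 150
        timeout 10
    }

    fabric {
        port 3001          # intra-cluster data traffic; unchanged
    }

    info {
        port 3003          # unchanged
    }
}
```

Nodes meant to form one cluster need identical heartbeat settings. A separate installation would use a different port (or group). Because aql and asinfo connect to the service port, a heartbeat-only change should not affect them.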

Also, the latest version has a cluster-name parameter. It prevents nodes from joining a cluster with a different name.
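For reference, here is a sketch of where the setting goes. It assumes a server release recent enough to support it, where cluster-name belongs in the service context. The name ycsb-test-a is a made-up example. Later in this thread, setting the name alone did not stop the two nodes from interfering, so pairing it with a distinct heartbeat port seems the safer choice.

```conf
service {
    # ... keep the existing service settings as they are ...
    cluster-name ycsb-test-a    # example name; nodes with a different
                                # cluster-name should not be admitted
}
```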


#3

Thanks Kevin. I understand that most users want auto-clustering. An additional question: how can I prevent it from happening while I install the next node?


#4

Configure the node to use a different heartbeat port and/or cluster name.


#5

Since I’ve not changed either of those before, will that make it difficult for YCSB to function?


#6

Hi Kevin, thanks for the help! I’ve looked through the Aerospike docs and do not see any information on disabling automatic clustering. I’m surprised, given the negative effect it had on my installations. I had a system running a YCSB test, and it failed immediately after the other Aerospike node began operating. Every read returned errors. On top of that, I could no longer control the first installation. Restarting the Aerospike service had NO effect. Other asinfo and AQL commands also did not work. Aerospike became totally unusable. This would seem to be a BAD thing for any new user. What if someone else on their network was also experimenting with Aerospike? Each would experience sudden and unexplained failures.

I have seen the information on heartbeat, but it is either multicast or unicast, and it is not clear which is better to use when changing the heartbeat port. Did you mean to change only port 3000, and not 3001-3003 as well? What about 9918? Are there any negative effects from changing the heartbeat? Would all the utilities (asinfo, aql, etc.) still work?

Cluster Name sounds like a good choice for me, but where or how is it set?

-Dick


#7

Hi Kevin, I found cluster-name in the docs and set it, but it does NOT keep them from talking and failing. After setting cluster-name on both, when I run aql on one system and type “asinfo cluster-name”, it shows both. And they are affecting each other’s operation, negatively.

How do I keep separate nodes from talking to each other while still allowing all the utilities to work? Which ports should I set? And every install has to choose a different range. -Dick


#8

When I did a “systemctl restart aerospike” it reset both systems!!! Even though they have different cluster-names.


#9

OK, so I have to change the heartbeat port, which by default is multicast 9918. Can I change it to any number in the range 0-65535?


#10

Yes, though you should avoid Linux’s privileged port range (ports below 1024).


#11

Are both nodes hosted on the same machine?

If not, were there any warnings at the end of the other node’s logs?


#12
  1. Could you describe how they are “affecting each other’s operation, negatively”?
  2. Did you set it dynamically or statically (in the config)?
    1. If statically, and you restarted the node, then the tools should report only one or the other.
    2. If dynamically, then there are additional steps to prevent the tools from discovering them. The tools look at the alumni list (essentially a historical record of cluster members built up during runtime). This lets you inspect a split-brain cluster or see that nodes are down.

#13

They are on separate machines. I set the cluster-name in each node’s .conf and restarted each service. Then, with a YCSB test running on one, I did a restart on the other and, kaboom, both were cleared. I guess logging is not configured by default, because there isn’t any /var/log/aerospike…


#14

Anyway, I have changed the multicast port, and that seems to work… so far. But I’m wondering about the multicast traffic and whether I should switch to unicast (mesh). If so, how do I configure that?


#15

That is unexpected.

Logging is enabled by default, but accessing the logs has changed on platforms using systemd. If you are using systemd then by default you will need to access the logs via journalctl. You can also add file based logging.
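Here is a minimal logging stanza along those lines, offered as a sketch. The console sink is what a systemd install reads through journalctl, and the file sink adds a log file. The file path is an example. Its directory has to exist and be writable by the user the server runs as.

```conf
logging {
    # Console sink: under systemd this output lands in the journal.
    console {
        context any info
    }

    # Optional file-based sink (example path).
    file /var/log/aerospike/aerospike.log {
        context any info
    }
}
```

With systemd, `journalctl -u aerospike` should show the console output. This assumes the unit is named aerospike, as in the `systemctl restart aerospike` command mentioned in this thread.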

We prefer multicast for ease of operations. However, multicast isn’t an option on most (all?) cloud providers, and we have also seen some network hardware behave poorly with multicast. See here for mesh configuration.
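Here is a sketch of a mesh heartbeat block that would replace the multicast one. It assumes a release that supports `mesh-seed-address-port` (older releases used separate `mesh-address`/`mesh-port` keys). The 192.0.2.x addresses are documentation placeholders for the nodes of this cluster.

```conf
network {
    # service, fabric and info stanzas stay as they are

    heartbeat {
        mode mesh
        port 3002                          # mesh heartbeat port (default 3002)

        # Seed only nodes that belong to THIS cluster (placeholder addresses).
        mesh-seed-address-port 192.0.2.10 3002
        mesh-seed-address-port 192.0.2.11 3002

        interval 150
        timeout 10
    }
}
```

A mesh node learns about the cluster through the seeds it lists, so two installations should stay separate as long as neither lists the other (or a member of the other's cluster) as a seed.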


#16

I meant that log files are not configured. Yes, it logs to stderr (the console), but what happens is that errors start pouring out.


#17

Could you elaborate?


#18

On the system running the YCSB test, which is reading all 112M records, it suddenly starts scrolling read errors for non-existent keys, because its database has been emptied.


#19

So, cluster-name doesn’t keep them from talking to each other. But heartbeat port seems to.


#20

BTW, does the heartbeat port in the aerospike.conf file have to be different on each system, or does it just need to be something other than 9918 for multicast or 3002 for mesh?