Aerospike 3.5.15 shutting down with no warning [Resolved]

We have an Aerospike 3.5.15 node trying to migrate from a 3.5.12 node, but it has shut down twice now after a few hours of running. /var/log/aerospike.log does not contain any warnings or errors at the end of the log, and everything appears to be fine, with plenty of memory and storage. Are there any other places I can look to diagnose this?

Both of them are running on VMware with 8 cores and 64 GB RAM.


In aerospike.log, does the logging just halt abruptly? I know you said you did not see any illuminating messages at the end of the log.

What do you see in /var/log/messages? Specifically, are there any indications of out-of-memory errors? That would be the next place I would look for an ominous, abrupt halt.
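For example, something along these lines usually surfaces OOM-killer activity (the log path and exact wording vary by distribution, so treat this as a sketch):

```
# Look for the kernel OOM killer in the syslog
# (the path may be /var/log/syslog on Debian/Ubuntu systems)
grep -iE "out of memory|oom-killer|killed process" /var/log/messages

# The kernel ring buffer is another place the same messages show up
dmesg | grep -iE "out of memory|oom"
```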

Let us know what you find there,

-DM

Hi Dave,

There might be a memory issue after all. I found one line that reported 1% free system memory, just about 1 GB, logged from (thr_info.c). After that there is another round of histogram dumps.

I’ve started it now with reduced memory settings.
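Roughly along these lines in aerospike.conf; the namespace name and values below are purely illustrative, not our exact settings:

```
namespace test {
    replication-factor 2
    memory-size 32G        # lowered to leave more headroom for the OS
    default-ttl 30d
    storage-engine device {
        file /opt/aerospike/data/test.dat
        filesize 16G
        data-in-memory false
    }
}
```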

Hi Isaack-

Is everything fine? Does the node stay alive for more than a few hours?

If you need additional assistance, would you let us know?

Thank you for your time,

-DM

Hello Dave,

I can’t tell yet; I had to shut it down because it was interfering heavily with our app once it started migrating. We are waiting for some datacenter changes from our provider before we start it again.

Is it correct that if one node is running with unicast discovery, then even if you start up a new node with the mesh configuration, it will still start migrating? We would like to prevent the auto-migration, but I’m not sure if it’s possible without restarting the first node.

Hi Isaack-

Regardless of whether the cluster uses mesh or multicast, if the nodes hold data and a node either joins or leaves the cluster, the cluster migrates data. The nodes in the cluster must all use either mesh or multicast.

Let’s say that we have a four-node cluster. The cluster holds 100 records. When node_3 leaves the cluster, the remaining nodes in the cluster migrate to balance storage of records. When node_3 returns to the cluster, the nodes again migrate to balance records. While I have simplified the process for the purposes of discussion, this is the intended behavior of the product.
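As a rough back-of-the-envelope sketch (assuming replication-factor 2 and a perfectly even spread, which the partition scheme only approximates):

```
100 records x 2 copies = 200 copies in the cluster
4 nodes:        200 / 4 ≈ 50 copies per node
node_3 leaves:  200 / 3 ≈ 67 copies per node  -> migrations to rebalance
node_3 returns: 200 / 4 ≈ 50 copies per node  -> migrations again
```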

I don’t mean to answer a question with another question, but why do you wish to turn off migrations? How were migrations interfering with your application?

I hope this helps. Would you let me know your thoughts?

Thank you for your time,

-DM

Hi Dave,

No problem with the question.

We’ve made a new data structure that we wanted to start on Node 2 while Node 1 is running with the old data structure.

But when the migration towards Node 2 starts, it appears to block or slow down access from our apps to the Aerospike cluster. At first we thought it completely blocked our access, but we’ve started to believe that, instead, it is just really slow.

We’ve considered just changing the namespace name, but it still insists on migrating when we start up Node 2, even though we haven’t configured the old namespace name on Node 2. This is the reason I wanted to disable clustering temporarily: it takes a very long time to restart Node 1, it takes a very long time to complete the migration, and in the meantime it blocks or slows down our apps.

Hi Isaack-

As I understand it, you have a two-node cluster. The nodes are identical. Node_1 has the existing data, and node_2 has the new data. You want to keep them separate, and prevent migrations between them?

Would it be possible to create two separate one-node clusters? Cluster_1 would hold the existing data, cluster_2 would hold the new data, and you could perform testing side by side without migrations.

Alternatively, you could create a second two-node cluster and use that for new data, and leave existing data on the current cluster.

Would either of those answers work for you?

I hope this helps,

-DM

Hi Dave,

Yes, I wanted to prevent migration as it is taking too long and blocking our applications.

We cannot easily put it on another subnet at the moment. If we decide to restart node_1, can we change the heartbeat mode to mesh (replication-factor=1) to prevent auto-discovery/migration, and then start up the new cluster, also with mesh but with replication-factor=2 and a seed node address?

But yes, we can create a new cluster.

Hi Isaack-

Yes, you could have one node set to use mesh, and the other node set to multicast, and they would not form a cluster.
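A minimal sketch of what that could look like in each node’s aerospike.conf (addresses, ports, and interval/timeout values are only placeholders):

```
# node_1 - mesh (unicast) heartbeat; it only seeds itself, so it stays a one-node cluster
heartbeat {
    mode mesh
    port 3002
    mesh-seed-address-port 10.0.0.1 3002
    interval 150
    timeout 10
}

# node_2 - multicast heartbeat; node_1 is not listening on multicast, so they never cluster
heartbeat {
    mode multicast
    address 239.1.99.222
    port 9918
    interval 150
    timeout 10
}
```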

I hope this helps,

-DM

Hi Dave,

We’ve now succeeded in running both clusters, each with its own data, and everything appears to be working fine without the node shutting down.

Thanks for the help!
