Stops responding on sindex drop (may be related to AER-4458)


#1

Hi,

We have a cluster with 3 nodes (2 at version 3.6.0 and 1 at version 3.6.1) with replication factor 2.

There were no migrations going on, so I decided to drop old indexes for sets which no longer exist.

After sending the drop command through AMC, Aerospike stopped responding and clients started to get timeouts.

AMC shows all nodes as down, and when I try to get a stat on the drop process through AQL it returns an “error: (9)” message.

The services are still running and there are no errors to see in the logs.

I also tried to restart one of the nodes, which failed and seems to be blocked by something (I can see in the log that it got the shutdown message “SIGTERM received, shutting down”). After this line the “INFO (drv_ssd)” lines are just repeated in the log with no changes in the values.

Should I just wait for it to finish? How can I see whether it is working? Shouldn’t Aerospike keep running while dropping an index?

Please let me know if I can provide any further information.

Thanks and regards


#2

One more thing: It seems like client connections are not closed correctly in this state.

I only have 2 clients trying to connect to Aerospike, and the number of connected clients is constantly increasing (now I am getting “WARNING (demarshal) dropping client connection: hit limit 15001 connections”).

Using C# client v. 3.1.5


#3

Hi,

Did you have data in the set on which you deleted indexes?

Can you send us the last 1000 lines of the Aerospike log files?
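
If it helps, here is a minimal Python sketch for pulling the last 1000 lines out of a large log file without loading the whole file into memory. The function name `tail_lines` is made up for illustration, and the path in the usage note is only an assumption about where the log lives on your nodes; adjust it to your configuration.

```python
from collections import deque


def tail_lines(path, n=1000):
    """Return the last n lines of a text file as a list of strings.

    The file is streamed line by line; a bounded deque keeps only the
    most recent n lines, so memory use stays small even for huge logs.
    """
    # errors="replace" avoids failing on stray non-UTF-8 bytes in the log.
    with open(path, "r", encoding="utf-8", errors="replace") as fh:
        return list(deque(fh, maxlen=n))


def save_tail(src, dst, n=1000):
    """Write the last n lines of src to dst and return how many were written."""
    lines = tail_lines(src, n)
    with open(dst, "w", encoding="utf-8") as out:
        out.writelines(lines)
    return len(lines)
```

For example, `save_tail("/var/log/aerospike/aerospike.log", "aerospike_tail.log")` would write a small file that is easy to attach, assuming the log is at that (hypothetical) location.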

Thanks


#4

Hi again,

No, there was no data; the set did not even exist (old set).

I will try to find the log lines and send them to you (should I just attach them here?). The files got large. Otherwise I will try to drop the index again and copy the log files before they get too large.


#5

So I just tried to drop the indexes again.

All nodes are now version 3.6.1. I stopped all clients so there was no traffic. I managed to drop 4 indexes, through both AMC and AQL. Then I started the clients for the last index to see if I could reproduce the error, and that did it! (1 of 3 nodes stopped responding)

The clients are doing read/write, UDF and batch operations (maybe some of these are triggering the error?). I just found this in the version 3.6.2 release notes: “[AER-4458] - (SINDEX) Crash occurred while performing list and map queries that return list or map bins.” Maybe this has something to do with it?

When running AQL SHOW INDEXES, 1 of 3 nodes returns Error (9) (in case that helps).

I tried to stop the aerospike service, but didn’t succeed. The only way to get it back up and running was by rebooting.

I have the last 1000 lines from the log files for the crashed node. I just don’t know how to upload or send them.


#6

@LarsNymand,

Thank you. I will reach out to you privately with instructions on how to send these to @pratyyy.