Expiration of records is lagging on only one node (out of six)

We have 6 nodes running 3.11.0.2 community edition.

Each node has around 20 million records, with writes evenly distributed across the cluster.

One node is falling heavily behind on deleting expired records. As of now it holds about twice as many records as the others.

In this case it is node .103.

Normal READ and WRITE operations perform as usual, and node .103 is not slow.

Looking only at the NSUP logs:

Mar 07 2022 18:28:54 GMT: INFO (nsup): (thr_nsup.c:1174) {buck_xxxxxxx} nsup start

Mar 07 2022 20:32:20 GMT: INFO (nsup): (thr_nsup.c:1098) {buck_xxxxxx} Records: 41115293, 0 0-vt, 40897418(118769533) expired, 0(0) evicted, 0(0) set deletes. Evict ttl: 0. Waits: 0,0,6889892. Total time: 7405785 ms

The process takes more than 2 hours on this node, while the other nodes complete it in more frequent, shorter batches (each with fewer expired records), keeping themselves cleaner overall.

What is the meaning of Waits: 0,0,6889892?

Does anyone have a similar case? We think it could be an SSD problem, but it is strange that we don't see any effect on read/write performance.

thanks

That is a pretty old version and the whole nsup sub-system has since been redone, but if I remember correctly, one of the typical scenarios for nsup to fall behind was the generated delete transactions not being processed fast enough. So that wouldn't be an SSD problem but rather a connectivity issue, or even just the network. The meaning of the Waits is documented here (you need to click on 'Show removed server messages'):

Here is the relevant section; that number is actually the throttling caused by reaching the maximum of 10,000 pending delete transactions. I think it is in milliseconds rather than microseconds, which would pretty much explain that high total time (the fact that the message that replaced this one, between versions 3.14 and 4.5.1, reports the waits in milliseconds is further evidence that these are milliseconds…).

Waits: Accumulated waiting time for different stages of delete to finish, in microseconds. In each cycle, nsup performs set-deletes before general expiration and eviction.

  • n_set_waits: The first wait is the number of microseconds that nsup slept during set-deletes stage while waiting for the nsup-delete-queue to drop to 10,000 elements or less (Throttling).

  • n_clear_waits: The second wait is the number of microseconds until the nsup-delete-queue cleared (including the previous namespace if applicable) before beginning general expiration and eviction (Minimize unnecessary eviction if deletes already pending). For the last namespace in the nsup cycle, this is reported on its own line, nsup clear waits: 1441

  • n_general_waits: The third wait is the number of microseconds nsup slept during general expiration and eviction while waiting for the nsup-delete-queue to drop to 10,000 elements or less (Throttling).
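To sanity-check the milliseconds interpretation, here is a minimal sketch (plain Python; the summary line is copied from the log above, and treating the waits as milliseconds is the assumption being tested) that pulls the waits and the total time out of that line and compares them:

import re

log_line = ("Mar 07 2022 20:32:20 GMT: INFO (nsup): (thr_nsup.c:1098) {buck_xxxxxx} "
            "Records: 41115293, 0 0-vt, 40897418(118769533) expired, 0(0) evicted, "
            "0(0) set deletes. Evict ttl: 0. Waits: 0,0,6889892. Total time: 7405785 ms")

# The three waits are n_set_waits, n_clear_waits, n_general_waits.
set_waits, clear_waits, general_waits = (
    int(w) for w in re.search(r"Waits: (\d+),(\d+),(\d+)", log_line).groups())
total_ms = int(re.search(r"Total time: (\d+) ms", log_line).group(1))

print(f"general waits: {general_waits / 1000 / 60:.0f} min, "
      f"{100 * general_waits / total_ms:.0f}% of a {total_ms / 1000 / 3600:.1f} h cycle")
# Read as milliseconds, the throttle wait is ~115 min, i.e. ~93% of the ~2.1 h
# cycle, which lines up with the total time; read as microseconds it would be
# under 7 seconds and could not explain it.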

Thanks, we lowered the throttling (nsup-delete-sleep) to 50 and the node started working much better.

For others' reference, we executed this:

asadm -e 'asinfo -v "set-config:context=service;nsup-delete-sleep=50"'
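In case anyone wants to script the same change, here is a small sketch (Python, assuming asadm is installed on the host it runs from) that wraps the exact command above and then reads the service config back with get-config to confirm the new value on each node:

import subprocess

def asadm_info(cmd: str) -> str:
    """Run an asinfo command on every node via asadm -e and return the output."""
    return subprocess.run(
        ["asadm", "-e", f'asinfo -v "{cmd}"'],
        check=True, capture_output=True, text=True,
    ).stdout

# Lower the sleep nsup uses to throttle delete generation (the command above).
print(asadm_info("set-config:context=service;nsup-delete-sleep=50"))

# Read the service config back to confirm nsup-delete-sleep is now 50.
print(asadm_info("get-config:context=service"))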

Again, thanks for the help!

Thanks for closing the loop!