Writes failing to Aerospike cluster

Hi,

I am new to using Aerospike. For my use case, I am reading data from EMR and writing to an Aerospike cluster of two AWS m4.2xlarge instances. One thing I am noticing is that the successful TPS is almost half of the total TPS during the write job. I am attaching a snapshot for reference. If the messages are actually getting dropped, is there any parameter in the Aerospike client that I can use to guard against it?

Thanks,

  1. Aerospike version?

  2. Grep logs for a line similar to:

    {ns_name} client: tsvc (0,0) proxy (0,0,0) read (126,0,1,3) write (2886,0,23) delete (197,0,1,19) udf (35,0,1) lang (26,7,0,3)
    

    Share a few during this event - For details see: http://www.aerospike.com/docs/reference/serverlogmessages

  3. Share namespace stats:

    asadm -e "show stat namespace"
    

    For details see: Metrics Reference | Aerospike Documentation

Hi,

I am using Aerospike version 3.10.1.1.

I am attaching the screenshot of the logs from when the client was doing the writes.

For node 2 it was:

    Mar 31 2017 01:21:46 GMT: INFO (info): (ticker.c:551) {namespace-dev} client: tsvc (0,0) proxy (0,0,0) read (0,0,0,0) write (19893,4454,0) delete (0,0,0,0) udf (0,0,0) lang (0,0,0,0)
    Mar 31 2017 01:21:56 GMT: INFO (info): (ticker.c:551) {namespace-dev} client: tsvc (0,0) proxy (0,0,0) read (0,0,0,0) write (19893,4454,0) delete (0,0,0,0) udf (0,0,0) lang (0,0,0,0)
    Mar 31 2017 01:22:06 GMT: INFO (info): (ticker.c:551) {namespace-dev} client: tsvc (0,0) proxy (0,0,0) read (0,0,0,0) write (19893,4454,0) delete (0,0,0,0) udf (0,0,0) lang (0,0,0,0)

Also, if you can tell me which parameters from the stat command you are specifically interested in, I can add them to the comment.

Thanks, Soudipta

asadm -e "show stat namespace for <namespace name> like client write fail"

Hi Kevin, thanks for the pointer. I am attaching the screenshot of the stat command output.

Quite a few events were dropped during the writes, as I can see from client_write_error.

I am currently having my client do threaded writes, and I think that is adding load to the cluster. I am curious to know if there is any parameter that we can set on the Aerospike server so that it performs threaded writes to the disk.

Thanks, Soudipta

The value of fail_key_busy indicates you are hitting one or more hot keys, which happens when you have more than transaction-pending-limit transactions queued against a single key.

You should do your best to design your application to avoid hot keys. You can also try (if your use case allows) using the replace or replace-only record-exists policies, which can significantly improve write performance.

In this case the client will receive an error code 14 (key busy) response from the server.
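If it helps, here is a minimal sketch of both suggestions using the Aerospike Java client; the set name, bin name, and user key are placeholders rather than values from this thread:

    import com.aerospike.client.AerospikeClient;
    import com.aerospike.client.AerospikeException;
    import com.aerospike.client.Bin;
    import com.aerospike.client.Key;
    import com.aerospike.client.ResultCode;
    import com.aerospike.client.policy.RecordExistsAction;
    import com.aerospike.client.policy.WritePolicy;

    public class ReplaceWriteExample {
        public static void main(String[] args) {
            AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);
            try {
                // REPLACE skips merging with the existing record, which can
                // noticeably improve write throughput when every write sends
                // the full record.
                WritePolicy policy = new WritePolicy();
                policy.recordExistsAction = RecordExistsAction.REPLACE;

                Key key = new Key("namespace-dev", "demo-set", "user-123");
                Bin bin = new Bin("value", 42);
                client.put(policy, key, bin);
            } catch (AerospikeException ae) {
                if (ae.getResultCode() == ResultCode.KEY_BUSY) {
                    // Error code 14: more than transaction-pending-limit
                    // transactions are already queued on this key (hot key).
                    System.err.println("Hot key, write rejected");
                } else {
                    throw ae;
                }
            } finally {
                client.close();
            }
        }
    }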

You could also try increasing transaction-pending-limit. If that reduces the problem, you are on the right track. However, increasing transaction-pending-limit will adversely affect write latency, so it is not the best way to fix this problem. You need to mitigate the hot key by modeling your data differently.

    asadm
    Admin> asinfo -v 'set-config:context=service;transaction-pending-limit=40'
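That asinfo call changes the value dynamically; to make it persist across restarts, the same parameter can also be set in the service context of aerospike.conf (shown here in isolation as a sketch):

    service {
        transaction-pending-limit 40
    }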

I think I have reduced the hot-key issue by reducing the number of threads on my client doing the writes. Maybe this is just an interim fix and, as you suggest, remodelling the schema is the correct way to go. But coming back to the issue, I still see dropped writes after reducing the number of threads, only this time the fail_key_busy count does not increase, which I assume means the hot-key issue is resolved, if I am not wrong. Looking at the logs, I found this error:

    Mar 31 2017 20:03:47 GMT: WARNING (drv_ssd): (drv_ssd.c:4147) {user-profile-dev} write fail: queue too deep: q 1513, max 1512

My write-block-size is 1M and max-write-cache is 1512M.

This warning occurs if you exceed the disk's throughput capabilities.
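For context, the "max 1512" in that warning follows from your settings: max-write-cache / write-block-size = 1512M / 1M = 1512 buffered write blocks. Once the pending write queue grows past that depth, writes start failing with this warning until the device catches up.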

See:

Yes, I was looking at that. Thanks for sharing it. So there are a number of things I can do here:

  1. Decrease the concurrency level further - not a good idea.
  2. Remodel my schema so that the same key is not hit multiple times from multiple clients at the same time. I think this is the right way to go.
  3. Get better AWS instances, which I am not so keen on, as we will hit the same problem some day or other. In any case we plan to use i3 instances for the production use case.

I will update this thread with whichever option I go with and how it affects the writes.

Thanks, Soudipta

Agreed, if the root problem isn’t addressed it will likely surface again.

I also wanted to ask if there is an error code that I can catch for the queue too deep error. Is it 152?

It's 18. Just looked at the docs.

See shared KB article:
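In case it is useful, here is a minimal sketch (again using the Aerospike Java client, with placeholder names) of catching that error and backing off so the device write cache can drain; error code 18 corresponds to ResultCode.DEVICE_OVERLOAD in the Java client:

    import com.aerospike.client.AerospikeClient;
    import com.aerospike.client.AerospikeException;
    import com.aerospike.client.Bin;
    import com.aerospike.client.Key;
    import com.aerospike.client.ResultCode;
    import com.aerospike.client.policy.WritePolicy;

    public class OverloadRetryExample {
        // Retry a single put with exponential backoff when the device write
        // queue is too deep (ResultCode.DEVICE_OVERLOAD == 18).
        static void putWithBackoff(AerospikeClient client, WritePolicy policy,
                                   Key key, Bin bin) throws InterruptedException {
            long sleepMs = 50;
            for (int attempt = 0; attempt < 5; attempt++) {
                try {
                    client.put(policy, key, bin);
                    return;
                } catch (AerospikeException ae) {
                    if (ae.getResultCode() != ResultCode.DEVICE_OVERLOAD) {
                        throw ae;  // only retry on queue-too-deep overload
                    }
                    Thread.sleep(sleepMs);
                    sleepMs *= 2;  // back off so the write cache can drain
                }
            }
            throw new AerospikeException(ResultCode.DEVICE_OVERLOAD,
                    "write still failing after retries");
        }
    }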