Writes failing to Aerospike cluster

Hi,

I am new to using Aerospike. For my use case, I am reading data from EMR and writing to an Aerospike cluster of two AWS m4.2xlarge instances. One thing I am noticing is that the successful TPS is almost half of the total TPS during the write job. I am attaching a snapshot for reference. If the messages are actually getting dropped, is there any parameter in the Aerospike client that I can use to guard against it?

Thanks,

  1. Aerospike version?

  2. Grep logs for a line similar to:

    {ns_name} client: tsvc (0,0) proxy (0,0,0) read (126,0,1,3) write (2886,0,23) delete (197,0,1,19) udf (35,0,1) lang (26,7,0,3)
    

    Share a few during this event - For details see: http://www.aerospike.com/docs/reference/serverlogmessages

  3. Share namespace stats:

    asadm -e "show stat namespace"
    

    For details see: Metrics Reference | Aerospike Documentation

Hi,

I am using Aerospike version 3.10.1.1.

I am attaching the screenshot of the logs from when the client was doing the writes.

For node 2 it was:

    Mar 31 2017 01:21:46 GMT: INFO (info): (ticker.c:551) {namespace-dev} client: tsvc (0,0) proxy (0,0,0) read (0,0,0,0) write (19893,4454,0) delete (0,0,0,0) udf (0,0,0) lang (0,0,0,0)
    Mar 31 2017 01:21:56 GMT: INFO (info): (ticker.c:551) {namespace-dev} client: tsvc (0,0) proxy (0,0,0) read (0,0,0,0) write (19893,4454,0) delete (0,0,0,0) udf (0,0,0) lang (0,0,0,0)
    Mar 31 2017 01:22:06 GMT: INFO (info): (ticker.c:551) {namespace-dev} client: tsvc (0,0) proxy (0,0,0) read (0,0,0,0) write (19893,4454,0) delete (0,0,0,0) udf (0,0,0) lang (0,0,0,0)

Also, if you can tell me which parameters from the stat command you are specifically interested in, I can add them to the comment.

Thanks, Soudipta

asadm -e "show stat namespace for <namespace name> like client write fail"

Hi Kevin, thanks for the pointer. I am attaching the screenshot of the stat command output.

Quite a few events were dropped during the writes, as I can see from client_write_error.

I am currently having my client do threaded writes, and I think that is adding load to the cluster. I am curious to know if there is any parameter that we can set on the Aerospike server so that it performs threaded writes to the disk.

Thanks, Soudipta

The value of fail_key_busy indicates you are hitting one or more hot keys, which happens when you have more than transaction-pending-limit transactions queued against a single key.

You should do your best to design your application to avoid hot keys. You can also try (if your use case allows) using the replace or replace-only record-exists policies, which can significantly improve write performance.

In this case the client will receive an error code 14 (key busy) response from the server.
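If it helps, here is a minimal sketch of both suggestions using the Aerospike Java client; the set name, bin name, and user key are placeholders rather than values from this thread:

    import com.aerospike.client.AerospikeClient;
    import com.aerospike.client.AerospikeException;
    import com.aerospike.client.Bin;
    import com.aerospike.client.Key;
    import com.aerospike.client.ResultCode;
    import com.aerospike.client.policy.RecordExistsAction;
    import com.aerospike.client.policy.WritePolicy;

    public class ReplaceWriteExample {
        public static void main(String[] args) {
            AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);
            try {
                // REPLACE skips merging with the existing record, which can
                // noticeably improve write throughput when every write sends
                // the full record.
                WritePolicy policy = new WritePolicy();
                policy.recordExistsAction = RecordExistsAction.REPLACE;

                Key key = new Key("namespace-dev", "demo-set", "user-123");
                Bin bin = new Bin("value", 42);
                client.put(policy, key, bin);
            } catch (AerospikeException ae) {
                if (ae.getResultCode() == ResultCode.KEY_BUSY) {
                    // Error code 14: more than transaction-pending-limit
                    // transactions are already queued on this key (hot key).
                    System.err.println("Hot key, write rejected");
                } else {
                    throw ae;
                }
            } finally {
                client.close();
            }
        }
    }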

You could also try increasing transaction-pending-limit. If that reduces the problem, you are on the right track. However, increasing transaction-pending-limit will adversely affect write latency, so it is not the best way to fix this problem. You need to mitigate the hot key by modeling your data differently.

    asadm
    Admin> asinfo -v 'set-config:context=service;transaction-pending-limit=40'
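That asinfo call changes the value dynamically; to make it persist across restarts, the same parameter can also be set in the service context of aerospike.conf (shown here in isolation as a sketch):

    service {
        transaction-pending-limit 40
    }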

I think I have reduced the hot-key issue by reducing the number of threads on my client doing the writes. Maybe this is just an interim fix and, as you suggest, remodelling the schema is the correct way to go. But coming back to the issue, I still see dropped writes after reducing the number of threads, only this time the fail_key_busy count does not increase, which I assume means the hot-key issue is resolved, if I am not wrong. Looking at the logs, I found this error:

    Mar 31 2017 20:03:47 GMT: WARNING (drv_ssd): (drv_ssd.c:4147) {user-profile-dev} write fail: queue too deep: q 1513, max 1512

My write-block-size is 1M and max-write-cache is 1512M.

This warning occurs if you exceed the disk's throughput capabilities.
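For context, the "max 1512" in that warning follows from your settings: max-write-cache / write-block-size = 1512M / 1M = 1512 buffered write blocks. Once the pending write queue grows past that depth, writes start failing with this warning until the device catches up.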

See:

Yes, I was looking at that. Thanks for sharing it. So there are a number of things I can do here:

  1. Decrease the concurrency level further - not a good idea.
  2. Remodel my schema so that the same key is not hit multiple times from multiple clients at the same time. I think this is the right way to go.
  3. Get better AWS instances, which I am not so keen on, as we will hit the same problem some day or other. In any case we plan to use i3 instances for the production use case.

I will update this thread with whichever option I go with and how it affects the writes.

Thanks, Soudipta

Agreed, if the root problem isn’t addressed it will likely surface again.

I also wanted to ask if there is an error code that I can catch for the queue too deep error. Is it 152?

It's 18. Just looked at the docs.

See shared KB article:
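In case it is useful, here is a minimal sketch (again using the Aerospike Java client, with placeholder names) of catching that error and backing off so the device write cache can drain; error code 18 corresponds to ResultCode.DEVICE_OVERLOAD in the Java client:

    import com.aerospike.client.AerospikeClient;
    import com.aerospike.client.AerospikeException;
    import com.aerospike.client.Bin;
    import com.aerospike.client.Key;
    import com.aerospike.client.ResultCode;
    import com.aerospike.client.policy.WritePolicy;

    public class OverloadRetryExample {
        // Retry a single put with exponential backoff when the device write
        // queue is too deep (ResultCode.DEVICE_OVERLOAD == 18).
        static void putWithBackoff(AerospikeClient client, WritePolicy policy,
                                   Key key, Bin bin) throws InterruptedException {
            long sleepMs = 50;
            for (int attempt = 0; attempt < 5; attempt++) {
                try {
                    client.put(policy, key, bin);
                    return;
                } catch (AerospikeException ae) {
                    if (ae.getResultCode() != ResultCode.DEVICE_OVERLOAD) {
                        throw ae;  // only retry on queue-too-deep overload
                    }
                    Thread.sleep(sleepMs);
                    sleepMs *= 2;  // back off so the write cache can drain
                }
            }
            throw new AerospikeException(ResultCode.DEVICE_OVERLOAD,
                    "write still failing after retries");
        }
    }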