I need some guidance on fine-tuning my Aerospike cluster to handle a large volume of writes

Hi all, :smiling_face_with_three_hearts:

I am building an Aerospike cluster to handle a high volume of data writes (around 100,000 writes per second). I am aiming for optimal performance and stability, but I am facing some write latency spikes, particularly during peak loads.

Here’s my current setup: :innocent:

Cluster Size: 5 Nodes

Instance Type: AWS EC2 m5.xlarge

Storage: Hybrid - In-memory and SSD

Network: Gigabit Ethernet

I have already adjusted some settings based on Aerospike documentation and online resources:

write-block-size: 1 MB

max-write-cache: 256 MB

replication-factor: 2

Despite these tweaks, some nodes get overloaded under peak load. I suspect there might be additional configuration or architecture optimizations I’m missing.
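For a sense of scale, here is a rough back-of-the-envelope sketch of the per-node write load implied by the numbers above. It assumes records are spread evenly across nodes, which real clusters only approximate; the function name is made up for illustration.

```python
def per_node_write_ops(cluster_writes_per_sec: int, nodes: int, replication_factor: int) -> float:
    """Record writes (master + replica copies) each node applies per second,
    assuming an even distribution across the cluster."""
    return cluster_writes_per_sec * replication_factor / nodes


# 100,000 client writes/s, replication-factor 2, 5 nodes.
print(f"{per_node_write_ops(100_000, 5, 2):,.0f} record writes per node per second")

# Same load with one node down: the remaining 4 nodes each take more.
print(f"{per_node_write_ops(100_000, 4, 2):,.0f} record writes per node per second")
```

So each node has to apply roughly twice the "fair share" of client writes because of replication, and the figure rises further whenever a node is out of the cluster.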

Looking for your expertise on: :thinking:

1. Configuration: Are there other settings I can adjust to improve write throughput and stability?

2. SSD Utilization: What are best practices for optimizing SSD performance in this scenario?

3. Network: Could network settings or hardware upgrades make a significant difference?

I also checked this :point_right: http://www.aerospike.com/apidocs/java/com/aerospike/client/policy/record/existsactionreplace but I have not found a solution there.

Thanks :heart_eyes:

Respected member

Are you using Server Version 7.1?

Here are some Aerospike configuration parameters that can help optimize your write performance.

1. Configuration

Write Configuration:

  • write-block-size: Your setting of 1 MB is generally good for larger writes. Make sure this aligns with your average object size.
  • max-write-cache: 256 MB is reasonable, but you may need to adjust based on your workload and memory availability.

Memory Configuration:

  • memory-size: Ensure your in-memory configuration is adequate for your dataset and indexes.

Threads and Queues:

  • service-threads: A common starting point is to match the number of vCPUs. An m5.xlarge instance has 4 vCPUs.
  • transaction-queues: On servers older than 4.7, the number of transaction queues should generally match the number of service threads, so you might start with 4. Server 4.7 removed this setting, so skip it on newer versions.
  • transaction-threads-per-queue: Also only present on servers older than 4.7. Tune it based on your workload: start with a small number and increase based on performance testing.

Replication Configuration:

  • replication-factor: You have set this to 2, which is good for data redundancy and availability.

2. SSD Utilization

SSD Configuration:

  • Data Layout: Ensure that your SSDs are formatted with a file system optimized for SSD use (e.g., ext4 with specific SSD options or XFS).
  • use-direct-io: Set storage-engine.device.use-direct-io to true to bypass the OS cache and reduce write latency.
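  • write-block-size: Already set to 1 MB; ensure this aligns with your object size to avoid unnecessary overhead.

Defragmentation:

  • defrag-lwm-pct: Blocks whose used space falls below this percentage become eligible for defragmentation, and the default is 50. Raising it (e.g., to 60) makes defragmentation more aggressive and frees space sooner, at the cost of extra write amplification on the SSDs; lowering it does the opposite. Change it in small steps and watch device load.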
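To make the defrag threshold more concrete, here is a small sketch of how I understand it, not Aerospike's actual code. It assumes a write block becomes a defrag candidate once its used percentage drops below defrag-lwm-pct, and it uses the common rule of thumb that write amplification is about 1 / (1 - threshold). Both function names are hypothetical.

```python
def is_defrag_eligible(used_bytes: int, block_size: int, defrag_lwm_pct: int) -> bool:
    """True if the block's used percentage is below the threshold (assumed rule)."""
    return used_bytes * 100 < block_size * defrag_lwm_pct


def approx_write_amplification(defrag_lwm_pct: int) -> float:
    """Rule-of-thumb estimate: blocks are rewritten when up to lwm% still holds live data."""
    if not 0 <= defrag_lwm_pct < 100:
        raise ValueError("defrag_lwm_pct must be in [0, 100)")
    return 100 / (100 - defrag_lwm_pct)


ONE_MB = 1024 * 1024
print(is_defrag_eligible(400 * 1024, ONE_MB, 50))  # block is ~39% used
print(is_defrag_eligible(700 * 1024, ONE_MB, 50))  # block is ~68% used
for pct in (50, 60, 75):
    print(pct, approx_write_amplification(pct))
```

The estimate is only a rough guide, but it shows why a higher threshold costs more device writes, which matters when the SSDs are already the bottleneck.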

3. Network Optimization

Network Configuration:

  • network.max-msgs-per-type: Increase this value to allow more messages to be processed simultaneously.
  • batch-max-requests: Increase this if you use batch operations, to handle more requests at once.

AWS Specific Settings:

  • Enhanced Networking: Ensure enhanced networking is enabled (e.g., using Elastic Network Adapter (ENA) on AWS) to improve throughput and reduce latency.
  • Placement Groups: Use placement groups to ensure your instances are in close physical proximity, reducing network latency.
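Before changing network hardware, it may help to estimate how much of a Gigabit link the write traffic alone would use. The sketch below assumes an average record size of 1 KiB (a made-up figure, please substitute your own), even distribution across nodes, and ignores protocol overhead, reads, migrations, and heartbeats.

```python
def per_node_inbound_mbit(cluster_writes_per_sec: int, nodes: int,
                          replication_factor: int, record_bytes: int) -> float:
    """Approximate inbound write traffic per node in megabits per second."""
    # Writes arriving from clients for partitions this node masters.
    master_bytes = cluster_writes_per_sec / nodes * record_bytes
    # Replica copies arriving from the other nodes.
    replica_bytes = cluster_writes_per_sec * (replication_factor - 1) / nodes * record_bytes
    return (master_bytes + replica_bytes) * 8 / 1_000_000


ASSUMED_RECORD_BYTES = 1024  # hypothetical average record size
LINK_MBIT = 1000             # Gigabit Ethernet, as described in the question

inbound = per_node_inbound_mbit(100_000, 5, 2, ASSUMED_RECORD_BYTES)
print(f"~{inbound:.0f} Mbit/s inbound per node, {inbound / LINK_MBIT:.0%} of the link")
```

With larger records the link fills up quickly, so the answer to the network question depends heavily on your real record size.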

Example Configuration Snippet

service {
    service-threads 4
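    transaction-queues 4            # Servers older than 4.7 only; remove on newer versions.
    transaction-threads-per-queue 4 # Servers older than 4.7 only; remove on newer versions.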
}
namespace mynamespace {
    replication-factor 2
    memory-size 4G
    default-ttl 30d # 30 days, use 0 to never expire/evict.
    storage-engine device {
        file /opt/aerospike/data/mynamespace.dat
        filesize 16G
        data-in-memory true # Store data in memory in addition to file.
        write-block-size 1M
        use-direct-io true
        defrag-lwm-pct 50 # Belongs inside storage-engine; 50 is the default.
    }
}
network {
    service {
        address any
        port 3000
        access-address <public-ip>
    }
    fabric {
        address any
        port 3001
    }
    info {
        address any
        port 3003
    }
    heartbeat {
        mode mesh
        address any
        port 3002
        mesh-seed-address-port <other-node-ip> 3002
        interval 150
        timeout 10
    }
}

Monitoring and Benchmarking

Monitoring:

  • Use tools like the Aerospike Monitoring Stack, Prometheus, and Grafana to continuously monitor your Aerospike metrics.
  • Focus on metrics like write-q, write-timeout, client_write_error, disk_write_time, and replication-latency.
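If you pull raw statistics yourself rather than through a dashboard, they come back as semicolon-separated key=value pairs. Below is a minimal sketch for turning such a string into a write-failure ratio. The sample string and its values are invented, and the counter names other than client_write_error (client_write_success, client_write_timeout) are my assumption of what your server version exposes, so check them against your own output.

```python
def parse_stats(raw: str) -> dict[str, str]:
    """Split 'k1=v1;k2=v2' into a dict, skipping malformed items."""
    pairs = (item.split("=", 1) for item in raw.strip().split(";") if "=" in item)
    return {key: value for key, value in pairs}


def write_failure_ratio(stats: dict[str, str]) -> float:
    """Share of client writes that ended in an error or a timeout."""
    ok = int(stats.get("client_write_success", 0))
    errors = int(stats.get("client_write_error", 0))
    timeouts = int(stats.get("client_write_timeout", 0))
    total = ok + errors + timeouts
    return (errors + timeouts) / total if total else 0.0


# Invented sample, not real cluster output.
sample = "client_write_success=990000;client_write_error=4000;client_write_timeout=6000"
print(f"{write_failure_ratio(parse_stats(sample)):.2%} of writes failed or timed out")
```

Comparing this ratio between nodes is a quick way to see whether the overload is cluster-wide or limited to a few nodes.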

Benchmarking:

  • Use Aerospike’s ACT tool to benchmark your SSDs.
  • Perform load testing to understand how your cluster behaves under peak loads.
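When reading load-test results, averages tend to hide the spikes you are chasing, so it helps to look at tail percentiles. Here is a small nearest-rank percentile sketch over a list of write latencies; the sample values are invented purely for illustration.

```python
import math


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Invented latencies in milliseconds: mostly fast, with two slow outliers.
latencies_ms = [1.0] * 98 + [5.0, 40.0]
for pct in (50, 99, 100):
    print(f"p{pct}: {percentile(latencies_ms, pct)} ms")
```

In this made-up sample the median looks perfectly healthy while the slowest writes are far worse, which is the kind of pattern a latency spike under peak load usually shows.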

Summary

By focusing on these Aerospike-specific configuration parameters and ensuring your hardware and network infrastructure are optimized, you should see improvements in write throughput and stability. If you have further questions or need more targeted advice, feel free to provide additional details.