Thank you for reporting this. This is a shortcoming in the documentation. The number of transaction queues only defaults to the number of CPUs if you have at least one namespace that is not in-memory.
In your case, your test namespace uses `storage-engine memory` and your production namespace uses `data-in-memory true`. This means that both namespaces are counted as in-memory. That's why the number of transaction queues still defaults to 4. Yes, this should be documented in greater detail.
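For illustration, here is a minimal sketch of what such a configuration looks like (the sizes, the device path, and the explicit transaction-queues value are placeholders, not taken from your actual config). Both namespace stanzas below count as in-memory under the rule above:

```
service {
    # Optional explicit override of the default discussed above.
    transaction-queues 8
}

namespace test {
    memory-size 1G
    storage-engine memory            # purely in-memory namespace
}

namespace production {
    memory-size 4G
    storage-engine device {
        device /dev/sdb              # placeholder device
        data-in-memory true          # data also kept in memory -> counted as in-memory
    }
}
```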
The reasoning is that transactions for in-memory namespaces are not dispatched to the transaction threads, i.e., the transaction queues and threads would be basically idle. Instead, the transactions for in-memory namespaces run directly on the service threads to avoid adding latency.
For namespaces that are not in-memory, the added dispatch latency is less critical, because the bulk of the transaction latency comes from SSD accesses. That's why it's OK to dispatch their transactions to the transaction threads.
A higher number of client connections can point to a temporary latency spike on a node. Each node in a cluster gets roughly the same TPS. Suppose that we have 10,000 TPS per node. Suppose that each transaction takes 1 ms. This means that one client connection can do 1,000 transactions per second, as transactions are sequential per client connection and we’re assuming 1 ms per transaction. At 10,000 TPS, the client would thus open 10 connections, so that we’d end up with 1,000 TPS/con x 10 con = 10,000 TPS.
Now, suppose that there's a temporary latency spike to 100 ms on a node, e.g., because of a network hiccup. Then the client would open 1,000 connections, because we'd now have 10 TPS/con and we'd thus need 1,000 connections to achieve 10,000 TPS.
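Put generally (same illustrative numbers as above), the number of connections the client ends up opening is roughly the target TPS per node multiplied by the per-transaction latency:

```
connections ≈ TPS per node × latency per transaction
            = 10,000 × 0.001 s = 10 connections       (1 ms per transaction)
            = 10,000 × 0.1 s   = 1,000 connections    (100 ms latency spike)
```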
Currently, the client never reduces the number of open connections, so a network hiccup has a lasting effect. This may be why you're seeing an uneven distribution of client connections across clusters.