I am doing a PoC on Aerospike for my use case. As I understand from the documentation, Aerospike can work with strong consistency in the Enterprise Edition, which I am willing to opt for.
Now the thing is, in my use case, a user request can result in updating close to 30K keys in the Aerospike cluster, and since Aerospike can update only one record per transaction, I am wondering how I can update 30K keys in a transaction.
For a single user transaction to touch 30k records, I don't know if that's a sustainable model. There is also no way to lock or atomically operate on more than one record at a time. Maybe you can explain your use case and the current data model, and we can suggest an alternate approach?
If you must update 30k records, you would need 30k put calls.
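For example, a minimal sketch with the Java client, assuming a namespace "hr" and a set "salaries" (both names made up) with one record per employee:

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Bin;
import com.aerospike.client.Key;

import java.util.Map;

public class SalaryBulkUpdate {
    public static void updateAll(AerospikeClient client, Map<String, Long> newSalaries) {
        for (Map.Entry<String, Long> e : newSalaries.entrySet()) {
            // Each put is its own single-record transaction; there is no
            // atomicity across the 30k calls.
            Key key = new Key("hr", "salaries", e.getKey());
            client.put(null, key, new Bin("salary", e.getValue()));
        }
    }
}
```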
Suppose a company has 30000 employees. Now suppose HR wants to increment everyone's salary by 10%: they will just enter one update in a system saying "bump up everyone's salary by 10%", and the system will calculate the new salary for all 30000 employees and store it in Aerospike. So, as you can see, with one update from a user, the system is going to update 30000 records in Aerospike.
Right now we are using MSSQL, where we store this salary component as a JSON map of {empId → Salary}. When someone updates the salaries, we read and parse the JSON in Java, update the values, and then store the JSON back in MSSQL.
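To illustrate, the current flow is roughly this (Jackson used here for the JSON handling; the JDBC read/write around it is omitted):

```java
import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.util.Map;

public class JsonSalaryBump {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    // Read the JSON blob, bump every salary by the given percentage,
    // and return the JSON to be written back in a single UPDATE.
    public static String bumpAll(String json, double percent) throws Exception {
        Map<String, Double> salaries =
                MAPPER.readValue(json, new TypeReference<Map<String, Double>>() {});
        salaries.replaceAll((empId, salary) -> salary * (1 + percent / 100.0));
        return MAPPER.writeValueAsString(salaries);
    }
}
```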
Ideally we would want 30000 records, keyed by empId with a bin holding the salary, so that it is easier to make ad hoc queries and also easy to do other operations like removing employees, aggregating some other data, etc.
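With that model, the ad hoc queries would look something like the sketch below, assuming a secondary index exists on the "salary" bin (namespace, set, and index are all hypothetical):

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.query.Filter;
import com.aerospike.client.query.RecordSet;
import com.aerospike.client.query.Statement;

public class SalaryQueries {
    // Find all employees earning between lo and hi.
    public static void printRange(AerospikeClient client, long lo, long hi) {
        Statement stmt = new Statement();
        stmt.setNamespace("hr");
        stmt.setSetName("salaries");
        stmt.setFilter(Filter.range("salary", lo, hi));
        try (RecordSet rs = client.query(null, stmt)) {
            while (rs.next()) {
                // userKey is only populated if writes were done with sendKey=true
                System.out.println(rs.getKey().userKey + " -> "
                        + rs.getRecord().getLong("salary"));
            }
        }
    }
}
```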
I hope I am clear on the use case and how we have modeled it currently.
PS: We need a system that can support strong consistency, since this is very critical data and we cannot afford to lose records.
Well, you can't really do it atomically across 30k records. You could use a UDF to crawl through, store an "update procedure" hash (to verify whether it has been applied or not), and then perform the action on the record. You'd just send the procedure to the cluster and it would execute it for you. I'm not really sure that's the right solution for this, though, so I'll let someone else chime in.
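Driving that from the Java client would look roughly like this, assuming a registered Lua module "salary_bump" whose "apply" function checks a stored run-id bin before touching the record (module, function, and argument names are all hypothetical):

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Value;
import com.aerospike.client.query.Statement;
import com.aerospike.client.task.ExecuteTask;

public class UdfRunner {
    public static void run(AerospikeClient client, String runId, double percent) {
        Statement stmt = new Statement();
        stmt.setNamespace("hr");
        stmt.setSetName("salaries");
        // Ships the procedure to the cluster; each node applies it to its
        // own records, skipping any record already tagged with this runId.
        ExecuteTask task = client.execute(null, stmt, "salary_bump", "apply",
                Value.get(runId), Value.get(percent));
        task.waitTillComplete();
    }
}
```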
Aerospike is fast and scalable because it can treat each row independently.
That's why there are no master coordinating nodes, no master rollback log, or anything like that which impedes scalability.
The tradeoff you get for multi-million transactions per second per server, even for writes, and linear scaling, is no multi-record updates.
The database theory in this area has a lot of twists and turns, and is beyond a simple knowledge base article.
And no, UDFs don't solve the problem. It's a distributed transaction commit problem, where you have to deal with interrupted transactions, network outages, cluster splits, reads during writes, isolation tradeoffs, and other cases. Not simple.
With SC mode in EE, you will have to use a data model where the update state is built into the record. Conceptually, you keep a current timestamp in a bin, the current salary, a new timestamp, and the new salary, e.g. JacksID 20190701 100000 20190801 110000. Then you run a scan UDF with the failOnClusterChange option and make sure the scan completes successfully; re-run till done. The UDF must be idempotent. Finally, run a verification pass to confirm all salaries did get updated; you will need to keep track of the exact count of total employees.
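Here is a client-side illustration of the idempotent check the UDF would perform server-side, using the curTs/curSalary/newTs/newSalary layout above (bin, namespace, and set names are assumptions). Re-running with the same runTs is a no-op:

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Bin;
import com.aerospike.client.Key;
import com.aerospike.client.Record;

public class IdempotentBump {
    public static void apply(AerospikeClient client, String empId,
                             long runTs, double percent) {
        Key key = new Key("hr", "salaries", empId);
        Record rec = client.get(null, key);
        if (rec == null || rec.getLong("newTs") == runTs) {
            return; // missing record, or already updated in this run
        }
        long current = rec.getLong("newSalary"); // latest known salary
        // In real use you would also set a generation check on the
        // WritePolicy to guard against races between the get and the put.
        client.put(null, key,
                new Bin("curTs", rec.getLong("newTs")),
                new Bin("curSalary", current),
                new Bin("newTs", runTs),
                new Bin("newSalary", Math.round(current * (1 + percent / 100.0))));
    }
}
```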
Please excuse my ignorance; I am intrigued by your use case here.
What advantages are you going to gain by making Aerospike behave like an RDBMS? Since you have MSSQL, I gather it can do the job pretty well. Which one is the golden source here, MSSQL or Aerospike?
This is a simple use case for a fan-out/fan-in approach. Instead of updating all 30k records, why can't you compute the salary of each of the 30k users on demand (whenever a salary is fetched) by storing a common minus/plus event somewhere?
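Something like the following sketch, where one write to a single global "adjustment" record effectively updates everyone, and the adjustment is applied lazily at read time (the record layout and key names are hypothetical):

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Key;
import com.aerospike.client.Record;

public class LazySalary {
    // One write here updates everyone, e.g. put {"percent": 10} at this key.
    private static final Key ADJUSTMENT = new Key("hr", "events", "salary_adjustment");

    public static double fetchSalary(AerospikeClient client, String empId) {
        Record emp = client.get(null, new Key("hr", "salaries", empId));
        if (emp == null) {
            throw new IllegalArgumentException("unknown employee: " + empId);
        }
        Record adj = client.get(null, ADJUSTMENT);
        double base = emp.getLong("salary");
        // No pending adjustment means the stored salary is authoritative.
        double percent = (adj != null) ? adj.getDouble("percent") : 0.0;
        return base * (1 + percent / 100.0);
    }
}
```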