I was trying to figure out a way to have a one-to-many relationship in Aerospike. My use case is:
For a user, I want to maintain some multivalued field, say for example “hobbies”. So one user can have multiple hobbies.
Currently I have the set modeled as: PK, userId, List-hobbies.
Suppose a payload comes in with hobbies such as “Cricket, Tennis, Football”, etc. (this list can grow to hundreds). I want to return all users having these hobbies.
Current implementation: multiple filters are not supported except in a UDF (I do not want to use a UDF). So for each hobby, I add a single filter, something like this:
List<Record> users = new ArrayList<>();
for (String hobby : hobbies) {
    // run one single-hobby filter on the hobby list bin of the user set,
    // then collect the matching user records (queryUsersByHobby is my helper)
    users.addAll(queryUsersByHobby(hobby));
}
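For reference, a minimal sketch of what each per-hobby filter could look like with the Aerospike Java client, assuming a string secondary index of collection type LIST exists on the “hobbies” bin (the namespace, set, and bin names here are illustrative, not from the original post):

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.query.Filter;
import com.aerospike.client.query.IndexCollectionType;
import com.aerospike.client.query.RecordSet;
import com.aerospike.client.query.Statement;

public class HobbyQuery {
    // Returns the userIds of users whose "hobbies" list bin contains one hobby,
    // via a secondary-index query (one query per hobby).
    static java.util.List<Object> queryUsersByHobby(AerospikeClient client, String hobby) {
        Statement stmt = new Statement();
        stmt.setNamespace("test");   // illustrative namespace
        stmt.setSetName("users");    // illustrative set name
        // match records whose list bin "hobbies" contains this value
        stmt.setFilter(Filter.contains("hobbies", IndexCollectionType.LIST, hobby));

        java.util.List<Object> userIds = new java.util.ArrayList<>();
        RecordSet rs = client.query(null, stmt);
        try {
            while (rs.next()) {
                userIds.add(rs.getRecord().getValue("userId"));
            }
        } finally {
            rs.close();
        }
        return userIds;
    }
}
```

Note that each such query fans out to every node of the cluster, which is part of why the per-hobby loop gets slow as the payload grows.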
The above implementation works well when we pass a small payload of hobbies, but as the data in the set grows and the passed payload of hobbies gets large, I am getting bad results (30 seconds for one call).
The other thing I was thinking of is to store all hobbies as a single string and do a regex match on the record, so only a single filter is used. Is there a way to find records matching a string that I pass? For example, I query the hobbies bin with something like “CricketTennis*” and it returns all users having these terms in their hobbies bin.
Predicate filters may work for that, but they are definitely not lightning fast. Your implementation sounds really good. I’m curious whether you’ve done any troubleshooting on the slow calls when there is a large payload of hobbies?
Records are distributed uniformly across all nodes of an Aerospike cluster, so any filtering strategy becomes an operation that runs on all nodes and returns results - like a scan, a secondary-index query, or a stream UDF. Another approach can be “indexing on the fly”. Here is what I mean by that: have another set of records - PK, HobbyName, Users-List - so whenever you update a user’s hobby list, you also update the corresponding HobbyName record. Then query back all users using the HobbyName record: store the users’ PKs in the HobbyName record’s user list, then do a Batch Index Read on the list of PKs retrieved.
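A hedged sketch of the write side of that idea, assuming illustrative namespace/set/bin names and using the Java client’s list operations:

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Key;
import com.aerospike.client.Value;
import com.aerospike.client.cdt.ListOperation;

public class OnTheFlyIndex {
    // Adds a hobby to a user, and mirrors the user's PK into the hobby record
    // so the hobby record acts as a manually maintained "index".
    static void addHobby(AerospikeClient client, String userId, String hobby) {
        Key userKey  = new Key("test", "users", userId);    // user record
        Key hobbyKey = new Key("test", "hobbies", hobby);   // "index" record

        // 1) append the hobby to the user's list bin
        client.operate(null, userKey,
                ListOperation.append("hobbies", Value.get(hobby)));
        // 2) append the user's PK to the hobby's user list
        client.operate(null, hobbyKey,
                ListOperation.append("userIds", Value.get(userId)));
    }
}
```

These are two independent single-record writes, which is exactly where the failure-handling questions discussed below come from.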
While load testing, my scenario has 555 users, each with 100 hobbies in a list. My query payload contains 100 hobbies, which matches around 250 users. Here each per-thread call succeeds in 28-30 seconds.
I took a JFR recording while performing this test, and below is an image of one of the Aerospike threads. There are many Aerospike threads in the JFR like the one attached, since the Aerospike code is in MDB. The thread with the long stack trace is there because I am storing the hobbies as a List.
User set: userId, List-hobbies.
Hobby set: PK, hobbyId, List-userId.
While saving I maintain both sets, and while retrieving I first query the hobby set to get all the lists of userIds, then query back the user set. I do both queries as batch reads.
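The read path described above could look roughly like this (a sketch with assumed namespace/set/bin names, using two batch reads):

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Key;
import com.aerospike.client.Record;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class HobbyLookup {
    // Batch-read the hobby records, union their userIds, then batch-read users.
    static Record[] usersForHobbies(AerospikeClient client, List<String> hobbies) {
        Key[] hobbyKeys = hobbies.stream()
                .map(h -> new Key("test", "hobbies", h))
                .toArray(Key[]::new);
        Record[] hobbyRecords = client.get(null, hobbyKeys); // batch read #1

        Set<Object> userIds = new LinkedHashSet<>();         // dedupes users
        for (Record r : hobbyRecords) {
            if (r != null) {                                 // hobby may not exist
                userIds.addAll(r.getList("userIds"));
            }
        }
        Key[] userKeys = userIds.stream()
                .map(id -> new Key("test", "users", (String) id))
                .toArray(Key[]::new);
        return client.get(null, userKeys);                   // batch read #2
    }
}
```

Deduplicating the userIds before the second batch read matters: with 100 hobbies per user, the same user will appear in many hobby records.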
The change in performance I now see: with the same data set, the filter approach takes 16 seconds, while the modified approach takes around 800 ms, which seems great.
Well, that is great. Now also be aware of the pitfalls of both approaches.
Both give you accurate results on a perfect network with a stable cluster.
If you do filtering and the cluster undergoes a change (a node drops out or a new node joins), you may get partial or duplicate results. You can use the failOnClusterChange flag to catch that and re-run. The downside is high latency, as you have measured.
In the index-on-the-fly approach, each record update turns into two separate record updates, and you must think through all the possible failure patterns.
Which record should you update first, the UserSet or the HobbySet? If one write succeeds and the other fails, what is your strategy? You must catch write failures and timeouts on each write and decide based on your data model. Is it better to write HobbySet first? Then if the UserSet write never completes, you may subsequently read a record that does not have the data - can your reads deal with that? The other way around, if you update UserSet but fail to update HobbySet and abandon the write, that entry may never be indexed. Is there a background job you can use to periodically validate the HobbySet records? Or do you keep retrying the write until it succeeds?
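One possible shape of that error handling - writing the HobbySet first, and handing a failed UserSet write to a background repair job - could look like this (the repair queue is a hypothetical mechanism of your own, not an Aerospike feature; key and bin names are illustrative):

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.AerospikeException;
import com.aerospike.client.Key;
import com.aerospike.client.Value;
import com.aerospike.client.cdt.ListOperation;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

public class DualWrite {
    // Hypothetical queue drained by a background retry/repair job.
    static final Queue<Key> repairQueue = new ConcurrentLinkedQueue<>();

    static void addHobby(AerospikeClient client, Key hobbyKey, Key userKey,
                         String userId, String hobby) {
        // Write the index (HobbySet) first: if this throws, we abort before
        // touching the user record, so the index never lags behind the data.
        client.operate(null, hobbyKey,
                ListOperation.append("userIds", Value.get(userId)));
        try {
            client.operate(null, userKey,
                    ListOperation.append("hobbies", Value.get(hobby)));
        } catch (AerospikeException e) {
            // Index updated but user record wasn't: schedule a repair/retry
            repairQueue.add(userKey);
        }
    }
}
```

Whether HobbySet-first is actually the right order depends, as noted above, on whether your reads can tolerate an index entry pointing at data that has not landed yet.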
Secondly, if you are using the Community Edition, you can only use AP mode. What happens if the cluster splits? You may end up writing a new index record in HobbySet and losing all previous UserList entries when the cluster heals: regardless of how the two versions get conflict-resolved, the user lists won’t get merged.

You could use strategies such as: don’t create the record if it does not exist, and first initialize each HobbySet record with some placeholder entry. This way, if the cluster splits, you will not create a brand-new HobbySet record in the split sub-cluster. If you are using the Enterprise Edition, use Strong Consistency mode and it will automatically protect you against this scenario.

So while this multi-record update will work for the most part, in the rare split-cluster scenario caused by network failure events, you may corrupt your index (HobbySet). If you have a background strategy for rebuilding the HobbySet records, that would be a plausible way to deal with it. Again, it depends on how critical the data model is. However, for any multi-record update model, I would strongly recommend the Enterprise Edition’s Strong Consistency mode of operation.
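The “don’t create if it does not exist” strategy maps onto the Java client’s RecordExistsAction. A sketch, with illustrative key and bin names: the HobbySet record is created once with a placeholder, and all later appends use UPDATE_ONLY so a split sub-cluster cannot silently spawn a fresh record that replaces the old one on heal:

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Bin;
import com.aerospike.client.Key;
import com.aerospike.client.Value;
import com.aerospike.client.cdt.ListOperation;
import com.aerospike.client.policy.RecordExistsAction;
import com.aerospike.client.policy.WritePolicy;

public class SafeIndexWrite {
    // One-time initialization: create the HobbySet record with a placeholder
    // empty list. CREATE_ONLY fails if the record already exists.
    static void initHobby(AerospikeClient client, Key hobbyKey) {
        WritePolicy createOnly = new WritePolicy();
        createOnly.recordExistsAction = RecordExistsAction.CREATE_ONLY;
        client.put(createOnly, hobbyKey,
                new Bin("userIds", java.util.Collections.emptyList()));
    }

    // All subsequent appends refuse to create a record: in a split sub-cluster
    // without the original record, this write fails instead of creating a
    // brand-new HobbySet record.
    static void addUser(AerospikeClient client, Key hobbyKey, String userId) {
        WritePolicy updateOnly = new WritePolicy();
        updateOnly.recordExistsAction = RecordExistsAction.UPDATE_ONLY;
        client.operate(updateOnly, hobbyKey,
                ListOperation.append("userIds", Value.get(userId)));
    }
}
```

This only narrows the window; as stated above, Strong Consistency mode (Enterprise Edition) is the robust answer for multi-record updates.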