Ranking using score and pagination support in Aerospike

Hi Team, Glad to be back here :slight_smile: Need a little help with Ranking using score, filtering support and pagination support in Aerospike

Sample example :

  1. Rank movie content pool according to me (content size ~= 1 million)
  2. Paginated request to get top ranked 10 movies for me from my content pool
  3. New movies can keep on getting added to my content pool but no deletion
  4. Paginated request to get next top ranked 10 movies for me from my content pool which have not been served to me before and so on…

Below are the other details :

Millions of entities with below structure : 
{
     String entityId
     int score
     String entityMeta
}
1. Store 50k entities data in something similar to Redis Sorted Set. This can come anytime again as well with some other content. Hence the content will be updated time to time with new data and hence position of entities will keep on changing

2. Request comes to get top 10 data.
- Get the top 10 data from Sorted Set
- Add those top 10 in already viewed set (Now already viewed set has 10 elements)

3. Next request comes to get next top 10 data
- Get the top 10 from Sorted Set which does not have the data present in already viewed set
- Add these next top 10 in already viewed set. (Now already viewed set has 20 elements)
Requirement : 

1. Rank entity with score 
2. Filter out the entities which have been already viewed
3. Get top x entities and their entityMeta
4. If user comes again with offset = 0 for a paginated request, requirement is to remove the alreadyViewed Set and start displaying the data to the user from the top again
5. There are other non-paginated requests reading the data from main set. Hence, we cannot modify the main set.

Is Aerospike suitable for this use case. I am also looking out for redis based solutioning but facing few challenges there as well. Not sure, which is most suitable Aerospike or Redis here.

@pgupta Can you also please help here

You should be able to implement this with Aerospike: Store the movie data in a Map {id → (rank, metadata)}, and a user’s viewed movies in a List. For the next top N unviewed items for a user:

  1. Get the user’s viewed list, V.
  2. Get the top N+sizeof(V) items from the movie data, T.
  3. Remove the viewed entries from T, T - V, and take top N items, P.
  4. Add P to the user’s current viewed list. etc.

It should possible to construct one request with this logic using Expressions (see relevant doc here, here, and here), but large data sizes will require storing movie data and user info in multiple records, and accessing them in separate requests.

@neelp Can you give some examples of Expressions. Not getting it exactly from the java docs. Also instead of rank, we have scores. Will map be able to profile the highest score movies Or you mean to say that we do ranking at application level and should keep rank instead of scores here

Please take a look at this interactive Jupyter notebook tutorial on Expressions and related links. If you have data across multiple records, you may use normal CDT operations which are illustrated here. Since you are looking for the highest scoring titles, rank ordering with scores should work [CORRECTION: with negative or reversed ranks]. Also look into the INVERTED return type flag for CDT operations.

Thanks @neelp. Have few queries on this :

We will be having large data, would not be storing meta with movie_id and score for a user separately. Was going through Advanced CDT. Found below time complexities for list operations :

insert_items(0, m) : O(M + N) log M + N

get_by_rank_range :

Ordered list, namespace stores its data on SSD : O(N+M)

Ordered list, namespace stores its data in memory : O(M)

Insertion is already done for ordered list, why is get operation taking O(N+M) when data is saved on SSD?

Also, time complexity for insertion is also extremely high. Can we optimise it further using some other approach?

Please review this post for how to interpret O(n) of CDT operations. The referenced O(n) discussion is about data structure performance, which is typically a fraction of overall request execution. If the data size is large, you can employ techniques to break the data across multiple records, keep the relevant data (say, top 5000 instead on 1M titles) in a record, and so on.