Calculate percentiles on real time running performance with Aerospike!


#1

Hi all,

I am working on a problem. We have a large set of data which contains 4 million records. Saying we have 2 bins naming basic_salary and gender staying in the same set. Each bin has 4 millions rows. Now, I want to calculate the percentiles of basic salary where gender is Male or female. One of the tricky requirement is to compute the percentiles in real time running performance like saying 1-2 seconds.

I went through the Aerospike document and I have a feeling that MapReduce with Clustering system might be the answer. However, I have no idea how to build that kind of system because I have worked with Aerospike for just few days.

Any body has the same experience can help me to figure out the solution for my real problem?

I highly appreciated that.

Many thanks


#2

This is not use case for key-value databases. Hadoop mapreduce is possible but you cant have such low latency and you need hadoop cluster. I would try some streaming system e.g. Kafka Stream.


#3

You can use stream udfs in Aerospike to do aggregations. First, each record has two bins - so “each bin has 4 million rows” is sort of incorrect to say. You have 4 million records with each record having two bins - salary and gender. You can build a secondary index on gender and feed the results of the SI query into a stream udf to compute the percentiles … fairly straightforward. Will have to test and see on the 1 to 2 seconds timing requirement though. Can you provide an example with 10 sample records on the exact computation that you want?


#4

Hi @pgupta

Thank you for your answer. I have tried to implement doing calculation percentiles on real time but I have not finished that. My init solution is

  1. I init 4M records and put to Aerospike by running a for loop from 1 to 4M. Each key is the index or iterated number of the loop; each key has 2 bins which are basic_salary and gender.

  2. I have a simple method call FilterByGender. Inside the method, I have created the secondary index on Gender bin. Register UDF function. Inside the udf function, in order to calculate the percentiles I have to sort all the filtered record first. After that, I will calculate the percentile on the sorted list.

I am writing the UDF function and it is not easy for me as LUA is new language for me. However, I got a feeling that for whatever reasons, if you try to sort the list then you have to spend O(nlogn) time complexity. You can imagine that N is very large now then. I might be wrong but I think using only one server/node is not enough. I think we might need to use distributed system but I have very limited knowledge on Aerospike so I am trying my best to implement the solution on Distributed system using Aerospike.

Do you have any ideas on that?

Many thanks,


#5

You may have to get specific with your solution rather than figure out a general purpose solution. One trick can be duplicate storing all the male salaries in a separate record as a sorted list. The largest record size is 8MB in Aerospike - so you can see if it can cover your entire data set. LIkewise storing all female salaries in a second record as a sorted list. You still have your record of employees, just update the salary record also every time you insert or update the employee record. Now you have an easier problem at hand. If 8MB still falls short, you may have to get clever and see if you can segregate your data by storing it in pre-defined ranges into multiple 8MB records. Sorted lists are available in Aerospike. That will be an alternate to the UDF route. You can do sorting in the reduce function of a stream UDF but it will not be practical for the entire data set. You might want to read this thread: Order by option