Storing tabular/PSV data in Aerospike

spark
hadoop
query

#1

We are generating medical statistics that show the number of medical visits per practice and location. These are PSV (pipe-separated) files with the following fields:

location|practice|physician|num_visits|charges|date_of_service

Each entry is a statistic collected per day, per physician, per location, and per practice, and we want to run aggregate statistics on these.

We are going to bulk-load these PSV files into Aerospike.
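The bulk-load step would start with a small parser. A minimal Python sketch, assuming the six fields described above (the helper name and sample values are mine, not from the original data):

```python
# Parse one PSV record into a dict of bins before loading into Aerospike.
# Field order follows the header described in the question.
FIELDS = ["location", "practice", "physician", "num_visits", "charges", "date_of_service"]

def parse_psv_line(line):
    """Split a pipe-separated line into a dict, coercing numeric fields."""
    values = line.strip().split("|")
    record = dict(zip(FIELDS, values))
    # Store numeric fields as numbers so aggregations can run on them directly.
    record["num_visits"] = int(record["num_visits"])
    record["charges"] = float(record["charges"])
    return record

rec = parse_psv_line("Boston|Acme Health|Dr. Smith|12|3400.50|2023-05-01")
# rec["physician"] is "Dr. Smith"; rec["num_visits"] is the integer 12
```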

My question is about the design of keys and sets.

How do I key each entry here? These are already pre-aggregated values.

Can I store it all under one key (the practice name), with the values as one big PSV blob that I would later run aggregate queries against?

My typical queries might be:

  • give me all daily visits for a physician in a practice in a location
  • give me all applications of drugs per practice
  • give me all charges per physician per location etc.

If this is too complicated to design for, I might instead pull all the values as JSON by key and do the aggregations in the Node application.
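To illustrate that fallback, one of the example queries above (charges per physician per location) reduces to a group-and-sum once the records are in the application. A minimal sketch (the records and field names are illustrative):

```python
from collections import defaultdict

def charges_per_physician_location(records):
    """Aggregate total charges keyed by (physician, location)."""
    totals = defaultdict(float)
    for r in records:
        totals[(r["physician"], r["location"])] += r["charges"]
    return dict(totals)

records = [
    {"physician": "Dr. Smith", "location": "Boston", "charges": 100.0},
    {"physician": "Dr. Smith", "location": "Boston", "charges": 50.0},
    {"physician": "Dr. Lee", "location": "Austin", "charges": 75.0},
]
# charges_per_physician_location(records)
# → {("Dr. Smith", "Boston"): 150.0, ("Dr. Lee", "Austin"): 75.0}
```

This works at small scale, but it means shipping every record over the network on each query, which is what the composite-key design in the answer below avoids.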


#2

Hi,

There is no simple answer as to the “best way”; it depends on what you want to query, and at what speed and scale. Your data model should reflect how you want to read the data, and at what latency and throughput.

If you need low latency (1–5 ms) and high throughput (on the order of 100k operations per second) for a particular piece of data, you will need to aggregate the data as you write it to Aerospike and store it under a composite key that lets you fetch that data quickly, e.g. doctor-day-location.
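A composite key of this kind is just a deterministic string built from the grouping fields, so a point read fetches the pre-aggregated row directly. A minimal sketch (the separator, namespace, and set names are assumptions, not from the thread):

```python
# Build a deterministic user key from the grouping fields so that
# one get() retrieves the pre-aggregated record for that combination.
def composite_key(physician, day, location):
    """e.g. composite_key("Dr. Smith", "2023-05-01", "Boston")."""
    return ":".join([physician, day, location])

# With the Aerospike Python client, the full key tuple is
# (namespace, set, user_key); the write would look roughly like:
#
#   key = ("stats", "daily_visits",
#          composite_key("Dr. Smith", "2023-05-01", "Boston"))
#   client.put(key, {"num_visits": 12, "charges": 3400.50})

k = composite_key("Dr. Smith", "2023-05-01", "Boston")
# k is "Dr. Smith:2023-05-01:Boston"
```

The choice of separator only matters if the fields themselves can contain it; a character guaranteed absent from the data avoids ambiguous keys.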

If you want a statistical analysis over a period of time, and the query can take a few seconds to several minutes, then you can store the data in a less structured format and run Aerospike aggregations on it, or even run Hadoop or Spark directly on the Aerospike data.

Regards


#3

For more on this topic, please see the post from the original poster, and the community’s answers, on the Stack Overflow forum.