Distinct bin UDF


I am running simple exercises on the iris dataset (very small) to see how some features can be used.

I have coded a distinct UDF like this:

-- Copy every bin of the record into a plain map.
local function map_generic(rec)
    local names = record.bin_names(rec)
    local ret = map{}
    for i, name in ipairs(names) do
        ret[name] = rec[name]
    end
    return ret
end

function distinct_bin(stream, bin)
  -- Record each value of the bin once; the stored 1 is only a placeholder.
  local function accumulate(currentList, nextElement)
    local key = nextElement[bin]
    if currentList[key] == nil then
      currentList[key] = 1
    end
    return currentList
  end

  -- When both maps hold the same key, keep the first value.
  local function mymerge(a, b)
    return a
  end

  local function reducer(this, that)
    return map.merge(this, that, mymerge)
  end

  return stream : map(map_generic) : aggregate(map{}, accumulate) : reduce(reducer)
end

and I have a client in Python:

import aerospike
import aerospike.predicates as p
import sys
import os
from config import config, NAMESPACE

client = aerospike.client(config).connect()
client.index_integer_create(NAMESPACE, 'iris', 'petal_width', 'idx_petal_width')

def get_distinct(bin):
    res = []

    def add_record(record):
        # Collect whatever the aggregation returns.
        res.append(record)

    query = client.query(NAMESPACE, 'iris')
    query.apply('distinct', 'distinct_bin', [bin])
    query.foreach(add_record)
    return res[0]

res = get_distinct('species')
# The UDF returns a map, so compare its keys rather than the map itself.
assert sorted(res.keys()) == ['setosa', 'versicolor', 'virginica']

res = get_distinct('petal_length')

The questions are:

The way to achieve distinct looks ugly because the map values are not interesting at all. Is there a more elegant way to reach the solution?

Aggregate/reduce returns a single result. If the number of distinct values grows, when does this approach start to become a performance problem?


I think the mapper can be coded better by being less generic:

  -- bin is the argument of the enclosing distinct_bin.
  local function mapper(rec)
      local ret = map{}
      ret[bin] = rec[bin]
      return ret
  end

I want to return only the keys of the final map. Can I apply another reducer for this?
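One option that avoids another stream step is to drop the values on the client side. This is a minimal sketch, assuming the aggregation result arrives in Python as a dict (which is how the map built by distinct_bin would be expected to appear); the helper name distinct_values is hypothetical.

```python
def distinct_values(result):
    """Return the sorted keys of the map produced by the distinct_bin UDF."""
    # The values are only the placeholder 1 set by accumulate(), so drop them.
    return sorted(result.keys())


# Example with a dict shaped like the UDF output for the 'species' bin.
sample = {'virginica': 1, 'setosa': 1, 'versicolor': 1}
print(distinct_values(sample))
```

Sorting is optional; it is used here only so that the result can be compared against a fixed list.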


If you are experimenting with UDFs and/or want to accumulate some values (like unique bins), what you are doing is the right way to go. I cannot think of a significantly better approach. As for your second question: as the map grows, performance will take a hit. But since you are keeping only an integer for each key, it will not be too bad. You should cross-check by experimenting with a large number of unique bins.
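To see why the cost follows the number of distinct values rather than the number of records, here is a pure-Python model of the same aggregate/reduce flow. It does not touch Aerospike at all; the split into partitions and every function name are assumptions made only for illustration.

```python
from functools import reduce


def accumulate(seen, record, bin_name):
    # Mirror of the Lua accumulate(): keep each value once with a placeholder 1.
    seen.setdefault(record[bin_name], 1)
    return seen


def merge_maps(this, that):
    # Mirror of map.merge with a merge function that keeps the first value.
    merged = dict(that)
    merged.update(this)
    return merged


def distinct_bin(partitions, bin_name):
    # Each partition builds its own partial map, then the partials are reduced.
    partials = []
    for records in partitions:
        seen = {}
        for rec in records:
            accumulate(seen, rec, bin_name)
        partials.append(seen)
    return reduce(merge_maps, partials, {})


partitions = [
    [{'species': 'setosa'}, {'species': 'versicolor'}],
    [{'species': 'versicolor'}, {'species': 'virginica'}],
]
result = distinct_bin(partitions, 'species')
print(len(result), sorted(result))
```

Every partial map, and the final one, holds at most one entry per distinct value, so memory and merge time grow with the distinct count even when the record count is much larger.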

However, if your main intention is to know the unique bin count (or names) and you don’t care whether you use UDFs or not, you can use the info command API and send the command ‘bins’ (output format). You need to send this command to each node in the cluster. This approach is far more lightweight because it does not need to do any I/O at all: we just walk the metadata of the namespace and get the info. Note that this info command gives the bins of the entire namespace. You cannot restrict it to the records that match an arbitrary secondary index query.

Thanks @sunil,

Yes, I am experimenting and trying to get the “aerospike way” to do things.

Which one of the following approaches would be better with Aerospike?

  1. read the entire set, filtering by secondary index: around 10M records out of 200M total records

  2. do it in two phases: first extract the distinct values (35k) for the bin, and later do N ranged queries with the between operator

How comfortable would Aerospike feel reading a large dataset? Are there “best practices” for this issue?

Thanks again

From the info that you shared, I think option 1 is better, as you get everything in one shot. Secondary index queries work well when they return a bunch of records in one go. This avoids the many round trips that N different queries would need.
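Option 1 could be sketched in Python roughly as follows. This is an illustration only: the helper name read_by_index is hypothetical, and the predicate is passed in by the caller so that the function itself makes no assumption about the bin or its range.

```python
def read_by_index(client, namespace, set_name, predicate):
    """Run one secondary index query and collect the bins of every match."""
    rows = []

    def on_record(record):
        # The Python client hands each match to the callback as (key, meta, bins).
        _key, _meta, bins = record
        rows.append(bins)

    query = client.query(namespace, set_name)
    query.where(predicate)
    query.foreach(on_record)
    return rows
```

With the client from the question this might be called as read_by_index(client, NAMESPACE, 'iris', p.between('petal_width', 1, 2)), where the bounds are placeholders. For millions of matches it would be better to process each record inside the callback instead of keeping them all in a list.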