Distinct bin udf

udf
python

#1

Hi,

I am running simple exercises over the iris dataset (which is very small) to check how some features can be used.

I have coded a distinct UDF like this:

-- Copy every bin of the record into a map
local function map_generic(rec)
    local names = record.bin_names(rec)
    local ret = map{}
    for i, name in ipairs(names) do
        ret[name] = rec[name]
    end
    return ret
end


function distinct_bin(stream, bin)
  -- Record each seen value of the bin as a map key; the value 1 is
  -- only a placeholder
  local function accumulate(currentList, nextElement)
    local key = nextElement[bin]
    if currentList[key] == nil then
      currentList[key] = 1
    end
    return currentList
  end

  -- On a key collision either value will do, since only the keys matter
  local function mymerge(a, b)
    return a
  end

  local function reducer(this, that)
    return map.merge(this, that, mymerge)
  end

  return stream : map(map_generic) : aggregate(map{}, accumulate) : reduce(reducer)
end
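For what it's worth, the accumulate/merge logic above can be sanity-checked in plain Python outside the cluster (a sketch of the semantics only, not the Aerospike stream runtime; the sample records are made up):

```python
def accumulate(current, element, bin):
    # Use the bin value as a map key; the value 1 is just a placeholder
    key = element[bin]
    if key not in current:
        current[key] = 1
    return current

def reduce_maps(this, that):
    # Merge two partial results; on a key collision either value works,
    # since only the keys matter
    merged = dict(that)
    merged.update(this)
    return merged

records = [{'species': 'setosa'}, {'species': 'virginica'},
           {'species': 'setosa'}]

# Simulate two partial streams being accumulated, then reduced
partial_a = {}
for rec in records[:2]:
    partial_a = accumulate(partial_a, rec, 'species')
partial_b = accumulate({}, records[2], 'species')

result = reduce_maps(partial_a, partial_b)
print(sorted(result.keys()))  # the distinct values
```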

and I have a client in Python:

import aerospike
import aerospike.predicates as p
import sys
import os
sys.path.append(os.path.abspath('../'))
from config import config, NAMESPACE

client = aerospike.client(config).connect()
client.udf_put('./distinct.lua')
client.index_integer_create(NAMESPACE, 'iris', 'petal_width', 'idx_petal_width')

def get_distinct(bin):
    res = []

    # The aggregation returns a single map whose keys are the
    # distinct values. Note: in Python 3, dict.keys() returns a view
    # with no .sort(), so sort a copy instead.
    def add_record(record):
        res.append(sorted(record.keys()))

    query = client.query(NAMESPACE, 'iris')
    query.apply('distinct', 'distinct_bin', [bin])
    query.foreach(add_record)
    return res[0]


res = get_distinct('species')
print(res)
assert res == ['setosa', 'versicolor', 'virginica']

res = get_distinct('petal_length')
print(res)

The questions are:

The way I achieve distinct looks ugly, because the map values are not interesting at all. Is there a more elegant way to reach the solution?

Aggregate/reduce returns one single result. If the number of distinct values grows, when does this approach start to become a performance problem?

Thanks


#2

I think the mapper can be coded better by being less generic:

  -- Keep only the bin of interest instead of copying every bin
  local function mapper(rec)
      local ret = map{}
      ret[bin] = rec[bin]
      return ret
  end

I want to return only the keys of the final map. Can I apply another reducer for this?
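Right now I just strip the placeholder values on the client side (a sketch, assuming `result` stands for the single map the aggregation returns):

```python
def distinct_from_result(result):
    # The aggregation returns one map whose keys are the distinct
    # values; the integer values are placeholders, so drop them here
    return sorted(result.keys())

print(distinct_from_result({'setosa': 1, 'virginica': 1, 'versicolor': 1}))
```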

Thanks


#3

If you are experimenting with UDFs and/or want to accumulate some values (like unique bins), what you are doing is the right way to go. I cannot think of a significantly better approach. Coming to your second question: obviously, as the size of the map grows, it is going to take a performance hit. But as you are only maintaining an integer for each key, it will not be too bad. You should cross-check by experimenting with a large number of unique bins.

However, if your main intention is to know the unique bin count (or names) and you don't care whether you use UDFs or not, you can use the info command API and send the command 'bins'. You need to send this command to each node in the cluster. This approach is far more lightweight because it does not need to do any I/O at all: it just walks the metadata of the namespace and returns the info. Note that this info command gives the bins of the entire namespace; you cannot restrict it to the records that match an arbitrary secondary index query.
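As a rough illustration, parsing one node's 'bins' response could look like this (a sketch only; it assumes the response format is `ns:bin_names=N,bin_names_quota=M,name1,name2;` per namespace, which you should check against your server version's docs, and that the per-node strings come from something like the Python client's `info_all('bins')`):

```python
def parse_bins_info(response):
    """Parse one node's response to the 'bins' info command.

    Assumes the format 'ns:bin_names=N,bin_names_quota=M,b1,b2;...'.
    """
    bins_by_ns = {}
    for ns_part in response.strip(';').split(';'):
        ns, _, fields = ns_part.partition(':')
        # Bin names are the comma-separated fields without an '='
        names = [f for f in fields.split(',') if '=' not in f]
        bins_by_ns[ns] = names
    return bins_by_ns

sample = 'test:bin_names=3,bin_names_quota=65535,species,petal_length,petal_width;'
print(parse_bins_info(sample))
```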


#4

Thanks @sunil,

Yes, I am experimenting and trying to learn the “Aerospike way” of doing things.

Which of the following approaches would be better with Aerospike?

  1. Read the entire set, filtering by secondary index: around 10M records out of 200M total records.

  2. Do it in two phases: first extract the distinct values (35k) for the bin, then do N ranged queries with the between operator.

How comfortable would Aerospike be reading a large dataset? Are there “best practices” for this issue?
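For option 2, I would chunk the sorted distinct values into N ranges and run one between query per range. A plain-Python sketch of just the chunking (the actual queries would use the secondary index; `make_ranges` and `chunk_size` are names I made up):

```python
def make_ranges(values, chunk_size):
    # Split the distinct values into (lo, hi) pairs, one per ranged
    # query; each pair would bound a 'between' predicate
    ordered = sorted(values)
    ranges = []
    for i in range(0, len(ordered), chunk_size):
        chunk = ordered[i:i + chunk_size]
        ranges.append((chunk[0], chunk[-1]))
    return ranges

print(make_ranges([5, 1, 9, 3, 7, 2], 2))
```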

Thanks again


#5

From the info that you shared, I think option 1 is better, as you get everything in one shot. Secondary index queries are good when they return a bunch of records in one go. This avoids the many back-and-forth round trips of N different queries.