I am running simple exercises over the iris dataset (very small) to check how some features can be used.
I have coded a distinct UDF like this:
-- Project a record into a plain map of bin name -> bin value,
-- so the rest of the pipeline works on maps instead of records.
local function map_generic(rec)
    local names = record.bin_names(rec)
    local ret = map{}
    for i, name in ipairs(names) do
        ret[name] = rec[name]
    end
    return ret
end

function distinct_bin(stream, bin)
    -- Collect each distinct value of the requested bin as a map key;
    -- the value 1 is just a placeholder.
    local function accumulate(currentList, nextElement)
        local key = nextElement[bin]
        if currentList[key] == nil then
            currentList[key] = 1
        end
        return currentList
    end
    -- On key collisions during the merge, keep either side;
    -- only the keys matter.
    local function mymerge(a, b)
        return a
    end
    local function reducer(this, that)
        return map.merge(this, that, mymerge)
    end
    return stream : map(map_generic) : aggregate(map{}, accumulate) : reduce(reducer)
end
and I have a client in Python:
import aerospike
from aerospike import predicates as p
import sys
import os

sys.path.append(os.path.abspath('../'))
from config import config, NAMESPACE

client = aerospike.client(config).connect()
client.udf_put('./distinct.lua')
client.index_integer_create(NAMESPACE, 'iris', 'petal_width', 'idx_petal_width')

def get_distinct(bin):
    res = []

    # The aggregation yields a single map whose keys are the
    # distinct values; collect those keys.
    def add_record(record):
        res.append(list(record.keys()))

    query = client.query(NAMESPACE, 'iris')
    query.apply('distinct', 'distinct_bin', [bin])
    query.foreach(add_record)
    res[0].sort()
    return res[0]

res = get_distinct('species')
print(res)
assert res == ['setosa', 'versicolor', 'virginica']

res = get_distinct('petal_length')
print(res)
The questions are:
The way I achieve distinct seems ugly, because the map values are not interesting at all. Is there a more elegant way to reach the solution?
Aggregate/reduce returns one single result. As the number of distinct values grows, when does this approach start to become a performance problem?
If you are experimenting with UDFs and/or want to accumulate some values (like unique bin values), what you are doing is the right way to go; I cannot think of a significantly better approach. Coming to your second question, obviously, as the size of the map grows, it is going to take a performance hit. But as you are maintaining only an integer for each key, it will not be too bad. You should cross-check by experimenting with a large number of unique values.
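For what it is worth, here is a minimal sketch of a small simplification within the same approach: read the requested bin directly in the accumulator, so the map_generic projection step is not needed. The function name distinct_bin_direct and the structure below are illustrative, not from the original code:

-- Hypothetical variant of distinct_bin: aggregate straight over the
-- record stream instead of projecting each record into a map first.
function distinct_bin_direct(stream, bin)
    local function accumulate(current, rec)
        local key = rec[bin]
        -- use the bin value itself as the map key; 1 is a placeholder
        if key ~= nil and current[key] == nil then
            current[key] = 1
        end
        return current
    end
    local function reducer(this, that)
        -- on collisions either side works, since only the keys matter
        return map.merge(this, that, function(a, b) return a end)
    end
    return stream : aggregate(map{}, accumulate) : reduce(reducer)
end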
However, if your main intention is to know the unique bin count (or names) and you don't care whether you use UDFs or not, you can use the info command API and send the 'bins' command. You need to send this command to each node in the cluster. This approach is far more lightweight because it does not need to do any I/O at all; we just walk the metadata of the namespace and get the info. Note that this info command gives the bins of the entire namespace. You cannot restrict it to the records that match an arbitrary secondary index query.
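As a rough illustration, here is a hedged Python sketch that sends the 'bins' info command to every node via the client's info_all method. The {node: (error, response)} return shape and the response parsing are assumptions and may differ between client and server versions:

# Hedged sketch: collect the bin names reported by the 'bins' info
# command from every node in the cluster.
def get_bin_names(namespace):
    names = set()
    # info_all broadcasts the command; return shape assumed to be
    # {node_name: (error, response_string)}
    for node, (err, resp) in client.info_all('bins/' + namespace).items():
        if err or not resp:
            continue
        for token in resp.strip().rstrip(';').split(','):
            if '=' not in token:  # skip counters such as bin_names=5
                names.add(token)
    return sorted(names)

print(get_bin_names(NAMESPACE))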
From the info that you shared, I think option 1 is better, as you get everything in one shot. Secondary index queries are good when they return a bunch of records in one go; this avoids many back-and-forth round trips with N different queries.
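To make that concrete, a hypothetical sketch of option 1 with the stream UDF restricted by the idx_petal_width secondary index created earlier (the filter value below is made up):

# Hypothetical: run the same stream UDF only over the records
# matched by the secondary index on petal_width.
query = client.query(NAMESPACE, 'iris')
query.where(p.equals('petal_width', 2))  # made-up filter value
query.apply('distinct', 'distinct_bin', ['species'])
query.foreach(lambda result: print(sorted(result.keys())))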