Am I able to do a unique scan based on bins?

secondary
scan
udf
#1

Hello,

I’m a newbie to Aerospike. I’m using the C client library and have had a taste of UDFs. The issue I have is: is there a way for me to scan or query the following way in Aerospike?

select all records with unique column_A and column_B combinations.

Thanks.

#2

Could you expand on what you are trying to do? As stated, it doesn’t sound like something we support, but with clarification we may be able to help.

What would be the expected result if it were run on the following data set:

column_A,column_B,column_Z
a,a,a
a,a,b
a,b,a
b,a,a

#3

@kporter I would expect the following result from your dataset:

a,a
a,b
b,a

As for more details, I posted another question earlier, though I feared it was too specific. Here is the link: Complex queries

#4

I answered there, but for this particular query you could have column_C = hash(column_A, column_B) and create a secondary index on column_C. You would then need to use a stream UDF to filter out duplicate hash values.
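To make the idea concrete, here is a minimal sketch (plain Python, not Aerospike client code) of how the derived column_C value could be computed at write time. The function name `composite_hash` and the use of SHA-1 are my own assumptions for illustration; any stable hash works.

```python
import hashlib

def composite_hash(column_a: str, column_b: str) -> str:
    """Deterministic hash of the (column_A, column_B) pair.

    The separator byte prevents collisions such as
    ("ab", "c") hashing the same as ("a", "bc").
    """
    payload = column_a.encode() + b"\x00" + column_b.encode()
    return hashlib.sha1(payload).hexdigest()

# Store this value in column_C when writing the record, then
# build a secondary index on column_C to group records by pair.
```

The same pairs always map to the same column_C, so a secondary-index query on column_C finds every record sharing a given (column_A, column_B) combination; the stream UDF's job is then only to emit one record per distinct hash.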

#5

I keep forgetting: you guys don’t allow predicate filters on a query, do you? :slight_smile:

#6

Oh, you can do predicate filters on queries, but I’m not sure how you would implement uniqueness with them.

#7

If you want to compare records, you cannot do it natively, so you have to use some trick at record-creation time. As @kporter mentions, storing the hash in a third bin is one way.

Depending on the number of records in your data set, you can also keep a lookup record with a map-type bin, where each key is hash(column_A + column_B) and each value is the record’s digest (or user key, if they all belong to the same set). In the simple case, a single lookup record R1 holds { k1:v1, k2:v2, k3:v3, ... }, updated every time a record is created, where k1 = hash(bin a + bin b) and v1 = the user key or digest.

If you have a huge number of records, you will need multiple lookup records, where each Rn’s key is some significant bits of hash(bin a + bin b), so each entry maps to a specific lookup record. For example, if k1 is a 20-byte hash, use its first byte as the key of its lookup record; your total record set is then split across 256 lookup maps. Using the sizes of the maps, you can aggregate the total number of unique records, or use the values in each map to go fetch those unique records for additional processing.
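A small sketch of the bucketing described above, again in plain Python rather than Aerospike client code (the helper name `lookup_record_key` and the in-memory dict standing in for the 256 lookup records are illustrative assumptions):

```python
import hashlib

def lookup_record_key(column_a: str, column_b: str):
    """Pick which of 256 lookup records holds this pair's entry.

    The first byte of the 20-byte SHA-1 digest selects the bucket
    (lookup record R0..R255); the full digest is the map key
    inside that record's map-type bin.
    """
    digest = hashlib.sha1(
        column_a.encode() + b"\x00" + column_b.encode()
    ).digest()
    return digest[0], digest.hex()

# In-memory simulation of the lookup maps:
lookups = {}  # bucket number -> {map_key: user key}
rows = [("a", "a"), ("a", "a"), ("a", "b"), ("b", "a")]
for userkey, (a, b) in enumerate(rows):
    bucket, k = lookup_record_key(a, b)
    lookups.setdefault(bucket, {})[k] = userkey

# Summing map sizes counts the unique (a, b) pairs: 3 here,
# since ("a", "a") appears twice.
unique_count = sum(len(m) for m in lookups.values())
```

In Aerospike itself each bucket would be a separate lookup record with a map bin, updated via map operations at record-creation time.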

The only issue I see with this scheme is updating bins a and b of the same record; it will work if you never update them. If you do want to update, then go with @kporter’s method. Stream UDFs are not my preferred choice if the cluster is going through changes, like a node being added or dropping out.

On second thought, if you do want to update, then the value will have to become a list of digests, and you must check whether this record’s digest is the only entry; if so, delete that k,v pair, then insert a new one wherever it belongs. So you will have { k1:[va, vb, vc, ...], k2:[vn, vm, vp, ...] }, etc. (You will have to use GEN_CHECK_EQUAL or a check-and-set technique to update the k:v entry.)
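The move-on-update logic above can be sketched like this (plain Python over an in-memory dict; the function name `move_pair` is an illustrative assumption, and the generation check mentioned in the post is only noted in a comment since it is an Aerospike write-policy concern):

```python
def move_pair(lookup_map: dict, old_key: str, new_key: str, digest: str):
    """Move a record's digest when its (a, b) pair changes.

    Remove the digest from the old hash's list, dropping the
    map entry entirely if the list empties (i.e. this record
    was the only one with that pair), then append the digest
    under the new hash. In Aerospike, the whole update would
    be guarded by GEN_CHECK_EQUAL / check-and-set so concurrent
    writers cannot clobber each other.
    """
    old_list = lookup_map.get(old_key, [])
    if digest in old_list:
        old_list.remove(digest)
        if not old_list:
            del lookup_map[old_key]
    lookup_map.setdefault(new_key, []).append(digest)

m = {"k1": ["va", "vb"], "k2": ["vn"]}
move_pair(m, "k2", "k1", "vn")
# "vn" was k2's only entry, so k2 is gone and k1 gained "vn".
```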

1 Like