Am I able to do a unique scan based on bins?

secondary
scan
udf
#1

Hello,

I’m a newbie to Aerospike. I’m using the C client library and have had a taste of UDFs. The issue I have is: is there a way for me to scan or query the following way in Aerospike?

select all records with unique column_A and column_B combinations.

Thanks.

#2

Could you expand on what you are trying to do? As stated, it doesn’t sound like something we support, but with clarification we may be able to help.

What would be the expected result if it were run on the following data set:

column_A,column_B,column_Z
a,a,a
a,a,b
a,b,a
b,a,a

#3

@kporter I would expect the following result from your dataset:

a,a
a,b
b,a

As for more details, I posted another question earlier, though I feared it was too specific. Here is the link: Complex queries

#4

I answered there, but for this particular query you could have column_C = hash(column_A, column_B) and create a secondary index on column_C. You would then need to use a stream UDF to filter out duplicate hash values.
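To make the idea concrete, here is a minimal sketch (plain Python, not Aerospike client code) of how the derived column_C value could be computed at write time. The function name `composite_hash` and the use of SHA-1 are my own assumptions for illustration; any stable hash works.

```python
import hashlib

def composite_hash(column_a: str, column_b: str) -> str:
    """Deterministic hash of the (column_A, column_B) pair.

    The separator byte prevents collisions such as
    ("ab", "c") hashing the same as ("a", "bc").
    """
    payload = column_a.encode() + b"\x00" + column_b.encode()
    return hashlib.sha1(payload).hexdigest()

# Store this value in column_C when writing the record, then
# build a secondary index on column_C to group records by pair.
```

The same pairs always map to the same column_C, so a secondary-index query on column_C finds every record sharing a given (column_A, column_B) combination; the stream UDF's job is then only to emit one record per distinct hash.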

#5

I keep forgetting: you guys don’t allow predicate filters on a query, do you? :slight_smile:

#6

Oh, you can do predicate filters on queries, but I’m not sure how you would implement uniqueness with them.

#7

If you want to compare records, you cannot do it natively, so you have to use some trick at record-creation time. As @kporter mentions, storing the hash in a third bin is one way.

Depending on the number of records in your data set, you can also keep a lookup record with a map-type bin, where each key is hash(column_A + column_B) and each value is the record’s digest (or user key, if they all belong to the same set). In the simple case, a single lookup record R1 holds { k1:v1, k2:v2, k3:v3, ... }, updated every time a record is created, where k1 = hash(bin a + bin b) and v1 = the user key or digest.

If you have a huge number of records, you will need multiple lookup records, where each Rn’s key is some significant bits of hash(bin a + bin b), so each entry maps to a specific lookup record. For example, if k1 is a 20-byte hash, use its first byte as the key of its lookup record; your total record set is then split across 256 lookup maps. Using the sizes of the maps, you can aggregate the total number of unique records, or use the values in each map to go fetch those unique records for additional processing.
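A small sketch of the bucketing described above, again in plain Python rather than Aerospike client code (the helper name `lookup_record_key` and the in-memory dict standing in for the 256 lookup records are illustrative assumptions):

```python
import hashlib

def lookup_record_key(column_a: str, column_b: str):
    """Pick which of 256 lookup records holds this pair's entry.

    The first byte of the 20-byte SHA-1 digest selects the bucket
    (lookup record R0..R255); the full digest is the map key
    inside that record's map-type bin.
    """
    digest = hashlib.sha1(
        column_a.encode() + b"\x00" + column_b.encode()
    ).digest()
    return digest[0], digest.hex()

# In-memory simulation of the lookup maps:
lookups = {}  # bucket number -> {map_key: user key}
rows = [("a", "a"), ("a", "a"), ("a", "b"), ("b", "a")]
for userkey, (a, b) in enumerate(rows):
    bucket, k = lookup_record_key(a, b)
    lookups.setdefault(bucket, {})[k] = userkey

# Summing map sizes counts the unique (a, b) pairs: 3 here,
# since ("a", "a") appears twice.
unique_count = sum(len(m) for m in lookups.values())
```

In Aerospike itself each bucket would be a separate lookup record with a map bin, updated via map operations at record-creation time.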

The only issue I see with this scheme is updating bins a and b of the same record; it will work if you never update them. If you do want to update, then go with @kporter’s method. Stream UDFs are not my preferred choice if the cluster is going through changes, like a node being added or dropping out.

On second thought, if you do want to update, then the value will have to become a list of digests, and you must check whether this record’s digest is the only entry; if so, delete that k,v pair, then insert a new one wherever it belongs. So you will have { k1:[va, vb, vc, ...], k2:[vn, vm, vp, ...] }, etc. (You will have to use GEN_CHECK_EQUAL or a check-and-set technique to update the k:v entry.)
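The move-on-update logic above can be sketched like this (plain Python over an in-memory dict; the function name `move_pair` is an illustrative assumption, and the generation check mentioned in the post is only noted in a comment since it is an Aerospike write-policy concern):

```python
def move_pair(lookup_map: dict, old_key: str, new_key: str, digest: str):
    """Move a record's digest when its (a, b) pair changes.

    Remove the digest from the old hash's list, dropping the
    map entry entirely if the list empties (i.e. this record
    was the only one with that pair), then append the digest
    under the new hash. In Aerospike, the whole update would
    be guarded by GEN_CHECK_EQUAL / check-and-set so concurrent
    writers cannot clobber each other.
    """
    old_list = lookup_map.get(old_key, [])
    if digest in old_list:
        old_list.remove(digest)
        if not old_list:
            del lookup_map[old_key]
    lookup_map.setdefault(new_key, []).append(digest)

m = {"k1": ["va", "vb"], "k2": ["vn"]}
move_pair(m, "k2", "k1", "vn")
# "vn" was k2's only entry, so k2 is gone and k1 gained "vn".
```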

1 Like