In Java using Batch, query the data using the secondary index

Primary Key :: EmployeeIdValueEmployeeIdType

Secondary Index :: AlternateEmployeeId

1.we are fetching all the keys from the input and storing it in the KeystoFetch.

    List<String> keysToFetch = new ArrayList<>();
    for (Employee employee: body.getEployees()) {
            keysToFetch.add(employee.getIdValue() + employee.getIdType());
    }

2. Preparing the Key

    Key[] keySize = new Key[keysToFetch.size()];
    for (int i = 0; i < keySize.length; i++) {
        keySize[i] = new Key(SchmeaName, TableName, keysToFetch.get(i));
    }

3. we are using the batch to query the data.

    Record[] records = aspClient.get(null, keySize); 
    List<String> alternateEmployeeIdList= new ArrayList<>();
    for (Record record : records) {
        if (record != null) {
            String alternateEmployeeId= record.getString("AlternateEmployeeId");
            if (alternateEmployeeId!= null) {
                alternateEmployeeIdList.add(alternateEmployeeId);
            }
        }
    }

4. Now I got all the alternateEmployeeIdList .With the help alternateEmployeeIdList i need to query the database to get the Primary key and other bins like (location, role, number).

Below are the solutions which i tried. Its not taking all the alternateEmployeeIdList in the filter. can someone suggest how to resolve this issue.

    Statement statement = new Statement();
    statement.setNamespace(schema);
    statement.setSetName(tablename);
    statement.setBinNames("location","role","number");

    for(String alternateEmployeeId: alternateEmployeeIdList){
        **statement.setFilter(Filter.contains("AlternateEmployeeId",IndexCollectionType.LIST,alternateEmployeeId));**
    }
    RecordSet recordSet = aspClient.query(new QueryPolicy(), statement);

What is your records data layout (schema)? Do you have one set of records that give you the AlternateEmployeeId and another set of records that have the location/role etc along with AlternateEmployeeId? As far as a Secondary Index query goes - you can filter on one string for equality. You cannot pass a list of strings and say, any that match, send me the record.

Using this API: static Filter contains(String name, IndexCollectionType type, String value, CTX... ctx) Create contains string filter for query on collection index.

is saying: I have bin named ‘name’, it has a list of strings (if you do IndexCollectionType.LIST) and if any of those list values match to ‘value’, send me the record.

For example: records have these bins - myfriends and location:
{ "myfriends": [ "Brian", "Josh", "Andy", "Kevin"] , "location": "CA" }

I can do an SI search for returning all records where bin myfriends contains “Thomas”.

What you cannot do is pass a list: Find all records where myfriends has any of ["Thomas", "Mark" , "Brad"]

This loop that you have, is just updating the filter to match the last string value in the loop.

for(String alternateEmployeeId: alternateEmployeeIdList){ statement.setFilter(Filter.contains("AlternateEmployeeId",IndexCollectionType.LIST,alternateEmployeeId)); }

You are updating the Filter in a loop - the last string in the loop becomes the Filter.

==> if you run your query also in the for loop, you will get your results - each match will be a separate query - not sure that is what you want to do in terms of performance.

Please find the data layout.

Schema: GlobalUsers

Set Name: Employee

Bins :: EmployeeIdValueEmployeeIdType, AlternateEmployeeId**(Secondary Index)**, location, role, number

Data:

| EmployeeIdValueEmployeeIdType | AlternateEmployeeId | location | role | number |

| 1234Contractor | v564 | USA | Tester | 786 |

| 3564FullTime | v743 | UK | developer| 654|

| 4739Consignment | v836 | IN | Analyst | 347|

| 9454W2 | v836 | CA | modeler| 876|

Based on the List of **AlternateEmployeeId [ v836,v743,v564] ** that we have, we need to query the database to get the other bins data(EmployeeIdValueEmployeeIdType, location, role, number).

Using batch we hit the database once, to get all the data that is related to the AlternateEmployeeId and display the results.

How big is this list? 3 typically? [ v836,v743,v564] - and I assume each person has only one AlternateEmploeeyId as you shown - not multiple right? And no two individuals have same alternateid - these two records have same v836 … typo?

| 4739Consignment | v836 | IN | Analyst | 347|

| 9454W2 | v836 | CA | modeler| 876|

The list contains more than 50,000 records. Each person can have multiple AlternateEmploeeyId’s.

EmployeeIdValueEmployeeIdType is the primary key where the records are unique. For the rest of the bins/columns, it’s not.

| EmployeeIdValueEmployeeIdType | AlternateEmployeeId | location | role | number |

| 1234Contractor | v564 | USA | Tester | 786 |

| 3564FullTime | v743 | UK | developer| 654|

| 4739Consignment | v836 | IN | Analyst | 347|

| 9454W2 | v836 | CA | modeler| 876|

| 484Contractor | v564 | USA | Tester | 786 |

| 110FullTime | v743 | UK | developer| 654|

| 312Consignment | v836 | IN | Analyst | 347|

| 299W2 | v836 | CA | modeler| 876|

This is a reverse lookup data model problem. For starters, you can try this construct.

QueryPolicy qp = new QueryPolicy();
qp.shortQuery = true;
for(String alternateEmployeeId: alternateEmployeeIdList){
 
 statement.setFilter(Filter.contains("AlternateEmployeeId",IndexCollectionType.LIST,alternateEmployeeId));
 RecordSet recordSet = aspClient.query(qp, statement);
 // processThisRecordSet(recordSet);
 // loop back to next alternateEmployeeId
}

[/quote]

1 Like

Please correct me if I’m wrong, but it appears that the above query( RecordSet recordSet = aspClient.query(qp, statement);) is calling the database each time for, let’s say, if we have 1000 records, then it’s calling the database 1000 times. Shouldn’t we use batch processing to hit the database once for all the 1000 records and retrieve the data?

Below Code works but i would like to do a batch call to the DB.

  for (String alternateEmployeeId : alternateEmployeeIdList) {
        statement.setFilter(Filter.equal("AlternateEmployeeId", alternateEmployeeId));
        RecordSet recordSet1 = aspClient.query(qp, statement);
        while (recordSet1.next()) {
            Record record = recordSet1.getRecord();
            System.out.println(recordSet1.toString());
        }
    }

We have some confusion or ambiguity in our mutual understanding of the word “record”. In Aerospike lingo, a record is a PrimaryKey+Data and is accessible via its Primary Key and has data in bins. In your case: Primary Key is EmployeeIdValueEmployeeIdType and data bins are: | AlternateEmployeeId | location | role | number |
So when you say “the list contains more than 50,000 records” or “if we have a 1000 records” … I assume you mean the query list of AlternateEmployeeIds is of size 1000.

If you don’t know the PrimaryKey - you can either query all the records and return those that match a bin value (Filter Expressions) or build a secondary index on a bin value (in your case a string type) and use the SI query to return matching records for one specific string value.

In your case, you want records back matching any string value in your list in the record’s bin. So if you want to do an SI query, you have to hit the database once for each string value for equality match. This is the code that now works for you.

A single record read is reading a record whose Primary Key is known - using the Primary Key. A batch read is reading a set of records whose Primary Keys are known. This is what you are doing in parts 1, 2, 3 of your initial question. I had asked if this batch read you are doing is a different data set or the same as you are running the SI query on? I did not see that response. So I am not sure what data steps 1,2 and 3 are running on to get the AlternateEmployeeID list. If it is the same data, then you already have all the record data.

Back to your question about batch reading using AlternateEmploeeID list - you cannot batch read in Aerospike using a bin value as your batch key. You need a list of Primary Keys to do a batch read (as you are already doing in your steps 1, 2 and 3.)

Hope that clarifies.

1 Like

I am using only one data set here i.e

  • Schema: The schema for our database is named “GlobalUsers,” and the set within this schema is called “Employee.”
  • Bins: The bins within this schema include “EmployeeIdValueEmployeeIdType,” “AlternateEmployeeId” (which serves as a Secondary Index), “location,” “role,” and “number.”

Data Example

EmployeeIdValueEmployeeIdType AlternateEmployeeId location role number
1234Contractor v564 USA Tester 786
3564FullTime v743 UK Developer 654
4739Consignment v836 IN Analyst 347
9454W2 v836 CA Modeler 876
484Contractor v564 USA Tester 786
110FullTime v743 UK Developer 654
312Consignment v836 IN Analyst 347
299W2 v836 CA Modeler 876
  • The Primary Key (PK), represented by “EmployeeIdValueEmployeeIdType,” is unique and cannot be empty, while the Secondary Index (SI), represented by “AlternateEmployeeId,” may or may not be null.

I had asked if this batch read you are doing is a different data set or the same as you are running the SI query on? :: I am running the SI query on the Same data set.

Back to your question about batch reading using AlternateEmploeeID list - you cannot batch read in Aerospike using a bin value as your batch key. You need a list of Primary Keys to do a batch read (as you are already doing in your steps 1, 2 and 3.) → Is it not possible to use a list of SI keys to perform a batch read like we did with a list of Primary Keys?

Clarification

EmployeeIdValueEmployeeIdType = [1234Contractor, 4739Consignment, 3564FullTime]

  • In steps 1, 2, and 3, batch processing (list of keys) was utilized to query the database based on the primary key to retrieve data. This process worked fine with no issues.

  • Now, with the “alternateEmployeeIdList” containing values like [v564, v836, v743], the intention is to query the database using the Secondary Index (AlternateEmployeeId) using a batch approach.

  • However, the current solution is not utilizing batch processing effectively. It’s hitting the database multiple times instead of making a single call for all the Secondary Index data.

This solution works for querying using the Secondary Index (SI) but doesn’t utilize batch processing efficiently, resulting in multiple database calls instead of a single batch call.

 for (String alternateEmployeeId : alternateEmployeeIdList) {
    statement.setFilter(Filter.equal("AlternateEmployeeId", alternateEmployeeId));
    RecordSet recordSet1 = aspClient.query(qp, statement);
    while (recordSet1.next()) {
        Record record = recordSet1.getRecord();
        System.out.println(recordSet1.toString());
    }
}

To address this, we need to explore alternative approaches that effectively utilize batch processing to query using the Secondary Index (AlternateEmployeeId). Do we have any ?

First let us also agree what we mean by the Primary Key in Aerospike. It is a string value (in your case) that is the key to the record. It is not stored with the record. Looks like you are also storing it in a data bin.

Typically - I would have a record: PrimaryKey = “key1” and data as: “name” “age” … two bins lets say. “key1” by default is stored as a 20 byte hash in Aerospike. Original “key1” is not stored by default.

So what you have in Aerospike is: hash of “key1” → { “name”:“Jack”, “age”:23} - that is a record in Aerospike.

What you done is: hash of “EidLocid” → { “EidLocId”: “1234Contractor”, “AltId”: “v564” … etc. } which is fine. You can store your string primary key as a data bin also if you want. But to Aerospike that is just a data bin - not a “PrimaryKey”. So when you are using Aerospike APIs and they are talking about Key and Bins - hope this clarifies the difference.

==> So answer to your question : Is it not possible to use a list of SI keys to perform a batch read like we did with a list of Primary Keys? is : Unfortunately, no.

Internally, the effective implementation is: [SI Value] → {PrimaryKey1, PrimaryKey2, PrimaryKey3 …} and you can query on for a single SI value at a time.

But you can create this association in your initial batch run when you fetch the AlternateEmployeeIDs and build your list. Instead build a map - eachID → associated with record key that you are sending in the batch. e.g. you build this map {“v836”: [ “312Consignment”, “299W2”] , “v743”:[“3564FullTime”, “110Fulltime”], …} in your application.

Which brings me to the next question. How are you building this key set?

List<String> keysToFetch = new ArrayList<>();
    for (Employee employee: body.getEployees()) {
            keysToFetch.add(employee.getIdValue() + employee.getIdType());
    }

This is not from data stored in Aerospike is it?

Time permitting, I will try to code this to validate my suggestion, but hopefully you get what I am trying to say and try on your own as well. Once you have this map of AltIds to your Primary Keys in your application you can build the subset of Primary Keys for the list of AltIds and then do a batch read.

I assume you build this mapping like nightly or something and then use the batch read at run time. Somehow you will have to manage underlying data updates to reflect in the mapping.

Thank you for the clarification.

So when you are using Aerospike APIs and they are talking about Key and Bins → Makes sense.

Which brings me to the next question. How are you building this key set? → We are storing all the inputs we received from the consumer, in thekeysToFetch List, which is not from data stored in Aerospike.

I must follow the 1,2,3 steps to get the secondary index bin i.e. ( AlternateEmployeeId).

I think Mapping might complex the things. we are not doing any nightly runs. We are building a Java Spring Boot API component.

If we can use the batch for the Primary Key Bin to query the data, why can’t we use the same for the SI bin?

I think you are still thinking in terms of SQL and Relational Databases. Based on how you are querying the data, you don’t need the EmployeeIdValueEmployeeIdType bin in your data. Primary Index Key is not by a data column in Aerospike.

There is no “table” schema. Each record is its own schema and is accessible by its key which is a 20 byte hash. I can have one record with bins: “name” , “age” and next record in the same set “city” “state”, and Aerospike would not object.

"key1" --> 0x45...abc --> {"name": "jack", "age":23}
"key2" ---> 0x56...45d  --> {"city":"LA", "state":"CA"} 

…in the same namespace and set … and Aerospike would not care. (Of course not a practical data model.)

Key(“GlobalUsers”,“Employee”, “1234Contractor”) ==> 0xf8305…456a … a 20 byte hash.

In Aerospike, you have: 0xf8305…456a → Bin Data. ( AlternateEmployeeId | location | role | number) – I believe that’s all you need.

You don’t need to store EmployeeIdValueEmployeeIdType in Aerospike unless you really want to read it back as “data” also.

So, its not like you have table schema in Aerospike and one column primary key and another is a secondary key - and you can do a batch read on the secondary key - you can’t.

The way you have the code working, firing multiple SI queries to match each string value, is one way to do it currently. Now, is it possible for the server to do this iteration if you passed a list? - I would think it should be possible but that will be a new Feature Request to the Product team. We currently support a range of integers besides equality on integer value SIs. This would extend it to a “set of distinct strings, provided as a list”.

  • Schema: The schema for our database is named “GlobalUsers,” and the set within this schema is called “Employee.”
  • Bins: The bins within this schema include “EmployeeIdValueEmployeeIdType,” “AlternateEmployeeId” (which serves as a Secondary Index), “location,” “role,” and “number.”
EmployeeIdValueEmployeeIdType AlternateEmployeeId location role number
1234Contractor v564 USA Tester 786
3564FullTime v743 UK Developer 654
4739Consignment v836 IN Analyst 347
9454W2 v836 CA Modeler 876
484Contractor v564 USA Tester 786
110FullTime v743 UK Developer 654
312Consignment v836 IN Analyst 347
299W2 v836 CA Modeler 876

We are trying to code below query in java and Below query works with AQL.

Select * from GlobalUsers.Employee where AlternateEmployeeId in (‘v564’,v836’);

What i am getting from previous response is that, in Java Aerospike the above can only be achieved by running the following two queries.

Select * from GlobalUsers.Employee where AlternateEmployeeId = ‘v564’ ;

Select * from GlobalUsers.Employee where AlternateEmployeeId = ‘v836’ ;

Please confirm if my understanding is correct.

Thank you so much for your help.

This works in AQL?

Select * from GlobalUsers.Employee where AlternateEmployeeId in (‘v564’,v836’);

not for me… (I have a secondary index on name bin.) My server version is: E-7.0.0.8

Aerospike Query Client
Version 9.0.0
aql> select * from test.testset
+--------+----------+-----+
| PK     | name     | age |
+--------+----------+-----+
| "key0" | "Sandra" | 34  |
| "key2" | "Jill"   | 20  |
| "key4" | "Jim"    | 46  |
| "key1" | "Jack"   | 26  |
| "key8" | "Sam"    | 12  |
| "key3" | "James"  | 38  |
| "key7" | "Sean"   | 24  |
| "key6" | "Sally"  | 32  |
| "key5" | "Julia"  | 62  |
| "key9" | "Susan"  | 42  |
+--------+----------+-----+
10 rows in set (0.035 secs)

OK

aql> select * from test.testset where name="Jill"
+--------+--------+-----+
| PK     | name   | age |
+--------+--------+-----+
| "key2" | "Jill" | 20  |
+--------+--------+-----+
1 row in set (0.007 secs)

OK

aql> select * from test.testset where name in ("Jill")
Unsupported command format with token -  '(' 
Make sure string values are enclosed in quotes.
Type " aql --help " from console or simply "help" from within the aql-prompt. 

aql> select * from test.testset where name in ["Jill"]
Unsupported command format with token -  '[' 
Make sure string values are enclosed in quotes.
Type " aql --help " from console or simply "help" from within the aql-prompt. 

aql> select * from test.testset where name in ("Jill", "Jack")
Unsupported command format with token -  '(' 
Make sure string values are enclosed in quotes.
Type " aql --help " from console or simply "help" from within the aql-prompt. 

aql> select * from test.testset where name in ('Jill', 'Jack');
Unsupported command format with token -  '(' 
Make sure string values are enclosed in quotes.
Type " aql --help " from console or simply "help" from within the aql-prompt. 


aql> 

Big picture wise, AQL is an application written in C that uses the Aerospike C Client underneath. All Aerospike clients interact with the server via a common wire protocol to send commands to the server. Clients and applications via clients / C, Java, C# what have you, cannot do what the server does not support. I tested your construct just to make sure they did not add a new functionality in AQL where they are going through a list of strings and may be firing individual SI queries to the server for each string.

Yes, it works in AQL.Attached is the snippet for reference.

Select * from GlobalUsers.Employee where AlternateEmployeeId in (‘v564’,v836’);

When i am querying, using the SI i don’t get any issues.

That does not look like AQL. AQL is a command line tool. Whatever you are running (using JDBC driver?) is either scanning every record and applying a filter expression or (most likely) firing individual SI query for each string in the list. I have shown you the AQL output above. It cannot parse that construct.

==> Anything connecting to the server via TCP/IP cannot make the server do what the server does not support. You can check the server logs. The log ticker will bump up by the number of SI queries executed. (In my log below, I have executed 13 queries so far, printed every 10 seconds.) check count, run your test once, check again after 10 seconds see if it fired a single SI query or more - equal to your number of strings in the list.

May 01 2024 21:43:00 GMT: INFO (info): (ticker.c:885) {test} si-query: short-basic (0,0,0) long-basic (13,0,0) aggr (0,0,0) udf-bg (0,0,0) ops-bg (0,0,0)

BTW, AlternateEmployeeId is > 15 characters - max allowed bin name size - so your real bin name must be something shorter. Just double checking - you don’t need to share your exact names.

1 Like