Scan Query : com.aerospike.client.AerospikeException: Error Code 4: Parameter error

Hi, I am using Aerospike 3.10.1.1 with client version 4.0.8, and I am running into issues scanning a large table. The job seems to succeed with lower volumes of data. The scan is done via a query (with no predicates).

18/02/20 23:53:57 ERROR ApplicationMaster: User class threw exception: com.aerospike.client.AerospikeException: Error Code 4: Parameter error
com.aerospike.client.AerospikeException: Error Code 4: Parameter error
	at com.aerospike.client.command.MultiCommand.parseGroup(MultiCommand.java:89)
	at com.aerospike.client.command.MultiCommand.parseResult(MultiCommand.java:64)
	at com.aerospike.client.command.SyncCommand.execute(SyncCommand.java:82)
	at com.aerospike.client.query.QueryExecutor$QueryThread.run(QueryExecutor.java:141)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

My code snippet:

import com.aerospike.client.IAerospikeClient
import com.aerospike.client.policy.{Priority, QueryPolicy}
import com.aerospike.client.query.{KeyRecord, Statement}
import scala.collection.JavaConverters._

def queryAeroSpike(asClient: IAerospikeClient, namespace: String, setName: String): Iterator[KeyRecord] = {
  try {
    // Query policy
    val queryPolicy = new QueryPolicy()
    queryPolicy.priority = Priority.LOW
    queryPolicy.includeBinData = true
    queryPolicy.maxRetries = 3
    queryPolicy.sleepBetweenRetries = 10

    // Query statement: no filters, so this is effectively a full scan of the set
    val st = new Statement()
    st.setNamespace(namespace)
    st.setSetName(setName)

    // query() returns a lazy RecordSet; records are pulled as the iterator is consumed
    asClient.query(queryPolicy, st).iterator().asScala
  } catch {
    case e: Exception =>
      LogHelper.logger.error("exception : ", e)
      Iterator[KeyRecord]()
  } finally {
    aerospikeStore.getClient.close()
  }
}
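
An aside on the snippet above: the finally block closes the client before the caller has consumed the returned iterator, which is lazy. One way to avoid pulling records through a closed client, sketched here against the same signatures (the helper name and structure are illustrative, not from the thread), is to take the per-record processing as a parameter and close only the record set once iteration finishes:

  def foreachRecord(asClient: IAerospikeClient, namespace: String, setName: String)
                   (process: KeyRecord => Unit): Unit = {
    val queryPolicy = new QueryPolicy()
    queryPolicy.includeBinData = true

    val st = new Statement()
    st.setNamespace(namespace)
    st.setSetName(setName)

    val rs = asClient.query(queryPolicy, st)
    try {
      rs.iterator().asScala.foreach(process)
    } finally {
      rs.close() // close the record set, not the client; the caller owns the client
    }
  }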

Calling code:

try {
  val recordSetIterator = queryAeroSpike(asClient, config.namespace, config.setName)
  while (recordSetIterator.hasNext) {
    // Some record processing
  }
} catch {
  case e: Exception =>
    LogHelper.logger.error("exception caller : ", e)
} finally {
  asClient.close()
}

The exception is encountered in the calling code while iterating the record set (after processing a couple of million records). I see a similar issue in Parameter Error when making Aerospike Query with multiple equal filters · Issue #33 · aerospike/aerospike-client-java · GitHub, but there are no filters (and no OR clauses) in my query. Any clues as to what may be going on here?

Troubleshooting further, on the server side I see the following logged just before my scan fails:

Feb 21 2018 01:44:59 GMT: WARNING (scan): (scan.c:346) send error - fd 801 sz 734976 Connection timed out
Feb 21 2018 01:44:59 GMT: INFO (scan): (scan.c:686) finished basic scan job 13295360087278674846 (-2)

This corresponds to https://github.com/aerospike/aerospike-server/blob/hotfix/3.10.1/as/src/base/scan.c#L343, but I have no idea what could be causing it.

Comments:

  1. Never set maxRetries > 0 on scans/queries. If the initial attempt fails while records are being received, a retry will deliver duplicate records.

  2. If one of the query/scan nodes is blocked on the client side (the client is not retrieving/processing records fast enough), the query/scan can time out. This server timeout is a socket idle timeout, meaning the client hasn’t asked for more records within the timeout period (see the sketch below for one way to keep the client reading).

It’s also possible that the server itself was blocked in some way while running this query/scan.
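
For point 2, one way to keep the client actively reading even when per-record processing is slow is to drain the iterator on a dedicated thread into a bounded queue. A minimal sketch; the helper name and queue capacity are illustrative, not from the thread:

  import java.util.concurrent.{ArrayBlockingQueue, TimeUnit}

  // Sketch: one thread drains the socket as fast as it can while the consumer
  // processes at its own pace. The buffer absorbs temporary processing stalls;
  // a consumer that is persistently slower will still eventually fill it.
  def drainAndProcess(records: Iterator[KeyRecord])(process: KeyRecord => Unit): Unit = {
    val queue = new ArrayBlockingQueue[KeyRecord](10000) // illustrative capacity
    @volatile var done = false

    val reader = new Thread(new Runnable {
      def run(): Unit =
        try records.foreach(queue.put) finally done = true
    })
    reader.start()

    while (!done || !queue.isEmpty) {
      val rec = queue.poll(100, TimeUnit.MILLISECONDS) // poll with timeout so we notice completion
      if (rec != null) process(rec)
    }
    reader.join()
  }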

Thanks for the reply, Brian. Good point about maxRetries. Regarding point 2, I am still a little unclear. I went over Understanding Timeout and Retry policies, and my understanding is that socketTimeout and totalTimeout are client-side properties. I set both of them to 0 (i.e. infinite), which should mean that the issues we are seeing are related to server timeouts (the hypothesis being that the server is not getting requests from the client). Any idea how to control this timeout? I took a look at Configuration Reference | Aerospike Documentation, and the only relevant config parameter I could find was transaction-max-ms. But this is not it, because according to the documentation, “Transactions taking longer than this time (or the time specified in the client policy) will return a timeout and tick the client_tsvc_timeout metric,” and I do not see that in the server logs.

Any pointers in troubleshooting more will be very helpful.

In addition, the server applies a 10 second socket idle timeout on query/scan if the client policy socketTimeout is not overridden.

The 10 seconds: is that the timeout on the socket between the client and the server? I am wondering if there is a timeout that governs the actual time taken to fetch a record within Aerospike (server side). My hypothesis is that the timeout between the client and the server can be configured (which is what I set to infinite), but there must be a timeout within Aerospike that governs how long to wait to fetch a record, and that may be what we are hitting.

The 10 seconds is the socket idle timeout. If the server takes over 10 seconds to fetch the next block of records, the socket would be idle for that 10 seconds, so the timeout will apply. The client must be actively reading and the server must be actively sending within the 10 second interval to avoid a timeout.

If you configure the client socketTimeout to a value greater than 10000ms, the server will respect that value. The only socketTimeout value that the server overrides is zero.

That’s good to know. I reset my policy to look like this:

  val queryPolicy = new QueryPolicy()
  queryPolicy.priority = Priority.MEDIUM
  queryPolicy.includeBinData = true
  queryPolicy.setTimeout(60000)

And I get the same error. It is fairly deterministic, in the sense that the timeouts occur after scanning about 40M records, with the timeout set to 0 as well as to 60 seconds, which I imagine is a fairly big timeout. Makes me think it could be a data issue? Perhaps there’s a row that’s too big? (Which is still weird, because I am only scanning a single bin, which is an id.)
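
(For reference, restricting the statement to a single bin would look like the following; the bin name "id" here is hypothetical, standing in for whatever the bin is actually called.)

  val st = new Statement()
  st.setNamespace(namespace)
  st.setSetName(setName)
  st.setBinNames("id") // fetch only this bin; "id" is a hypothetical name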

For large scans, avoid setting totalTimeout or using setTimeout(); set only socketTimeout directly:

  val queryPolicy = new QueryPolicy()
  queryPolicy.priority = Priority.MEDIUM
  queryPolicy.includeBinData = true
  queryPolicy.socketTimeout = 60000

The most common reasons for socket timeouts on query/scan are:

  1. A network failure between client/server has occurred. This doesn’t seem to be applicable to your case.

  2. The client stops requesting more data because of some bottleneck in your application code.

I suggest running your query and just iterating through the results, without performing any other processing:

  val recordSetIterator = queryAeroSpike(asClient, config.namespace, config.setName)
  while (recordSetIterator.hasNext) {
    recordSetIterator.next()
    // do not do anything else here
  }

Thanks Brian, I think your analysis may have been spot on. I removed all processing, and now it’s a completely different issue. I am scanning a single bin across around 100M rows at low priority. I let the job run, but it ran for over 24 hours without completing (it didn’t fail; I killed it manually).

show scans

+----------------+-----------+----------------------+--------------+---------------+----------+----------+--------+------------------------------------+-----------+----------+--------------+-----------------+
| ns             | mem-usage | trid                 | job-progress | set           | priority | job-type | module | status                             | recs-read | run-time | net-io-bytes | time-since-done |
+----------------+-----------+----------------------+--------------+---------------+----------+----------+--------+------------------------------------+-----------+----------+--------------+-----------------+
| "xxxxx" | 0         | 3516092068537687670  | "0.59"       | "xxxxxxx" | 1        | "basic"  | "scan" | "done(abandoned-response-timeout)" | 38522     | 15209    | 1048712      | 80585129        |
| "xxxxx" | 0         | -5816913134390555011 | "78.15"      | "xxxxxxx" | 2        | "basic"  | "scan" | "done(abandoned-response-timeout)" | 4288657   | 551992   | 545598040    | 84847470        |
| "xxxxx" | 0         | 2075446121899768948  | "77.98"      | "xxxxxxx" | 2        | "basic"  | "scan" | "done(abandoned-response-timeout)" | 4262447   | 548500   | 541676344    | 85427077        |
| "xxxxx" | 0         | 447538121829824692   | "3.47"       | "xxxxxxx" | 1        | "basic"  | "scan" | "done(abandoned-response-error)"   | 238854    | 27831    | 27362416     | 86812879        |
| "xxxxx" | 0         | 9047329268901944071  | "78.15"      | "xxxxxxx" | 1        | "basic"  | "scan" | "done(abandoned-response-timeout)" | 4294519   | 832019   | 546645984    | 86868106        |
| "xxxxx" | 0         | -1635748149807116231 | "1.29"       | "xxxxxxx" | 1        | "basic"  | "scan" | "done(abandoned-response-error)"   | 95051     | 9174     | 8259656      | 87788097        |
| "xxxxx" | 0         | 5163961504785540734  | "76.98"      | "xxxxxxx" | 1        | "basic"  | "scan" | "done(abandoned-response-timeout)" | 4231802   | 553846   | 538370584    | 87826005        |
| "xxxxx" | 0         | -9195202362902802994 | "79.59"      | "xxxxxxx" | 1        | "basic"  | "scan" | "done(abandoned-response-timeout)" | 4361853   | 692287   | 554127296    | 149847744       |

The result set above is truncated for brevity. I also maintain a running count of how many records were scanned; at the time of termination, this count was orders of magnitude larger than the number of rows.

Any idea why a scan would not complete? Should I attribute the larger counts to retries?

Low priority scans can get starved out if higher priority scans are constantly running. Retries can definitely result in duplicate rows.
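
One way to test whether the inflated running count comes from retry duplicates is to track distinct record digests while iterating. A rough sketch (the helper is illustrative; it is workable on a sample or a smaller set, since holding one digest per record for 100M records is memory-heavy):

  import scala.collection.mutable

  // Sketch: duplicates from retried scan responses show up as total >> distinct.
  def countDistinct(records: Iterator[KeyRecord]): (Long, Long) = {
    val seen = mutable.HashSet[Seq[Byte]]()
    var total = 0L
    records.foreach { kr =>
      total += 1
      seen += kr.key.digest.toSeq // the key digest uniquely identifies a record
    }
    (total, seen.size.toLong)
  }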

There are no other scan jobs, but perhaps other read-heavy jobs could cause the same effect. Thanks for your help, Brian. I definitely picked up a bunch of things in this thread and will investigate more.