Since upgrading to anything later than 3.0.94, we've been seeing a lot of
Unexpected data received from socket after a write: fd=5 size=130802
(where the size may differ slightly, but is typically on the order of 128K) errors while trying to write records. We've traced this down to code added in commit aerospike/aerospike-client-c@c9ad474 ("Merge aerospike and citrusleaf layers into a single unified client wi…"), which checks for extra data on the socket. The error occurs in the following scenario:
- we have a set with a secondary integer index and some data bins (8 in total). Initially, the value of the indexed bin is 0 for all records
- we insert records (it doesn't matter whether it's 1,000 or 1,000,000)
- we run a query (`as_query_foreach`) with one condition: `as_query_where(&query, "in_use", as_integer_equals(0));`. The callback only needs one record, so it is hard-coded to always return false, and `as_query_foreach` is passed a userdata pointer to a `std::string`:
```cpp
static bool candidateRecordCallBack(const as_val *value, void *userData) {
    assert(userData);
    if (value) {
        as_record *rec = as_record_fromval(value);
        if (!rec) {
            return false;
        }
        char *strVal = as_record_get_str(rec, "key");
        if (strVal == 0) {
            return false;
        }
        std::string &str = *(static_cast<std::string *>(userData));
        str = strVal;
    }
    // Always return false: we only want the first matching record.
    return false;
}
```
- [Just for the record: if we store about 1,000 records, the callback is called about 3 times per `as_query_foreach` call; if we store about 100,000 records, we get about 5 callbacks. The last of these is always called with a NULL value pointer, as expected.]
- After `as_query_foreach` returns, the callback has given us a key. We use that key to retrieve and update the record (in the same set), and we destroy the query using `as_query_destroy`
- Next, we write a few records to other sets, and then we get the error mentioned above; however, which `aerospike_key_put` call fails differs between invocations.
- We have dumped the buffer that is claimed to be received erroneously in `as_command.c`, and it appears to contain records from the query intended for the query callback (i.e. we see record data). This is unexpected, since the query callback has finished and we've destroyed the query using `as_query_destroy`.
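For context, the early-termination contract described above can be reproduced in isolation: a foreach-style driver streams values to the callback until the callback returns false, then makes one final call with a NULL value. The sketch below uses hypothetical stand-ins (`Record`, `runQuery` are not Aerospike API) just to show why we expect the stream to stop after the first delivered record:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for an Aerospike record: just a key bin.
struct Record {
    std::string key;
};

// Mimics the foreach contract: stream records to the callback until it
// returns false, then deliver a final "end of stream" call with nullptr.
// Returns the total number of callback invocations.
static int runQuery(const std::vector<Record> &results,
                    bool (*callback)(const Record *, void *),
                    void *userData) {
    int calls = 0;
    for (const Record &rec : results) {
        ++calls;
        if (!callback(&rec, userData)) {
            break;  // consumer asked to stop early
        }
    }
    ++calls;
    callback(nullptr, userData);  // terminal NULL callback, as in the report
    return calls;
}

// Same shape as the callback in the report: copy the key, always stop.
static bool firstKeyCallback(const Record *rec, void *userData) {
    assert(userData);
    if (rec) {
        *static_cast<std::string *>(userData) = rec->key;
    }
    return false;
}
```

With a multi-node cluster, each node may already have records in flight before the stop propagates, which would explain the ~3 to ~5 callbacks we observe instead of exactly one.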
Is it possible that the query is somehow not properly terminated on the server, so that the server keeps sending data? There does not seem to be a way to check whether a query has fully finished and/or to flush any remaining data (and I would not expect to have to do that myself).
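The check that fires here is essentially "is there unread data sitting on the connection". The self-contained sketch below (plain POSIX sockets, no Aerospike code; `drainStaleBytes` is a name I made up) reproduces the failure mode: a peer keeps sending after the consumer stops reading, and the stale bytes then remain visible on the socket and can be drained with a non-blocking `recv`. This is only an illustration, not the client's actual implementation:

```cpp
#include <cassert>
#include <cerrno>
#include <cstring>
#include <sys/socket.h>
#include <unistd.h>

// Returns the number of stale bytes drained from fd, i.e. data that a
// previous (abandoned) response left sitting in the socket buffer.
static ssize_t drainStaleBytes(int fd) {
    char buf[4096];
    ssize_t total = 0;
    for (;;) {
        ssize_t n = recv(fd, buf, sizeof(buf), MSG_DONTWAIT);
        if (n <= 0) {
            break;  // EAGAIN/EWOULDBLOCK: socket buffer is clean again
        }
        total += n;
    }
    return total;
}
```

If the server really does keep streaming query results after the callback bails out, any leftover bytes on a pooled connection would be misattributed to the next request on that connection, which matches the symptom that a later, unrelated `aerospike_key_put` is the call that fails.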
Notes:
- the error occurs only on a ‘remote’ system, i.e. it does not happen if we run Aerospike on localhost (single-node cluster)
- the error occurs more often if there are more nodes in the cluster, and if the machines or the network are slower
- for our specific scenario, versions 3.0.96 through roughly 3.1.2 hang; anything after that exhibits the error listed above