Background job reports "done(ok)" while some records failed

During the execution of a background UDF job, we had the following errors in our server log.

Jan 18 2021 13:26:28 GMT: DEBUG (socket): (socket.c:1044) Error while sending on FD 127: 11 (Resource temporarily unavailable)
Jan 18 2021 13:26:55 GMT: WARNING (udf): (udf.c:1206) lua-error: result "UDF Execution Timeout"

After the background job had finished, show scans in aql reported the following.

aql> show scans
+----------------+-------------+------------+-----------------+-------------+--------------+----------------+--------------------+--------------------+--------------+-------------------------------------+----------+------------------+--------+----------------+--------------------+------------+----------+--------------+----------+-----------------+----------------+--------------------+
| active-threads | ns          | udf-active | udf-filename    | recs-failed | udf-function | recs-succeeded | recs-filtered-bins | trid               | job-progress | set                                 | priority | job-type         | module | recs-throttled | recs-filtered-meta | status     | run-time | net-io-bytes | rps      | time-since-done | socket-timeout | from               |
+----------------+-------------+------------+-----------------+-------------+--------------+----------------+--------------------+--------------------+--------------+-------------------------------------+----------+------------------+--------+----------------+--------------------+------------+----------+--------------+----------+-----------------+----------------+--------------------+
| "0"            | "ssd-store" | "0"        | "remove-record" | "584"       | "run"        | "325263"       | "0"                | "2005483622367821" | "100.00"     | "visit-duration"                    | "0"      | "background-udf" | "scan" | "325847"       | "0"                | "done(ok)" | "201885" | "30"         | "100000" | "1478265"       | "10000"        | "172.21.0.1:57338" |
+----------------+-------------+------------+-----------------+-------------+--------------+----------------+--------------------+--------------------+--------------+-------------------------------------+----------+------------------+--------+----------------+--------------------+------------+----------+--------------+----------+-----------------+----------------+--------------------+

This overview shows that the background job failed for 584 records (recs-failed). However, the status is marked as "done(ok)".

Unfortunately, the JobInfo response for this background job doesn’t contain any field for recs-failed.
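
For reference, the job info that the client exposes directly only covers progress and overall record counts, roughly like this (exact field names aside):

// Roughly what job.info() resolves to for the finished job;
// there is no recs-failed (or similar) field in it.
const info = await job.info()
// info => { progressPct: <number>, recordsRead: <number>, status: <job status code> }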

Is there any possibility, from within the Node.js client, to check whether a background job has fully or only partially succeeded?

The Node.js client’s Job class does not currently return this information, though that would be a nice feature addition! That said, you can issue the same info command that the Job class and aql use to get the details. Below is a quick example that shows how to do that, including how to parse the results. Note that you will still have to aggregate the results from the individual nodes in your cluster yourself; a sketch of that follows the sample output below.

const as = require('aerospike')

as.connect().then(async (client) => {
  try {
    // 1. start a background query and wait until it's done
    const query = client.query('test', 'test')
    const job = await query.background('test', 'noop')
    await job.waitUntilDone()

    // 2. retrieve job info from all nodes
    const infoCmd = `jobs:module=${job.module};cmd=get-job;trid=${job.jobID}`
    const jobInfo = await client.infoAll(infoCmd)

    // 3. parse the results - note: as.info module does not yet know the format of the jobs response, so we have to tell it which separators to use when parsing the data
    as.info.separators['jobs:*'] = [':=']
    jobInfo.forEach((nodeInfo) =>
      console.log(as.info.parse(nodeInfo.info))
    )
  } catch (error) {
    console.error(error)
  } finally {
    client.close()
  }
}).catch(console.error)

Sample output:

{
  'jobs:module=scan;cmd=get-job;trid=48028841706515': {
    trid: 48028841706515,
    'job-type': 'background-udf',
    ns: 'test',
    set: 'test',
    priority: 0,
    rps: 10000,
    'active-threads': 0,
    status: 'done(ok)',
    'job-progress': '100.00',
    'run-time': 39,
    'time-since-done': 978,
    'recs-throttled': 291,
    'recs-filtered-meta': 0,
    'recs-filtered-bins': 0,
    'recs-succeeded': 291,
    'recs-failed': 0,
    'net-io-bytes': 30,
    'socket-timeout': 10000,
    from: '172.17.0.1+58752',
    'udf-filename': 'udf',
    'udf-function': 'updateRecord',
    'udf-active': 0
  }
}
{
  'jobs:module=scan;cmd=get-job;trid=48028841706515': {
    trid: 48028841706515,
    'job-type': 'background-udf',
    ns: 'test',
    set: 'test',
    priority: 0,
    rps: 10000,
    'active-threads': 0,
    status: 'done(ok)',
    'job-progress': '100.00',
    'run-time': 51,
    'time-since-done': 967,
    'recs-throttled': 364,
    'recs-filtered-meta': 0,
    'recs-filtered-bins': 0,
    'recs-succeeded': 364,
    'recs-failed': 0,
    'net-io-bytes': 30,
    'socket-timeout': 10000,
    from: '172.17.0.1+55094',
    'udf-filename': 'udf',
    'udf-function': 'updateRecord',
    'udf-active': 0
  }
}
{
  'jobs:module=scan;cmd=get-job;trid=48028841706515': {
    trid: 48028841706515,
    'job-type': 'background-udf',
    ns: 'test',
    set: 'test',
    priority: 0,
    rps: 10000,
    'active-threads': 0,
    status: 'done(ok)',
    'job-progress': '100.00',
    'run-time': 50,
    'time-since-done': 968,
    'recs-throttled': 346,
    'recs-filtered-meta': 0,
    'recs-filtered-bins': 0,
    'recs-succeeded': 346,
    'recs-failed': 0,
    'net-io-bytes': 30,
    'socket-timeout': 10000,
    from: '172.17.0.1+56856',
    'udf-filename': 'udf',
    'udf-function': 'updateRecord',
    'udf-active': 0
  }
}

Please refer to the info command reference for more details about the jobs command.
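
To aggregate the per-node numbers, a minimal sketch built on the example above could look like this (totalRecsFailed is just an illustrative name, not part of the client API):

// Sum recs-failed across all nodes for a finished background job.
// Assumes `client` and `job` are set up as in the example above.
async function totalRecsFailed (client, job) {
  const infoCmd = `jobs:module=${job.module};cmd=get-job;trid=${job.jobID}`
  as.info.separators['jobs:*'] = [':=']
  const responses = await client.infoAll(infoCmd)
  return responses
    .map((nodeInfo) => as.info.parse(nodeInfo.info)[infoCmd])
    .reduce((total, stats) => total + Number(stats['recs-failed']), 0)
}

If the total is greater than zero, the job only partially succeeded, even though every node reports its status as "done(ok)".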

To be honest, this approach for finding out if a background job succeeded was quite unexpected. Nonetheless, we gave it a try, and it is now giving us different errors.

Our implementation of your suggested solution looks like this.

const infoCommand = `jobs:module=${job.module};cmd=get-job;trid=${job.jobID}`
const jobInfos = await client.infoAll(infoCommand)

Aerospike.info.separators['jobs:*'] = [':=']

const noOfFailedRecords = jobInfos
	.map(jobInfo => Aerospike.info.parse(jobInfo.info)[infoCommand]['recs-failed'])
	.reduce((memory, failures) => memory + failures, 0)

Unfortunately, that command gives us an error for every single execution. Those errors are as follows.

  • ~ 90% of the cases: Jan 20 2021 08:11:15 GMT: ERROR(13660) [command.cc:84] [ErrorCallback] - Client error in InfoForeach command: Info request exceeds max. length (1803363888) [-2]
  • ~ 10% of the cases: AerospikeError: job not found

The documentation for infoAll states that the request can be empty, in which case all available information will be returned.

When calling infoAll with no request parameter, we get back lots of information, but it doesn't include any job information.

When calling infoAll with e.g. await client.infoAll('jobs'), only the list of nodes is returned.

[ { host: { node_id: 'BB9030012AC4202' } } ]

I gave it another try via asinfo for one specific job: asinfo -v 'jobs:module=query;cmd=get-job;trid=8357818281184793'. As a result, we get ERROR:2:job not found, while show scans in aql lists the job with trid=8357818281184793:

aql> show scans
+----------------+-------------+------------+-----------------+-------------+--------------+----------------+--------------------+--------------------+--------------+------------------+----------+------------------+--------+----------------+--------------------+------------+----------+--------------+----------+-----------------+----------------+--------------------+
| active-threads | ns          | udf-active | udf-filename    | recs-failed | udf-function | recs-succeeded | recs-filtered-bins | trid               | job-progress | set              | priority | job-type         | module | recs-throttled | recs-filtered-meta | status     | run-time | net-io-bytes | rps      | time-since-done | socket-timeout | from               |
+----------------+-------------+------------+-----------------+-------------+--------------+----------------+--------------------+--------------------+--------------+------------------+----------+------------------+--------+----------------+--------------------+------------+----------+--------------+----------+-----------------+----------------+--------------------+
| "0"            | "ssd-store" | "0"        | "remove-record" | "0"         | "run"        | "0"            | "0"                | "8357818281184793" | "100.00"     | "visit-duration" | "0"      | "background-udf" | "scan" | "0"            | "0"                | "done(ok)" | "23"     | "30"         | "100000" | "426111"        | "10000"        | "172.18.0.1:55448" |
+----------------+-------------+------------+-----------------+-------------+--------------+----------------+--------------------+--------------------+--------------+------------------+----------+------------------+--------+----------------+--------------------+------------+----------+--------------+----------+-----------------+----------------+--------------------+
[127.0.0.1:3000] 1 rows in set (0.013 secs)

OK

The documentation for Job lists this method as wait. Is wait or waitUntilDone the correct one?

We now have a fully functional check for the number of failed records of a background UDF job. Unfortunately, the Aerospike info parser and the per-node info command weren't giving us any results, so we have created a function that gives us correct results, although it is a bit slower.

// Note: debug() is our application logger (e.g. from the 'debug' package).
const debug = require('debug')('background-udf')

const getNoOfFailedRecords = async ({ client, job }) => {
	try {
		const jobId = job.jobID.toString()

		// Fetch the raw job list from every node and parse it manually: each node
		// response looks like "<command>\t<job1>;<job2>;...", and the fields of a
		// job are colon-separated key=value pairs.
		const jobInfos = (await client.infoAll('jobs:*'))
			.map((jobInfo) => jobInfo.info.split('\t')[1].split(';'))
			.reduce((memory, entries) => memory.concat(entries), [])
			.map((jobInfo) =>
				jobInfo.split(':').map((keyValue) => {
					const [key, value] = keyValue.split('=')
					return { key, value }
				})
			)
			.map((jobInfo) =>
				jobInfo.reduce((memory, { key, value }) => ({
					...memory,
					[key]: value
				}), {})
			)

		// Sum up recs-failed over all nodes for the job we are interested in.
		const noOfFailedRecords = jobInfos
			.filter((jobInfo) => jobInfo.trid === jobId)
			.reduce((memory, jobInfo) => memory + parseInt(jobInfo['recs-failed'], 10), 0)

		return noOfFailedRecords
	} catch (e) {
		debug(`parsing of job info failed for job ${job.jobID}`)
		throw e
	}
}
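
For reference, we call it like this once the background job has finished:

const noOfFailedRecords = await getNoOfFailedRecords({ client, job })
if (noOfFailedRecords > 0) {
	// the job only partially succeeded; handle or retry accordingly
}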

To be honest, this approach for finding out if a background job succeeded was quite unexpected.

That’s totally understandable. The client actually uses the same jobs info query under the hood, but it only checks the job progress. It would be nice if you could access the full job info via the Job instance as well. Care to file a feature request on the aerospike/aerospike-client-nodejs GitHub repository?

Unfortunately, that command gives us an error for every single execution. Those errors are as follows.

  • ~ 90% of the cases: Jan 20 2021 08:11:15 GMT: ERROR(13660) [command.cc:84] [ErrorCallback] - Client error in InfoForeach command: Info request exceeds max. length (1803363888) [-2]
  • ~ 10% of the cases: AerospikeError: job not found

I’ve occasionally encountered the “job not found” error myself, but have yet to find out what’s causing it. I have not encountered the “info request exceeds max. length” error, though. That sounds like a bug! I’m not sure why you are running into it while I am not. What client version are you using?

The documentation for infoAll states that the request can be empty, in which case all available information will be returned.

When calling infoAll with no request parameter, we get back lots of information, but it doesn't include any job information.

[…]

When calling infoAll with e.g. await client.infoAll('jobs'), only the list of nodes is returned.

The documentation is not quite accurate: an info request without a command does not return all information. The output of info commands that require extra parameters, such as the jobs command, which requires at least the module parameter, is not included by default.

You should be able to run jobs:module=scan and jobs:module=query commands and get info about all current scans/queries, though.
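
For example, something along these lines should list all scan jobs currently known to each node (a quick sketch, reusing the connected client from the earlier example; parsing is left out):

// Each node returns one response string; the jobs in it are separated by ';'
// and the fields of each job are colon-separated key=value pairs.
const responses = await client.infoAll('jobs:module=scan')
responses.forEach((nodeInfo) => console.log(nodeInfo.host, nodeInfo.info))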

I gave it another try via asinfo for one specific job: asinfo -v 'jobs:module=query;cmd=get-job;trid=8357818281184793'. As a result, we get ERROR:2:job not found, while show scans in aql lists the job with trid=8357818281184793.

aql’s show scans just queries jobs:module=scan as well.

The documentation for Job lists this method as wait. Is wait or waitUntilDone the correct one?

My bad. It used to be Job#waitUntilDone in an older client version (v2.x?), but it has since been renamed to just Job#wait. The old function name still exists as an alias, though.
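
So, with a recent client, either of these should work:

await job.wait()            // current name
await job.waitUntilDone()   // old name, still available as an alias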

Nice!

client.infoAll('jobs:*')

Ah, I didn’t know that worked! I learned something new today. 🙂