Hey guys, please check the details below
The problem
While running a batch of Celery tasks, some of them randomly fail with:
[2017-02-23 22:55:04,802: ERROR/MainProcess] Task handler raised error: WorkerLostError('Worker exited prematurely: signal 11 (SIGSEGV).',)
Traceback (most recent call last):
File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/billiard/pool.py", line 1224, in mark_as_worker_lost
human_status(exitcode)),
billiard.exceptions.WorkerLostError: Worker exited prematurely: signal 11 (SIGSEGV).
faulthandler gives this traceback (a note on enabling faulthandler follows right after it):
Thread 0x00007f9c7895c700 (most recent call first):
File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/billiard/connection.py", line 437 in _recv
File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/billiard/connection.py", line 462 in _recv_bytes
File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/billiard/connection.py", line 248 in recv_bytes
File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/billiard/queues.py", line 355 in get_payload
File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/billiard/pool.py", line 422 in _recv
File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/billiard/pool.py", line 461 in receive
File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/billiard/pool.py", line 359 in workloop
File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/billiard/pool.py", line 290 in __call__
File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/billiard/process.py", line 112 in run
File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/billiard/process.py", line 306 in _bootstrap
File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/billiard/popen_fork.py", line 79 in _launch
File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/billiard/popen_fork.py", line 24 in __init__
File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/billiard/context.py", line 333 in _Popen
File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/billiard/process.py", line 122 in start
File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/billiard/pool.py", line 1117 in _create_worker_process
File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/celery/concurrency/asynpool.py", line 439 in _create_worker_process
File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/billiard/pool.py", line 1008 in __init__
File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/celery/concurrency/asynpool.py", line 422 in __init__
File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/celery/concurrency/prefork.py", line 112 in on_start
File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/celery/concurrency/base.py", line 131 in start
File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/celery/bootsteps.py", line 370 in start
File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/celery/bootsteps.py", line 119 in start
File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/celery/worker/worker.py", line 203 in start
File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/celery/bin/worker.py", line 256 in run
File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/celery/bin/base.py", line 244 in __call__
File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/celery/bin/worker.py", line 221 in run_from_argv
File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/celery/bin/celery.py", line 412 in execute
File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/celery/bin/celery.py", line 480 in handle_argv
File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/celery/bin/base.py", line 281 in execute_from_commandline
File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/celery/bin/celery.py", line 488 in execute_from_commandline
File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/celery/bin/celery.py", line 326 in main
File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/celery/__main__.py", line 14 in main
File "/home/nike/.virtualenvs/lithedata/bin/celery", line 11 in <module>
Fatal Python error: Segmentation fault
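For anyone reproducing this: the traceback above comes from Python's built-in faulthandler module. A minimal sketch of how it can be enabled at worker startup (setting PYTHONFAULTHANDLER=1 in the worker's environment has the same effect):

# Enable faulthandler as early as possible so a SIGSEGV dumps the
# Python-level stack of every thread to stderr / the log file.
import faulthandler
faulthandler.enable()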
When I set the Celery concurrency to 1, the issue never happens. With concurrency=4, a few tasks out of 10k always fail.
How to reproduce
Installed packages
aerospike==2.0.6
amazon-kclpy==1.4.3
amqp==2.1.4
appdirs==1.4.0
argh==0.26.2
awscli==1.11.41
Babel==2.3.4
beautifulsoup4==4.5.3
billiard==3.5.0.2
boto==2.45.0
botocore==1.5.4
celery==4.0.2
colorama==0.3.7
cookies==2.2.1
coverage==4.3.4
docutils==0.13.1
flower==0.9.1
inotify==0.2.8
jmespath==0.9.0
kombu==4.0.2
mock==2.0.0
nltk==3.2.2
packaging==16.8
pathtools==0.1.2
pbr==1.10.0
psutil==5.1.3
pyasn1==0.1.9
pycurl==7.43.0
pyparsing==2.1.10
python-dateutil==2.6.0
pytz==2016.10
PyYAML==3.12
redis==2.10.5
requests==2.13.0
rsa==3.4.2
s3transfer==0.1.10
six==1.10.0
tornado==4.2
vine==1.1.3
watchdog==0.8.3
The failing task, reduced to the absolute minimum:
# aerospike is the official Python client; `app` is the project's Celery app instance
import aerospike

@app.task(ignore_result=True, bind=True)
def aerospike_crash_task(self):
    config = {
        'hosts': [('127.0.0.1', 3000)],
        'policies': {
            'key': aerospike.POLICY_KEY_SEND
        }
    }
    aerospike_client = aerospike.client(config).connect()
    key = ('recommendations', 'setname', 1)
    aerospike_client.put(key, {'bin': 'value'})
    aerospike_client.close()
The script to queue 10k tasks:
if __name__ == "__main__":
    # aerospike_crash_task is imported from the project's task module
    for i in range(1, 10000):
        aerospike_crash_task.delay()
My celery command
celery worker -A ld.queue -E --logfile=/var/log/celery/celery.log -Ofair --concurrency=4
Extra details
When I comment out the aerospike code, the tasks execute fine. I tried the same thing with the redis client:
import redis

@app.task(ignore_result=True, bind=True)
def aerospike_crash_task(self):
    r = redis.StrictRedis(host='localhost', port=6379, db=0)
    r.set('foo', 'bar')
And it works just fine. The OS has enough sockets/file descriptors; I'm using the settings from http://www.linuxbrigade.com/reduce-time_wait-socket-connections/ to reuse connections. Typical numbers while running the test tasks are:
while true; do netstat -nat | awk '{print $6}' | sort | uniq -c | sort -n; sleep 1; done
1 established)
1 Foreign
40 LISTEN
43 FIN_WAIT1
87 ESTABLISHED
601 FIN_WAIT2
645 CLOSE_WAIT
1206 TIME_WAIT
If I comment out the put call, the issue is still there, so it seems to be related to the connection handling / connection pools rather than the write itself (see the reduced version below).
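For reference, this is roughly what the reduced task looks like with the put removed; it does nothing except construct, connect and close the client, and it still segfaults under concurrency=4:

@app.task(ignore_result=True, bind=True)
def aerospike_crash_task(self):
    # No put at all: just client construction, connect and close
    config = {
        'hosts': [('127.0.0.1', 3000)],
        'policies': {'key': aerospike.POLICY_KEY_SEND}
    }
    aerospike_client = aerospike.client(config).connect()
    aerospike_client.close()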
Any clue on this one? How can I help debug it further? This is on Ubuntu 16.04.
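If it helps narrow things down, one variant I can test is creating the aerospike client once per worker process (after the fork) instead of once per task, using Celery's worker_process_init signal. This is just a sketch, not something I have verified yet:

import aerospike
from celery.signals import worker_process_init

aerospike_client = None

@worker_process_init.connect
def init_aerospike(**kwargs):
    # Runs in each pool worker process right after it is forked, so the
    # client and its C-level state are never shared across a fork.
    global aerospike_client
    config = {
        'hosts': [('127.0.0.1', 3000)],
        'policies': {'key': aerospike.POLICY_KEY_SEND}
    }
    aerospike_client = aerospike.client(config).connect()

@app.task(ignore_result=True, bind=True)
def aerospike_crash_task(self):
    aerospike_client.put(('recommendations', 'setname', 1), {'bin': 'value'})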
Thanks!