Random celery worker segfault


#1

Hey guys, please check the details below

The problem

While running some celery tasks they randomly fail with

[2017-02-23 22:55:04,802: ERROR/MainProcess] Task handler raised error: WorkerLostError('Worker exited prematurely: signal 11 (SIGSEGV).',)
Traceback (most recent call last):
File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/billiard/pool.py", line 1224, in mark_as_worker_lost
    human_status(exitcode)),
billiard.exceptions.WorkerLostError: Worker exited prematurely: signal 11 (SIGSEGV).

faulthandler gives this traceback

    Thread 0x00007f9c7895c700 (most recent call first):
  File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/billiard/connection.py", line 437 in _recv
  File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/billiard/connection.py", line 462 in _recv_bytes
  File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/billiard/connection.py", line 248 in recv_bytes
  File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/billiard/queues.py", line 355 in get_payload
  File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/billiard/pool.py", line 422 in _recv
  File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/billiard/pool.py", line 461 in receive
  File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/billiard/pool.py", line 359 in workloop
  File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/billiard/pool.py", line 290 in __call__
  File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/billiard/process.py", line 112 in run
  File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/billiard/process.py", line 306 in _bootstrap
  File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/billiard/popen_fork.py", line 79 in _launch
  File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/billiard/popen_fork.py", line 24 in __init__
  File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/billiard/context.py", line 333 in _Popen
  File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/billiard/process.py", line 122 in start
  File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/billiard/pool.py", line 1117 in _create_worker_process
  File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/celery/concurrency/asynpool.py", line 439 in _create_worker_process
  File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/billiard/pool.py", line 1008 in __init__
  File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/celery/concurrency/asynpool.py", line 422 in __init__
  File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/celery/concurrency/prefork.py", line 112 in on_start
  File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/celery/concurrency/base.py", line 131 in start
  File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/celery/bootsteps.py", line 370 in start
  File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/celery/bootsteps.py", line 119 in start
  File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/celery/worker/worker.py", line 203 in start
  File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/celery/bin/worker.py", line 256 in run
  File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/celery/bin/base.py", line 244 in __call__
  File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/celery/bin/worker.py", line 221 in run_from_argv
  File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/celery/bin/celery.py", line 412 in execute
  File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/celery/bin/celery.py", line 480 in handle_argv
  File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/celery/bin/base.py", line 281 in execute_from_commandline
  File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/celery/bin/celery.py", line 488 in execute_from_commandline
  File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/celery/bin/celery.py", line 326 in main
  File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/celery/__main__.py", line 14 in main
  File "/home/nike/.virtualenvs/lithedata/bin/celery", line 11 in <module>
Fatal Python error: Segmentation fault

When I set celery concurrency to 1 the issue never happens. A few tasks out of 10k always fails with concurrency=4.

How to reproduce

Installed packages

aerospike==2.0.6
amazon-kclpy==1.4.3
amqp==2.1.4
appdirs==1.4.0
argh==0.26.2
awscli==1.11.41
Babel==2.3.4
beautifulsoup4==4.5.3
billiard==3.5.0.2
boto==2.45.0
botocore==1.5.4
celery==4.0.2
colorama==0.3.7
cookies==2.2.1
coverage==4.3.4
docutils==0.13.1
flower==0.9.1
inotify==0.2.8
jmespath==0.9.0
kombu==4.0.2
mock==2.0.0
nltk==3.2.2
packaging==16.8
pathtools==0.1.2
pbr==1.10.0
psutil==5.1.3
pyasn1==0.1.9
pycurl==7.43.0
pyparsing==2.1.10
python-dateutil==2.6.0
pytz==2016.10
PyYAML==3.12
redis==2.10.5
requests==2.13.0
rsa==3.4.2
s3transfer==0.1.10
six==1.10.0
tornado==4.2
vine==1.1.3
watchdog==0.8.3

The failing task reduced to absolute minimum

@app.task(ignore_result=True, bind=True)
def aerospike_crash_task(self):
    config = {
        'hosts': [('127.0.0.1', 3000)],
        'policies': {
            'key': aerospike.POLICY_KEY_SEND
        }
    }
    aerospike_client = aerospike.client(config).connect()
    key = ('recommendations', 'setname', 1)
    aerospike_client.put(key, {'bin': 'value'})
    aerospike_client.close()

The script to set 10k tasks

if __name__ == "__main__":
    for i in range(1, 10000):
        aerospike_crash_task.delay()

My celery command celery worker -A ld.queue -E --logfile=/var/log/celery/celery.log -Ofair --concurrency=4

Extra details

When I comment out the aerospike code the tasks executes fine. I tried the same with redis client

@app.task(ignore_result=True, bind=True)
def aerospike_crash_task(self, i):
    r = redis.StrictRedis(host='localhost', port=6379, db=0)
    r.set('foo', 'bar')

And it works just fine. The os has enough sockets/fd, I’m using settings from http://www.linuxbrigade.com/reduce-time_wait-socket-connections/ to reuse connections. Typical numbers while running the test task are

while true; do netstat -nat | awk '{print $6}' | sort | uniq -c | sort -n; sleep 1; done

  1 established)
  1 Foreign
 40 LISTEN
 43 FIN_WAIT1
 87 ESTABLISHED
601 FIN_WAIT2
645 CLOSE_WAIT
1206 TIME_WAIT

If I comment out the “put” command the issue is still there so it should be something with connection pools

Any clue on this one? How can I help to debug it further? It’s on Ubuntu 16.04

Thanks!


#2

Will probably help the devs to have the stacktrace of the component that faulted. Look into https://docs.python.org/3/library/faulthandler.html.

Disregard the following, these are processes, not threads :slight_smile:.

BTW, the Aerospike client is thread safe so you should only need one per server. Does celery support sharing the client object per machine?


#3

Also, is this running on a single machine?

What is the max file descriptors (ulimit -n) configured to?


#4

Thanks for your quick reply. Everything is running on a single local machine. Ulimit -n gives 4096.

Sorry, do you need something else besides the trace I provided in the original message? This slightly modified code is better because then it appears in the backtrace

def aerospike_connect():
    config = {
        'hosts': [('127.0.0.1', 3000)],
        'policies': {
            'key': aerospike.POLICY_KEY_SEND
        }
    }
    aerospike_client = aerospike.client(config).connect()

    key = ('recommendations', 'setname', 1)
    aerospike_client.put(key, {'bin': 'value'})
    aerospike_client.close()

@app.task(ignore_result=True, bind=True)
def aerospike_crash_task(self, i):
    aerospike_connect()

Basically the same backtrace but pointing to aerospike code

Fatal Python error: Segmentation fault

Thread 0x00007f7889df0700 (most recent call first):
  File "/home/nike/projects/lithedata/ld/queue/tasks.py", line 132 in aerospike_connect
  File "/home/nike/projects/lithedata/ld/queue/tasks.py", line 136 in aerospike_crash_task
  File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/celery/app/trace.py", line 622 in __protected_call__
  File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/celery/app/trace.py", line 367 in trace_task
  File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/celery/app/trace.py", line 530 in _fast_trace_task
  File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/billiard/pool.py", line 359 in workloop
  File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/billiard/pool.py", line 290 in __call__
  File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/billiard/process.py", line 112 in run
  File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/billiard/process.py", line 306 in _bootstrap
  File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/billiard/popen_fork.py", line 79 in _launch
  File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/billiard/popen_fork.py", line 24 in __init__
  File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/billiard/context.py", line 333 in _Popen
  File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/billiard/process.py", line 122 in start
  File "/home/nike/.virtualenvs/lithedata/lib/pytld.queue.tasks.aerospike_crash_task.aerospike_crash_task:140: 5169
hon3.5/site-packages/billiard/pool.py", line 1117 in _create_worker_process
  File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/celery/concurrency/asynpool.py", line 439 in _create_worker_process
  File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/billiard/pool.py", line 1008 in __init__
  File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/celery/concurrency/asynpool.py", line 422 in __init__
  File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/celery/concurrency/prefork.py", line 112 in on_start
  File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/celery/concurrency/base.py", line 131 in start
  File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/celery/bootsteps.py", line 370 in start
  File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/celery/bootsteps.py", line 119 in start
  File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/celery/worker/worker.py", line 203 in start
  File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/celery/bin/worker.py", line 256 in run
  File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/celery/bin/base.py", line 244 in __call__
  File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/celery/bin/worker.py", line 221 in run_from_argv
  File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/celery/bin/celery.py", line 412 in execute
  File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/celery/bin/celery.py", line 480 in handle_argv
  File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/celery/bin/base.py", line 281 in execute_from_commandline
  File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/celery/bin/celery.py", line 488 in execute_from_commandline
  File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/celery/bin/celery.py", line 326 in main
  File "/home/nike/.virtualenvs/lithedata/lib/python3.5/site-packages/celery/__main__.py", line 14 in main
  File "/home/nike/.virtualenvs/lithedata/bin/celery", line 11 in <module>

#5

I was after the line numbers in the client code itself which is written in C. Though seems I was mistaken on what the faulthandler module provided.

I have messaged the developer for this module.

It was mentioned in this conversation that calling connect twice in a process caused a segfault. Not sure if this has been addressed so it may be related.