Python call to scan.foreach is hanging


#1

I have stripped my code down to a very simple call to scan.foreach with a callback function that just prints out one line of information. It’s never getting called. It appears that the call simply hangs. No exception is thrown and control doesn’t pass to the next line.

Python 2.7.10, Aerospike client 1.0.49, just under 3 billion records in a 9 node cluster.


#2

Please open a new issue with the GitHub repo. Add the simple script you’re using, and the output from AQL for show sets and show scans as you run the script and it hangs.


#3

It’s a proprietary application, so I can’t really show the namespaces and sets. When I run ‘show scans’ in aql it looks like the scan is running. I just get no output.

Here’s the stripped down code:

#!/usr/bin/env python

from __future__ import print_function
import sys
import aerospike

def connect(host,port):
    """ Connect to Aerospike. """
    config = { 'hosts': [(host,port)] }
    client = None

    try:
        client = aerospike.client(config).connect()
    except Exception as e:
        # Report why the connection failed instead of swallowing the error.
        print("Failed to connect to Aerospike: {0}".format(e))
        client = None

    return client

def check_record((key,metadata,record)):
    """ Scan callback (Python 2 tuple-parameter syntax): print one dot per record. """
    sys.stdout.write(".")
    sys.stdout.flush()

def scan_records(host,port,namespace,set):
    """
    Connect to the specified Aerospike host and port and begin scanning.
    """
    client = connect(host,port)
    if (client is not None):
        print("Connected to Aerospike.")

        print("Configuring scan.")
        scan = client.scan(namespace,set)
        print("Starting scan.")
        scan.foreach(check_record)
        print("Scan complete.")

        client.close()

if __name__ == "__main__":
    if len(sys.argv) == 5:
        scan_records(sys.argv[1],int(sys.argv[2]),sys.argv[3],sys.argv[4])
    else:
        print("Usage:  "
              + sys.argv[0]
              + " <aerospike-host> <aerospike-port> <namespace> <set>")

    sys.exit(0)

Do you see anything obviously wrong?

Thanks.


#4

Nothing is wrong with your Python code. I loaded 16M records into a set, and when I run it I get the expected output:

$ python p.py 192.168.119.3 3000 test demo
connected
True
Connected to Aerospike.
Configuring scan.
Starting scan.
....................................................................................................................................................................................................................................................................................... etc
Scan complete.

I assume something is up with your server config. Do you mind pasting your aerospike.conf here? I assume you’ve checked that it’s the same on all nodes.


#5

Sorry for the delay in replying, managed a couple of days off. Here’s the output of asinfo get-config:

transaction-queues=24;transaction-threads-per-queue=3;transaction-duplicate-threads=0;transaction-pending-limit=20;migrate-threads=1;migrate-xmit-priority=40;migrate-xmit-sleep=500;migrate-read-priority=10;migrate-read-sleep=500;migrate-xmit-hwm=10;migrate-xmit-lwm=5;migrate-max-num-incoming=256;migrate-rx-lifetime-ms=60000;proto-fd-max=15000;proto-fd-idle-ms=60000;proto-slow-netio-sleep-ms=1;transaction-retry-ms=1000;transaction-max-ms=1000;transaction-repeatable-read=false;dump-message-above-size=134217728;ticker-interval=10;microbenchmarks=false;storage-benchmarks=false;ldt-benchmarks=false;scan-priority=200;scan-sleep=1;batch-threads=4;batch-max-requests=5000;batch-priority=200;nsup-delete-sleep=100;nsup-period=120;nsup-startup-evict=true;paxos-retransmit-period=5;paxos-single-replica-limit=1;paxos-max-cluster-size=32;paxos-protocol=v3;paxos-recovery-policy=manual;write-duplicate-resolution-disable=false;respond-client-on-master-completion=false;replication-fire-and-forget=false;info-threads=16;allow-inline-transactions=true;use-queue-per-device=false;snub-nodes=false;fb-health-msg-per-burst=0;fb-health-msg-timeout=200;fb-health-good-pct=50;fb-health-bad-pct=0;auto-dun=false;auto-undun=false;prole-extra-ttl=0;max-msgs-per-type=-1;service-threads=24;fabric-workers=16;pidfile=/var/run/aerospike/asd.pid;memory-accounting=false;udf-runtime-gmax-memory=18446744073709551615;udf-runtime-max-memory=18446744073709551615;sindex-populator-scan-priority=3;sindex-data-max-memory=18446744073709551615;query-threads=6;query-worker-threads=15;query-priority=10;query-in-transaction-thread=0;query-req-in-query-thread=0;query-req-max-inflight=100;query-bufpool-size=256;query-batch-size=100;query-sleep=1;query-job-tracking=false;query-short-q-max-size=500;query-long-q-max-size=500;query-rec-count-bound=4294967295;query-threshold=10;query-untracked-time=1000000;service-address=0.0.0.0;service-port=3000;mesh-seed-address-port=xxx.yyy.zzz.40:3002;mesh-seed-address-port=xxx.yyy.zzz.41:3002;mesh-seed-address-port=xxx.yyy.zzz.42:3002;mesh-seed-address-port=xxx.yyy.zzz.43:3002;mesh-seed-address-port=xxx.yyy.zzz.44:3002;mesh-seed-address-port=xxx.yyy.zzz.45:3002;reuse-address=true;fabric-port=3001;fabric-keepalive-enabled=true;fabric-keepalive-time=1;fabric-keepalive-intvl=1;fabric-keepalive-probes=10;network-info-port=3003;enable-fastpath=true;heartbeat-mode=mesh;heartbeat-protocol=v2;heartbeat-address=xxx.yyy.zzz.40;heartbeat-port=3002;heartbeat-interval=150;heartbeat-timeout=20;enable-security=false;privilege-refresh-period=300;report-authentication-sinks=0;report-data-op-sinks=0;report-sys-admin-sinks=0;report-user-admin-sinks=0;report-violation-sinks=0;syslog-local=-1;enable-xdr=false;forward-xdr-writes=false;xdr-delete-shipping-enabled=true;xdr-nsup-deletes-enabled=false;stop-writes-noxdr=false;reads-hist-track-back=1800;reads-hist-track-slice=10;reads-hist-track-thresholds=1,8,64;writes_master-hist-track-back=1800;writes_master-hist-track-slice=10;writes_master-hist-track-thresholds=1,8,64;proxy-hist-track-back=1800;proxy-hist-track-slice=10;proxy-hist-track-thresholds=1,8,64;writes_reply-hist-track-back=1800;writes_reply-hist-track-slice=10;writes_reply-hist-track-thresholds=1,8,64;udf-hist-track-back=1800;udf-hist-track-slice=10;udf-hist-track-thresholds=1,8,64;query-hist-track-back=1800;query-hist-track-slice=10;query-hist-track-thresholds=1,8,64;query_rec_count-hist-track-back=1800;query_rec_count-hist-track-slice=10;query_rec_count-hist-track-thresholds=1,8,64

And here's the config for the specific namespace:

memory-size=216895848448;high-water-disk-pct=50;high-water-memory-pct=60;evict-tenths-pct=5;stop-writes-pct=90;cold-start-evict-ttl=4294967295;repl-factor=2;default-ttl=0;max-ttl=0;conflict-resolution-policy=generation;allow_versions=false;single-bin=false;ldt-enabled=false;ldt-page-size=8192;enable-xdr=false;sets-enable-xdr=true;ns-forward-xdr-writes=false;allow-nonxdr-writes=true;allow-xdr-writes=true;disallow-null-setname=false;total-bytes-memory=216895848448;read-consistency-level-override=off;write-commit-level-override=off;total-bytes-disk=2184292859904;defrag-lwm-pct=50;defrag-queue-min=0;defrag-sleep=1000;defrag-startup-minimum=10;flush-max-ms=1000;fsync-max-sec=0;write-smoothing-period=0;max-write-cache=67108864;min-avail-pct=5;post-write-queue=256;data-in-memory=false;dev=/dev/sdb1;dev=/dev/sdc1;dev=/dev/sdd1;dev=/dev/sde1;dev=/dev/sdf1;dev=/dev/sdg1;filesize=17179869184;writethreads=1;writecache=67108864;obj-size-hist-max=100

Anything look unusual?

Thanks again,

Patrick
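To check whether the configuration really is the same on all nodes, one option is to parse this semicolon-separated output and compare nodes key by key. The following is a minimal sketch, not an Aerospike tool: `parse_config` and `diff_configs` are hypothetical helper names, and the two sample strings are made-up fragments in the same format (the differing `scan-sleep` value is purely illustrative). Repeated keys such as `dev` or `mesh-seed-address-port` are collected into lists.

```python
from __future__ import print_function


def parse_config(raw):
    """Parse 'key=value;key=value' text into a dict mapping key -> list of values."""
    config = {}
    # Drop newlines first so output that was wrapped when pasted still parses.
    for item in raw.replace("\n", "").split(";"):
        if not item:
            continue
        key, _, value = item.partition("=")
        # Some keys (dev, mesh-seed-address-port) appear more than once.
        config.setdefault(key, []).append(value)
    return config


def diff_configs(a, b):
    """Return {key: (values_in_a, values_in_b)} for every key whose values differ."""
    return {
        key: (a.get(key), b.get(key))
        for key in set(a) | set(b)
        if a.get(key) != b.get(key)
    }


# Illustrative fragments only; paste the real get-config output of each node here.
node_a = parse_config("scan-priority=200;scan-sleep=1;dev=/dev/sdb1;dev=/dev/sdc1")
node_b = parse_config("scan-priority=200;scan-sleep=5;dev=/dev/sdb1;dev=/dev/sdc1")

for key, (value_a, value_b) in sorted(diff_configs(node_a, node_b).items()):
    print(key, value_a, value_b)
```

Settings that are expected to differ per node, such as heartbeat-address, will show up in the diff and can be ignored.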


#6

You never showed me the output of show scans, but my guess is that your new scan is waiting in line behind another long-running or frozen scan. One of the main features of the new Aerospike server CE 3.6.0 release is interlaced scans and better scan tuning (AER-2986). This means that all scans make progress in parallel.

Please upgrade to the new server and try your script again.


#7

Thanks again for the reply. Running ‘show scans’ in aql just returns “OK”. It doesn’t look like there are any running.

I’ll try to get the server upgraded, but that’s not a quick process in a production environment. Is there anything else I can do to find the root cause?


#8

Two obvious things would be (a) please post your redacted config file and (b) Aerospike offers enterprise support.


#9

I’m suggesting server 3.6.1 at this point. If this is still an issue after the upgrade, let me know.


#10

We finally upgraded our cluster from 3.5.14 to 3.7.3 and I got the opportunity to try this again. With exactly the same code as above, I now get:

Connected to Aerospike.
Configuring scan.
Starting scan.
Traceback (most recent call last):
  File "./scan-test.py", line 42, in <module>
    scan_records(sys.argv[1],int(sys.argv[2]),sys.argv[3],sys.argv[4])
  File "./scan-test.py", line 35, in scan_records
    scan.foreach(check_record)
exception.ClientError: (-1L, 'Callback function raised an exception', 'src/main/scan/foreach.c', 161)

Could the problem be that we have six nodes containing over 4 billion records?
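The ClientError in the traceback only says that the callback raised an exception; it does not show which one. A minimal sketch for surfacing it, assuming nothing about the client beyond the `scan.foreach(callback)` call already used in the script: wrap the callback so the original traceback is printed before the exception propagates. `report_errors` is a hypothetical helper name, and the callback here unpacks the tuple inside the function so the same code runs on Python 2 and 3.

```python
from __future__ import print_function
import functools
import sys
import traceback


def report_errors(callback):
    """Wrap a scan callback so any exception it raises is printed in full."""
    @functools.wraps(callback)
    def wrapper(record_tuple):
        try:
            return callback(record_tuple)
        except Exception:
            # Print the underlying traceback to stderr, then re-raise so the
            # caller still sees that the callback failed.
            traceback.print_exc(file=sys.stderr)
            raise
    return wrapper


def check_record(record_tuple):
    # Unpack inside the function instead of in the parameter list.
    key, metadata, record = record_tuple
    sys.stdout.write(".")
    sys.stdout.flush()


# Intended use with the scan object from the script:
#     scan.foreach(report_errors(check_record))
```

If the wrapped callback prints something like a NameError or TypeError, the problem is in the callback itself rather than in the cluster size.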