Secondary index missing records, can it be repaired?

#1

We recently upgraded a 3-node cluster from 3.4.1 to 3.5.9, and since the upgrade, secondary index queries sometimes fail to find records that we know exist and can be read through the primary index.

The stat index <ns> <indexname> command shows state=RW and sync_state=synced, so everything seems to be fine, but queries still come back empty sometimes.

The rate of query misses is very low (less than 3%), so we don’t think a whole node has a corrupted secondary index, but we aren’t sure how to recover the missing records.

I am considering repairing the index, but I am not sure it would help, and I would like to know the ramifications.

Would a repair fix the problem? Would the node be fully available? How would it respond to queries on the secondary index while it is rebuilding? Would it take as long as a full secondary index rebuild (on that node)?

Thanks, hubert


#2

Hi hubt,

Repair index only does anything if sync_state is not synced; otherwise it is a no-op. If the index is not in sync, a repair is similar to a drop_index followed by a create index.

Can you share details of the records that are missing? Also, can you share the output of the following commands?

  1. aql> stat index <namespace_name> <index_name>
  2. asinfo -v "statistics"

Thanks


#3

We don’t know much about why the records go missing. We also keep them in a separate database, and when we cross-check, some of them simply don’t come back from a secondary index query, even though we can find them by primary key.
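The cross-check described above can be sketched roughly as follows. This is a minimal Python sketch, not the tooling actually used in this thread: `find_sindex_misses`, `get_by_pk` and `query_by_sindex` are hypothetical names, and the two callables are stand-ins for whatever client calls perform the primary-key read and the secondary index query. Here they are backed by plain dictionaries so the logic can be run without a cluster.

```python
from typing import Callable, Hashable, Iterable, List, Optional, Tuple


def find_sindex_misses(
    expected: Iterable[Tuple[Hashable, Hashable]],
    get_by_pk: Callable[[Hashable], Optional[object]],
    query_by_sindex: Callable[[Hashable], Iterable[Hashable]],
) -> List[Hashable]:
    """Return primary keys that are readable by PK but absent from the
    secondary index query on their indexed value.

    expected        -- (primary_key, indexed_value) pairs from the reference database
    get_by_pk       -- hypothetical callable: returns the record, or None if not found
    query_by_sindex -- hypothetical callable: returns the primary keys matched by the query
    """
    misses = []
    for pk, indexed_value in expected:
        if get_by_pk(pk) is None:
            # Not readable by primary key either (e.g. expired), so not an index miss.
            continue
        if pk not in set(query_by_sindex(indexed_value)):
            misses.append(pk)
    return misses


# In-memory stand-ins: "k2" exists by primary key but has no index entry.
store = {"k1": "a", "k2": "b", "k3": "c"}
sindex = {"a": ["k1"], "c": ["k3"]}

misses = find_sindex_misses(
    expected=list(store.items()),
    get_by_pk=store.get,
    query_by_sindex=lambda value: sindex.get(value, []),
)
print(misses)  # ['k2']
```

Collecting the missed primary keys, rather than only a count, makes it possible to share the details of specific missing records.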

aql> stat index d2 d2-b-hmac;
+--------------------------------+------------+
| name                           | value      |
+--------------------------------+------------+
| "keys"                         | 57575918   |
| "objects"                      | 57591214   |
| "ibtr_memory_used"             | 2606334288 |
| "nbtr_memory_used"             | 1785358844 |
| "si_accounted_memory"          | 4391693132 |
| "load_pct"                     | 100        |
| "loadtime"                     | 15         |
| "stat_write_reqs"              | 65718849   |
| "stat_write_success"           | 65718849   |
| "stat_write_errs"              | 0          |
| "stat_delete_reqs"             | 8120472    |
| "stat_delete_success"          | 8120472    |
| "stat_delete_errs"             | 0          |
| "stat_gc_recs"                 | 8120472    |
| "stat_gc_time"                 | 27836      |
| "query_reqs"                   | 4505691    |
| "query_avg_rec_count"          | 0          |
| "query_avg_record_size"        | 372        |
| "query_agg"                    | 0          |
| "query_agg_avg_rec_count"      | 0          |
| "query_agg_avg_record_size"    | 0          |
| "query_lookups"                | 4505691    |
| "query_lookup_avg_rec_count"   | 0          |
| "query_lookup_avg_record_size" | 372        |
| "gc-period"                    | 1000       |
| "gc-max-units"                 | 1000       |
| "data-max-memory"              | -1         |
| "tracing"                      | 0          |
| "histogram"                    | "false"    |
| "ignore-not-sync"              | "true"     |
+--------------------------------+------------+
30 rows in set (0.000 secs)
+--------------------------------+------------+
| name                           | value      |
+--------------------------------+------------+
| "keys"                         | 57796684   |
| "objects"                      | 57811974   |
| "ibtr_memory_used"             | 2611923088 |
| "nbtr_memory_used"             | 1792189990 |
| "si_accounted_memory"          | 4404113078 |
| "load_pct"                     | 100        |
| "loadtime"                     | 0          |
| "stat_write_reqs"              | 57819323   |
| "stat_write_success"           | 57819323   |
| "stat_write_errs"              | 0          |
| "stat_delete_reqs"             | 0          |
| "stat_delete_success"          | 0          |
| "stat_delete_errs"             | 0          |
| "stat_gc_recs"                 | 0          |
| "stat_gc_time"                 | 0          |
| "query_reqs"                   | 4205175    |
| "query_avg_rec_count"          | 0          |
| "query_avg_record_size"        | 372        |
| "query_agg"                    | 0          |
| "query_agg_avg_rec_count"      | 0          |
| "query_agg_avg_record_size"    | 0          |
| "query_lookups"                | 4205175    |
| "query_lookup_avg_rec_count"   | 0          |
| "query_lookup_avg_record_size" | 372        |
| "gc-period"                    | 1000       |
| "gc-max-units"                 | 1000       |
| "data-max-memory"              | -1         |
| "tracing"                      | 0          |
| "histogram"                    | "false"    |
| "ignore-not-sync"              | "true"     |
+--------------------------------+------------+
30 rows in set (0.001 secs)
+--------------------------------+------------+
| name                           | value      |
+--------------------------------+------------+
| "keys"                         | 58302164   |
| "objects"                      | 58317638   |
| "ibtr_memory_used"             | 2569452192 |
| "nbtr_memory_used"             | 1807867750 |
| "si_accounted_memory"          | 4377319942 |
| "load_pct"                     | 100        |
| "loadtime"                     | 0          |
| "stat_write_reqs"              | 86386427   |
| "stat_write_success"           | 86386427   |
| "stat_write_errs"              | 0          |
| "stat_delete_reqs"             | 25329767   |
| "stat_delete_success"          | 25329736   |
| "stat_delete_errs"             | 31         |
| "stat_gc_recs"                 | 25329384   |
| "stat_gc_time"                 | 126175     |
| "query_reqs"                   | 11287195   |
| "query_avg_rec_count"          | 0          |
| "query_avg_record_size"        | 374        |
| "query_agg"                    | 0          |
| "query_agg_avg_rec_count"      | 0          |
| "query_agg_avg_record_size"    | 0          |
| "query_lookups"                | 11287195   |
| "query_lookup_avg_rec_count"   | 0          |
| "query_lookup_avg_record_size" | 374        |
| "gc-period"                    | 1000       |
| "gc-max-units"                 | 1000       |
| "data-max-memory"              | -1         |
| "tracing"                      | 0          |
| "histogram"                    | "false"    |
| "ignore-not-sync"              | "true"     |
+--------------------------------+------------+
30 rows in set (0.000 secs)
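To compare the three per-node tables above without reading them by eye, something like the following could be used. It is only a sketch: `parse_aql_tables` is a hypothetical helper that assumes the table layout shown above (quoted stat name, then value), and the sample is a trimmed excerpt of two of the tables. Whether any particular counter relates to the missing records is not established here.

```python
import re
from typing import Dict, List

# Matches data rows such as: | "keys"     | 57575918   |
ROW = re.compile(r'^\|\s*"(?P<name>[^"]+)"\s*\|\s*(?P<value>\S+)\s*\|$')


def parse_aql_tables(text: str) -> List[Dict[str, str]]:
    """Split pasted 'stat index' output into one dict per table (one per node)."""
    tables: List[Dict[str, str]] = []
    current: Dict[str, str] = {}
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            current[match.group("name")] = match.group("value").strip('"')
        elif "rows in set" in line and current:
            # The "N rows in set" footer closes a table.
            tables.append(current)
            current = {}
    if current:
        tables.append(current)
    return tables


# Trimmed excerpt of the output above (first and third tables).
sample = '''\
+--------------------------------+------------+
| name                           | value      |
+--------------------------------+------------+
| "keys"                         | 57575918   |
| "stat_delete_errs"             | 0          |
+--------------------------------+------------+
30 rows in set (0.000 secs)
+--------------------------------+------------+
| name                           | value      |
+--------------------------------+------------+
| "keys"                         | 58302164   |
| "stat_delete_errs"             | 31         |
+--------------------------------+------------+
30 rows in set (0.000 secs)
'''

tables = parse_aql_tables(sample)
for position, table in enumerate(tables, start=1):
    print(position, table["keys"], table["stat_delete_errs"])
```

The tables are numbered by their position in the paste, since the output itself does not name the node each table came from.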

asinfo -v "statistics"
cluster_size=3;cluster_key=5A1C3D978F451FD5;cluster_integrity=true;objects=252583394;sub-records=92;total-bytes-disk=1073741824000;used-bytes-disk=105891048320;free-pct-disk=90;total-bytes-memory=60129542144;used-bytes-memory=25329259920;data-used-bytes-memory=0;index-used-bytes-memory=16165343104;sindex-used-bytes-memory=9163916816;free-pct-memory=57;stat_read_reqs=956114896;stat_read_reqs_xdr=0;stat_read_success=941335096;stat_read_errs_notfound=14779800;stat_read_errs_other=0;stat_write_reqs=626821537;stat_write_reqs_xdr=0;stat_write_success=524093552;stat_write_errs=102668727;stat_xdr_pipe_writes=0;stat_xdr_pipe_miss=0;stat_delete_success=57762391;stat_rw_timeout=59261;udf_read_reqs=358625840;udf_read_success=358547300;udf_read_errs_other=78540;udf_write_reqs=136308786;udf_write_success=136308786;udf_write_err_others=0;udf_delete_reqs=0;udf_delete_success=0;udf_delete_err_others=0;udf_lua_errs=1;udf_scan_rec_reqs=0;udf_query_rec_reqs=0;udf_replica_writes=0;stat_proxy_reqs=13286;stat_proxy_reqs_xdr=0;stat_proxy_success=1103;stat_proxy_errs=0;stat_ldt_proxy=0;stat_cluster_key_trans_to_proxy_retry=0;stat_cluster_key_transaction_reenqueue=0;stat_slow_trans_queue_push=274730153;stat_slow_trans_queue_pop=274730153;stat_slow_trans_queue_batch_pop=82433;stat_cluster_key_regular_processed=0;stat_cluster_key_prole_retry=0;stat_cluster_key_err_ack_dup_trans_reenqueue=0;stat_cluster_key_partition_transaction_queue_count=0;stat_cluster_key_err_ack_rw_trans_reenqueue=1;stat_expired_objects=1369501;stat_evicted_objects=0;stat_deleted_set_objects=0;stat_evicted_set_objects=0;stat_evicted_objects_time=0;stat_zero_bin_records=0;stat_nsup_deletes_not_shipped=2576539;err_tsvc_requests=102711807;err_out_of_space=0;err_duplicate_proxy_request=0;err_rw_request_not_found=135;err_rw_pending_limit=88;err_rw_cant_put_unique=0;fabric_msgs_sent=1841538608;fabric_msgs_rcvd=1841538201;paxos_principal=BB97D5B97817A06;migrate_msgs_sent=771302903;migrate_msgs_recv=771313325;migrate_progress_send=0;migrate_progress_recv=0;migrate_num_incoming_accepted=3138;migrate_num_incoming_refused=0;queue=0;transactions=1623644228;reaped_fds=23783;tscan_initiate=10;tscan_pending=0;tscan_succeeded=10;tscan_aborted=0;batch_initiate=0;batch_queue=0;batch_tree_count=0;batch_timeout=0;batch_errors=0;info_queue=0;delete_queue=0;proxy_in_progress=0;proxy_initiate=13286;proxy_action=58695;proxy_retry=4;proxy_retry_q_full=0;proxy_unproxy=0;proxy_retry_same_dest=0;proxy_retry_new_dest=0;write_master=626821537;write_prole=489684205;read_dup_prole=85904444;rw_err_dup_internal=0;rw_err_dup_cluster_key=91559;rw_err_dup_send=0;rw_err_write_internal=0;rw_err_write_cluster_key=0;rw_err_write_send=0;rw_err_ack_internal=0;rw_err_ack_nomatch=59206;rw_err_ack_badnode=0;client_connections=308;waiting_transactions=0;tree_count=0;record_refs=252612702;record_locks=2;migrate_tx_objs=0;migrate_rx_objs=0;ongoing_write_reqs=1;err_storage_queue_full=0;partition_actual=1396;partition_replica=1354;partition_desync=0;partition_absent=1346;partition_object_count=252583394;partition_ref_count=4097;system_free_mem_pct=51;sindex_ucgarbage_found=0;sindex_gc_locktimedout=3736;sindex_gc_inactivity_dur=1402025985;sindex_gc_activity_dur=5888015;sindex_gc_list_creation_time=5640487;sindex_gc_list_deletion_time=244382;sindex_gc_objects_validated=1406416783;sindex_gc_garbage_found=51177087;sindex_gc_garbage_cleaned=51177087;system_swapping=false;err_replica_null_node=0;err_replica_non_null_node=0;err_sync_copy_null_node=0;err_sync_copy_null_master=0;storage_defrag_corrupt_record=0;err_write_fail_prole_unknown=0;err_write_fail_prole_generation=0;err_write_fail_unknown=0;err_write_fail_key_exists=102627936;err_write_fail_generation=193403;err_write_fail_generation_xdr=0;err_write_fail_bin_exists=0;err_write_fail_parameter=0;err_write_fail_incompatible_type=0;err_write_fail_noxdr=0;err_write_fail_prole_delete=9200;err_write_fail_not_found=0;err_write_fail_key_mismatch=0;err_write_fail_record_too_big=0;err_write_fail_bin_name=0;err_write_fail_bin_not_found=0;err_write_fail_forbidden=0;stat_duplicate_operation=13514512;uptime=1460824;stat_write_errs_notfound=5;stat_write_errs_other=102668722;heartbeat_received_self=0;heartbeat_received_foreign=45683653;query_reqs=20651482;query_success=20651476;query_fail=84966;query_abort=1;query_avg_rec_count=0;query_short_queue_size=0;query_long_queue_size=0;query_short_queue_full=84961;query_long_queue_full=0;query_short_running=20566503;query_long_running=13;query_tracked=3833;query_agg=0;query_agg_success=0;query_agg_err=0;query_agg_abort=0;query_agg_avg_rec_count=0;query_lookups=20651477;query_lookup_success=20651476;query_lookup_err=0;query_lookup_abort=1;query_lookup_avg_rec_count=0
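Since the statistics output is one long run of `name=value` pairs separated by semicolons, it is easier to inspect once parsed. The following is a minimal Python sketch; `parse_info_stats` is a hypothetical helper name, and the sample string holds a few query counters copied from the output above, wrapped on purpose to show that line breaks from pasting are tolerated.

```python
from typing import Dict


def parse_info_stats(raw: str) -> Dict[str, str]:
    """Turn 'name=value;name=value' text into a dict of strings."""
    # Remove line breaks first so a wrapped paste does not split a stat name.
    flat = raw.replace("\r", "").replace("\n", "")
    stats: Dict[str, str] = {}
    for item in flat.split(";"):
        name, sep, value = item.partition("=")
        if sep:  # skip empty or malformed fragments
            stats[name.strip()] = value.strip()
    return stats


# A few counters copied from the output above; the line break is deliberate.
sample = (
    "query_reqs=20651482;query_success=20651476;query_fail=84966;"
    "query_short_queue_\nfull=84961;query_lookup_err=0"
)

stats = parse_info_stats(sample)
fail_pct = 100.0 * int(stats["query_fail"]) / int(stats["query_reqs"])
print(f"query_fail: {fail_pct:.2f}% of query_reqs")
print("query_short_queue_full:", stats["query_short_queue_full"])
```

Values are kept as strings because the output mixes integers, booleans and hex identifiers; individual counters are converted only where arithmetic is needed.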

#4

Are there other options for us to safely rebuild the index? For instance, we could restart each node successively and then see if the secondary index is rebuilt.

Would that take a while because it could force many migrations (we have a replication factor of 2 in a cluster of 3)? Or would it cancel migrations when the server comes back up? Would secondary index queries be satisfied by the other nodes while the restarted node is rebuilding its own secondary index? Or would they fail?

Dropping and recreating the index would leave us without secondary index queries during a fairly long rebuild, so I’m reluctant to do that.

Thanks, hubert


#5

Hi Hubert,

Sorry for the late reply.

Dropping and recreating the index is a better option than restarting the whole cluster. However, we recently found a bug that can cause secondary index query results to miss some records that are present in the primary index. Can you send us the output of the following command?

aql > qnode

This output will help us verify whether you are hitting this bug.

Thanks