Unbalanced distribution of master objects across nodes


#1

We are experiencing an issue where one of our nodes gets twice as many “master objects” as the other nodes in the cluster.

The distribution of partitions among nodes seems to be fine. At least according to asmonitor -e "pmap [namespace]", all nodes got a more or less equal number of partitions.

It could then be that one of the partitions has significantly more objects than the others, but I don’t know how to prove it. Is it possible to print out information about the distribution of objects among partitions?

Even if that seems to be quite a likely scenario, there are two things which concern me:

  1. Replica objects seem to be well balanced across nodes. If one partition were significantly bigger, I would expect the replica object count on one of the nodes to be significantly bigger than on the others as well.
  2. What would have to happen to the key values to cause such a huge disproportion in partition sizes?

Do you have any idea how to track down this issue?


#2

Yes, if a partition has significantly more records than the rest, then there must be a replica partition with the same characteristic.

The first 12 bits of the digest determine the partition ID. If you were to manually create digests with the same leading 12 bits, they would all be in the same partition.
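To make the digest-to-partition mapping concrete, here is a minimal sketch. It assumes a 20-byte digest and that the 12 bits are taken from the first two digest bytes read as a little-endian integer; that byte order is an assumption to verify against your server and client version. SHA-1 is used only as a stand-in hash because it also yields 20 bytes; it is not the digest the client computes. The helper names are hypothetical.

```python
import hashlib
from collections import Counter

N_PARTITIONS = 4096  # 2**12, one partition per 12-bit partition ID


def partition_id(digest: bytes) -> int:
    """Map a 20-byte record digest to a partition ID in [0, 4095].

    ASSUMPTION: the 12 bits come from the first two digest bytes read
    as a little-endian integer. Verify this for your version.
    """
    return int.from_bytes(digest[:2], "little") & (N_PARTITIONS - 1)


def stand_in_digest(key: str) -> bytes:
    # Stand-in only: SHA-1 happens to produce 20 bytes.
    # It is NOT the digest the client library generates.
    return hashlib.sha1(key.encode()).digest()


# Well-mixed digests should spread roughly evenly over all partitions.
counts = Counter(partition_id(stand_in_digest(f"user-{i}")) for i in range(200_000))
mean = sum(counts.values()) / N_PARTITIONS
print(f"partitions hit: {len(counts)}, mean: {mean:.1f}, "
      f"max/mean: {max(counts.values()) / mean:.2f}")

# Digests forced to share their first two bytes all land in one partition.
forced = {partition_id(b"\x2a\x01" + stand_in_digest(str(i))[2:]) for i in range(1000)}
print(forced)
```

With a well-mixed hash, the max/mean ratio for a large key set should stay close to 1, so a 60% skew on real data would point at something other than the hash itself.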

  1. What I suspect has happened here is that you have a node that is master for twice as many partitions as the rest.

Could you please run: asadm -e "show stat namespace like object"


#3

I’ve used asmonitor pmap to check that scenario, and it turned out that the partitions are more or less well balanced:

265 272 250 245 262 263 253 248 259 249 263 267 244 263 253 235

What starts to worry me is that pmap has begun to report something like this, which I don’t think it was doing yesterday:

153 Did not get partition
1916 Did not get partition
1928 Did not get partition
2055 Did not get partition
2422 Did not get partition

Do you know what that means?
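A quick arithmetic check on the numbers in this post, as a sketch. It assumes the sixteen values are per-node master partition counts and that the namespace has 2^12 = 4096 partitions (from the 12-bit partition ID). Under those assumptions the counts sum to 4091, which is 5 short of 4096 and matches the five "Did not get partition" lines, so the two symptoms may be the same five partitions.

```python
# Per-node partition counts copied from the pmap output in this post.
counts = [265, 272, 250, 245, 262, 263, 253, 248,
          259, 249, 263, 267, 244, 263, 253, 235]

N_PARTITIONS = 2 ** 12  # ASSUMPTION: 4096 partitions, from the 12-bit partition ID

total = sum(counts)
mean = total / len(counts)
unaccounted = N_PARTITIONS - total

print(f"nodes: {len(counts)}, total: {total}, unaccounted: {unaccounted}")
print(f"mean: {mean:.1f}, min/mean: {min(counts) / mean:.2f}, "
      f"max/mean: {max(counts) / mean:.2f}")
```

The spread between the smallest and largest node is within roughly 10% of the mean, which is nowhere near the factor of two seen in master objects.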

Yesterday I started to play with asinfo -v "partition-info". If we assume that the third column from the right is the number of records in a given partition, then it looks like we have partitions with 60% more records than the average at one extreme and 20% fewer at the other. There are 200 partitions that have at least 40% more records than the average.
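The check described above could be scripted roughly as follows. This is only a sketch: the row and field separators (';' or newline, and ':') are assumptions about the partition-info text, and so is the reading of the third field from the right as the record count. The function names and the sample rows are made up for illustration and are not real server output.

```python
from statistics import mean


def records_per_partition(raw: str, col_from_right: int = 3) -> list[int]:
    """Extract one integer per row from raw partition-info text.

    ASSUMPTIONS: rows are separated by ';' or newlines, fields by ':',
    and the record count is the third field from the right. Rows with no
    integer in that position (e.g. a header) are skipped.
    """
    counts = []
    for row in raw.replace("\n", ";").split(";"):
        fields = row.strip().split(":")
        if len(fields) < col_from_right:
            continue
        try:
            counts.append(int(fields[-col_from_right]))
        except ValueError:
            continue  # header or malformed row
    return counts


def skew_report(counts: list[int], threshold: float = 1.4) -> dict:
    """Summarise how far partition sizes deviate from the average."""
    avg = mean(counts)
    return {
        "partitions": len(counts),
        "mean": avg,
        "max_over_mean": max(counts) / avg,
        "min_over_mean": min(counts) / avg,
        # Partitions with at least `threshold` times the average record count.
        "above_threshold": sum(c >= threshold * avg for c in counts),
    }


# Made-up rows purely to exercise the parser; NOT real server output.
sample = ("ns:pid:state:records:x:y;test:0:S:100:0:0;test:1:S:160:0:0;"
          "test:2:S:80:0:0;test:3:S:60:0:0")
print(skew_report(records_per_partition(sample)))
```

If a row contains master and replica copies of the same partition, filter on the state column first so that each partition is counted once.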


#4

Two more things:

  1. The digest is generated by the Java client API (we do not generate it ourselves).
  2. We have far more than 100M records.

#5

Could you provide the output of:

asadm -e "info"

You may need to sanitize IP addresses and namespace names.

I’m not sure what this would indicate; I also see the same problem locally for all partitions. This could be a bug in asmonitor, which also uses an info command that has been obsolete for over a year now.

Asmonitor is in the process of being replaced by asadm, but there currently isn’t an implementation of pmap in asadm.