Version: 4.9.6 (Sernet Samba)
Over the weekend, one of our CTDB clusters started logging the following error excessively on two (out of four) nodes, which caused the OS disk to fill up at a rapid pace.
2019/11/25 09:22:09.332557 ctdbd: High hopcount 1434489998 dbid:locking.tdb key:0x0ea718c8 reqid=5c33ffa0 pnn:1 src:1 lmaster:0 header->dmaster:0 dst:0
After forcing a CTDB recovery (`ctdb recover`), the issue stopped.
Can you attach relevant logs from all the nodes when this message started happening?
Created attachment 15650 [details]
CTDB node logs (4 Nodes)
I've attached the CTDB logs from the nodes in the cluster.
As the log file consumed all available storage space on node 01 and node 02, I had to trim the log file, so there is no logging available on these nodes before the issue started.
Sorry to take so long to get back to you...
Are there any logs available from when this issue started? Or is that what you've provided? In the rare cases where we see this type of issue, the hop counts usually start off much lower.
I'm afraid that I no longer have logging from the time of the incident, only the CTDB logs provided previously. Could you advise what the error means/relates to?
TL;DR: That volume of that message means we have a bug...
In a CTDB distributed database each record key has an associated "location master" node, which always knows which node has the record. When a node is asked for a record, if it doesn't have the data (i.e. is not the "data master") then it asks the location master to send it the record. If the record exists but the location master doesn't have it, then the request is forwarded to the current data master. It can happen that the data master changes while a request is in flight, due to other requests occurring at the same time.
So CTDB chases records around the cluster until it finds them. The number of nodes traversed in the chase is called the "hop count". This number is usually small. Even when there is extreme contention for a record we rarely see a hop count in the 100s.
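To illustrate the chase, here is a minimal Python sketch (not CTDB source code; the data structures and function names are invented for illustration) of following stale data-master pointers from node to node, incrementing a hop count at each step:

```python
# Hypothetical sketch of record chasing: each node keeps a (possibly stale)
# pointer to the node it believes is the data master for a key. A request
# follows these pointers, counting hops, until it reaches the node that
# points to itself, i.e. the actual data master.

def chase_record(nodes, lmaster, key, max_hops=1000):
    """Follow data-master pointers from the location master until the
    node actually holding the record is found; return (node, hopcount)."""
    hopcount = 0
    current = lmaster  # start at the location master
    while nodes[current]["dmaster"][key] != current:
        current = nodes[current]["dmaster"][key]
        hopcount += 1
        if hopcount > max_hops:
            raise RuntimeError("record location corrupted?")
    return current, hopcount

# Three nodes; node 0 is the lmaster; the record currently lives on node 2,
# but node 0's pointer is one migration behind.
nodes = {
    0: {"dmaster": {"k": 1}},  # lmaster's stale pointer says node 1
    1: {"dmaster": {"k": 2}},  # node 1 has since passed the record to node 2
    2: {"dmaster": {"k": 2}},  # node 2 actually holds the record
}
print(chase_record(nodes, lmaster=0, key="k"))  # (2, 2): found after 2 hops
```

In a healthy cluster the pointers are at most a few migrations out of date, so the chase terminates quickly; a hop count in the billions means the pointers form a cycle that never reaches a node pointing to itself.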
You're seeing ridiculously high hop counts, which indicates that CTDB has lost track of a record. The location for the record has been corrupted so this is a bug. We haven't seen a bug like this for many years.
To track down the bug, we need to know what happened when the problem began. For example, we need to know whether there was a database recovery or a vacuuming (garbage collection) run when it began.
If it isn't a recovery bug, but we can't find the root cause, then we might be able to mitigate the effect of the bug by placing a limit on the hop count and either banning a node or triggering a recovery.
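The mitigation described above could be sketched roughly as follows (a hypothetical illustration only; the threshold, function names, and recovery callback are invented, not CTDB APIs):

```python
# Hypothetical hop-count limit: if a request has been chased further than
# any legitimate contention could explain, assume the record location is
# corrupt, stop chasing, and trigger a database recovery instead.

HOPCOUNT_LIMIT = 10_000  # assumed threshold; healthy hop counts rarely reach the 100s

def check_hopcount(hopcount, trigger_recovery):
    """Return True if the request may keep chasing; otherwise force a
    recovery (which rebuilds consistent data-master info) and return False
    so the caller retries the request afterwards."""
    if hopcount <= HOPCOUNT_LIMIT:
        return True
    trigger_recovery()  # e.g. ban the node or start a cluster-wide recovery
    return False

recoveries = []
assert check_hopcount(42, recoveries.append) is True       # normal request
assert check_hopcount(1_434_489_998, lambda: recoveries.append("recover")) is False
print(recoveries)  # ['recover']
```

The design trade-off is picking a limit high enough that genuine contention never trips it, while still catching corruption long before the logs fill the disk.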
Obviously we would like to solve the underlying bug. If this happens again and you have a chance to grab the logs from the start of the issue then that would help us know where to look...
Thanks for the detailed info, much appreciated.
The cluster in question has been known to be under a lot of memory pressure, due to a previous bug with a clustered file system server. There is a fair chance that the system had 100% resident memory usage and 100% swap usage. It was very common for processes on those nodes to run out of memory and be killed off. This happened very regularly, around a couple of times per week.
I wouldn't be surprised if that occurred and caused something like this.
Unfortunately by the time we found the issue, the system had ground to a halt due to there being no OS disk space remaining, so I decided to truncate the logs to get the system functioning again. I'll write something into our monitoring system to get this error picked up sooner, so we can recover the whole file for further analysis.
CTDB doesn't require ACKs for record migrations, so it would be possible for this to happen if ctdbd crashes on a node. This should cause a database recovery, which should fix the problem.
The worst case, and the likely case in this scenario, is where a node is unable to allocate memory to dequeue a packet that happens to contain a migration response, so the packet is dropped.
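As a hedged illustration of how a dropped, unacknowledged migration can corrupt record-location tracking (again a simplified sketch, not CTDB internals), consider the sender updating its data-master pointer before the receiver has actually taken ownership:

```python
# Hypothetical sketch: without ACKs, the sending node records the new
# data master as soon as it sends the migration, even if the response
# packet is dropped (e.g. the destination cannot allocate memory to
# dequeue it) and the destination never takes ownership.

def migrate(nodes, src, dst, key, packet_delivered=True):
    nodes[src]["dmaster"][key] = dst       # src now points at dst...
    if packet_delivered:
        nodes[dst]["dmaster"][key] = dst   # ...and dst takes ownership

nodes = {
    0: {"dmaster": {"k": 0}},  # node 0 currently holds the record
    1: {"dmaster": {"k": 0}},  # node 1's stale pointer still says node 0
}
migrate(nodes, src=0, dst=1, key="k", packet_delivered=False)
# Node 0 now points at node 1 while node 1 still points back at node 0:
# any chase for the record cycles between them forever, and the hop
# count climbs without bound until a recovery rebuilds the pointers.
print(nodes)
```

Requiring an ACK before the sender commits its pointer update would close this window, at the cost of an extra round trip per migration, which is the performance concern mentioned below.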
Yesterday Amitay & I discussed the possibility of requiring ACKs for record migrations to avoid this type of issue. However, we need to be careful to make sure we don't destroy performance, so this seems like a longer term consideration.
I'll put this back into "NEEDINFO" state in case you manage to recreate the problem.
Thanks for looking into this issue.
I've not seen this issue occurring on any of our other clusters over the last few years, even though some of them suffer the same OOM conditions & memory leak bug.
I'll update this ticket if I see it again.