Bug 7300 - Node stays UNHEALTHY cpu usage 95%
Status: RESOLVED INVALID
Product: CTDB 2.5.x or older
Classification: Unclassified
Component: ctdb
Version: unspecified
Hardware: IA64 Linux
Importance: P3 regression
Target Milestone: ---
Assigned To: Michael Adam
QA Contact: Samba QA Contact
URL: https://answers.launchpad.net/ubuntu/...
Reported: 2010-03-26 12:32 UTC by ITec
Modified: 2016-08-13 09:52 UTC
CC: 3 users



Attachments
Outputs from ps faxl; top; ctdb status when the cpu is at 96.5% (35.51 KB, text/plain), 2010-03-31 08:25 UTC, ITec
Outputs from ps faxl; top; ctdb status when the cpu is at 0% (34.53 KB, text/plain), 2010-03-31 08:27 UTC, ITec
/var/log/log.ctdb on node 1 (41.96 KB, text/plain), 2010-03-31 08:28 UTC, ITec
/var/log/log.ctdb on node 2 (25.68 KB, text/plain), 2010-03-31 08:29 UTC, ITec

Description ITec 2010-03-26 12:32:06 UTC
After starting ctdbd, the node stays UNHEALTHY and CPU usage periodically rises to 95%.

Versions:
Ubuntu 10.04
ctdb 1.0.108-3ubuntu3
CPU: AMD64

/etc/default/ctdb:
CTDB_RECOVERY_LOCK="/GFS5/.ctdb_recovery_lock"
CTDB_DEBUGLEVEL=ERR

/etc/ctdb/nodes:
192.168.224.221

/etc/ctdb/public_addresses:
192.168.6.245/32 eth1
192.168.1.245/32 eth2

ps faxl:
5 0 11068 1 -2 - 82804 1064 ep_pol Ss ? 2:50 /usr/sbin/ctdbd --reclock=/GFS5/.ctdb_recovery_lock -d ERR
1 0 11069 11068 -2 - 82804 236 hrtime S ? 0:00 \_ /usr/sbin/ctdbd --reclock=/GFS5/.ctdb_recovery_lock -d ERR
1 0 11070 11068 -2 - 82804 236 hrtime S ? 0:00 \_ /usr/sbin/ctdbd --reclock=/GFS5/.ctdb_recovery_lock -d ERR
1 0 11071 11068 -2 - 82804 236 hrtime S ? 0:00 \_ /usr/sbin/ctdbd --reclock=/GFS5/.ctdb_recovery_lock -d ERR
5 0 11072 11068 20 0 82804 712 ep_pol S ? 0:00 \_ /usr/sbin/ctdbd --reclock=/GFS5/.ctdb_recovery_lock -d ERR

/var/log/log.ctdb:
2010/03/26 15:53:59.134494 [11068]: Starting CTDBD as pid : 11068
2010/03/26 15:54:00.144763 [11068]: Freeze priority 1
2010/03/26 15:54:00.144898 [11068]: Freeze priority 2
2010/03/26 15:54:00.144992 [11068]: Freeze priority 3
2010/03/26 15:54:04.163466 [11072]: Taking out recovery lock from recovery daemon
2010/03/26 15:54:04.163668 [11072]: Take the recovery lock
2010/03/26 15:54:04.164430 [11072]: Recovery lock taken successfully
2010/03/26 15:54:04.164560 [11072]: Recovery lock taken successfully by recovery daemon
2010/03/26 15:54:04.164699 [11068]: Freeze priority 1
2010/03/26 15:54:04.164807 [11068]: Freeze priority 2
2010/03/26 15:54:04.164900 [11068]: Freeze priority 3
2010/03/26 15:54:24.553645 [11072]: client/ctdb_client.c:771 control timed out. reqid:46 opcode:70 dstnode:0
2010/03/26 15:54:24.553889 [11072]: client/ctdb_client.c:882 ctdb_control_recv failed
2010/03/26 15:54:24.553908 [11072]: Async operation failed with state 3, opcode:70
2010/03/26 15:54:24.553939 [11072]: server/ctdb_recoverd.c:178 Node 0 failed the startrecovery event. Setting it as recovery fail culprit
2010/03/26 15:54:24.553977 [11072]: Async wait failed - fail_count=1
2010/03/26 15:54:24.553992 [11072]: server/ctdb_recoverd.c:202 Unable to run the 'startrecovery' event. Recovery failed.
2010/03/26 15:54:24.554019 [11072]: server/ctdb_recoverd.c:1372 Unable to run the 'startrecovery' event on cluster
2010/03/26 15:54:34.166958 [11068]: Event script timed out : startrecovery count : 0 pid : 11083
2010/03/26 15:54:34.167166 [11068]: server/eventscript.c:508 Sending SIGTERM to child pid:11083
2010/03/26 15:54:34.167217 [11068]: server/ctdb_recover.c:997 startrecovery event script failed (status -62)
2010/03/26 15:54:34.170867 [11083]: Timed out running script '/etc/ctdb/events.d/01.reclock startrecovery ' after 30.0 seconds pid :11083
2010/03/26 15:54:34.207295 [11083]: Logged timedout eventscript : { pstree -p; cat /proc/locks; ls -li /var/ctdb/ /var/ctdb/persistent; } >/tmp/ctdb.event.20100326155434.11083
2010/03/26 15:54:34.208841 [11072]: Dropped orphaned reply control with reqid:46
2010/03/26 15:54:34.209332 [11072]: Taking out recovery lock from recovery daemon
2010/03/26 15:54:34.209353 [11072]: Take the recovery lock
2010/03/26 15:54:34.210009 [11072]: Recovery lock taken successfully
2010/03/26 15:54:34.210113 [11072]: Recovery lock taken successfully by recovery daemon
2010/03/26 15:54:34.210285 [11068]: Freeze priority 1
2010/03/26 15:54:34.210386 [11068]: Freeze priority 2
2010/03/26 15:54:34.210478 [11068]: Freeze priority 3
2010/03/26 15:54:54.554312 [11072]: client/ctdb_client.c:771 control timed out. reqid:67 opcode:70 dstnode:0
2010/03/26 15:54:54.554459 [11072]: client/ctdb_client.c:882 ctdb_control_recv failed
2010/03/26 15:54:54.554477 [11072]: Async operation failed with state 3, opcode:70
2010/03/26 15:54:54.554489 [11072]: server/ctdb_recoverd.c:178 Node 0 failed the startrecovery event. Setting it as recovery fail culprit
2010/03/26 15:54:54.554535 [11072]: Async wait failed - fail_count=1
2010/03/26 15:54:54.554550 [11072]: server/ctdb_recoverd.c:202 Unable to run the 'startrecovery' event. Recovery failed.
2010/03/26 15:54:54.554562 [11072]: server/ctdb_recoverd.c:1372 Unable to run the 'startrecovery' event on cluster
2010/03/26 15:55:04.211688 [11068]: Event script timed out : startrecovery count : 0 pid : 11099
2010/03/26 15:55:04.211833 [11068]: server/eventscript.c:508 Sending SIGTERM to child pid:11099
2010/03/26 15:55:04.211861 [11068]: server/ctdb_recover.c:997 startrecovery event script failed (status -62)
2010/03/26 15:55:04.215978 [11099]: Timed out running script '/etc/ctdb/events.d/01.reclock startrecovery ' after 30.0 seconds pid :11099
2010/03/26 15:55:04.251221 [11099]: Logged timedout eventscript : { pstree -p; cat /proc/locks; ls -li /var/ctdb/ /var/ctdb/persistent; } >/tmp/ctdb.event.20100326155504.11099
2010/03/26 15:55:04.254282 [11072]: Dropped orphaned reply control with reqid:67
2010/03/26 15:55:04.254851 [11068]: Banning this node for 300 seconds
2010/03/26 15:55:04.254929 [11072]: Taking out recovery lock from recovery daemon
2010/03/26 15:55:04.254983 [11072]: Take the recovery lock
2010/03/26 15:55:04.255687 [11072]: Recovery lock taken successfully
2010/03/26 15:55:04.255787 [11072]: Recovery lock taken successfully by recovery daemon
2010/03/26 15:55:04.255921 [11068]: Freeze priority 1
2010/03/26 15:55:04.256019 [11068]: Freeze priority 2
2010/03/26 15:55:04.256111 [11068]: Freeze priority 3
2010/03/26 15:55:24.554764 [11072]: client/ctdb_client.c:771 control timed out. reqid:89 opcode:70 dstnode:0
2010/03/26 15:55:24.554898 [11072]: client/ctdb_client.c:882 ctdb_control_recv failed
2010/03/26 15:55:24.554913 [11072]: Async operation failed with state 3, opcode:70
2010/03/26 15:55:24.554930 [11072]: server/ctdb_recoverd.c:178 Node 0 failed the startrecovery event. Setting it as recovery fail culprit
2010/03/26 15:55:24.555003 [11072]: Async wait failed - fail_count=1
2010/03/26 15:55:24.555019 [11072]: server/ctdb_recoverd.c:202 Unable to run the 'startrecovery' event. Recovery failed.
2010/03/26 15:55:24.555030 [11072]: server/ctdb_recoverd.c:1372 Unable to run the 'startrecovery' event on cluster
2010/03/26 15:55:34.257258 [11068]: Event script timed out : startrecovery count : 0 pid : 11118
2010/03/26 15:55:34.257403 [11068]: server/eventscript.c:508 Sending SIGTERM to child pid:11118
2010/03/26 15:55:34.257469 [11068]: server/ctdb_recover.c:997 startrecovery event script failed (status -62)
2010/03/26 15:55:34.261315 [11118]: Timed out running script '/etc/ctdb/events.d/01.reclock startrecovery ' after 30.0 seconds pid :11118
2010/03/26 15:55:34.296027 [11118]: Logged timedout eventscript : { pstree -p; cat /proc/locks; ls -li /var/ctdb/ /var/ctdb/persistent; } >/tmp/ctdb.event.20100326155534.11118
2010/03/26 15:55:34.297504 [11072]: Dropped orphaned reply control with reqid:89
2010/03/26 15:56:04.262626 [11068]: server/ctdb_recover.c:631 Been in recovery mode for too long. Dropping all IPS

After that, CPU usage goes back to 0%.
After a while, the CPU climbs to 95% again.

When the CPU is at 95%:
ctdb status
Number of nodes:1
pnn:0 192.168.224.221 UNHEALTHY (THIS NODE)
Generation:INVALID
Size:1
hash:0 lmaster:0
Recovery mode:RECOVERY (1)
Recovery master:0

When the CPU is at 0%:
ctdb status
Number of nodes:1
pnn:0 192.168.224.221 BANNED|UNHEALTHY (THIS NODE)
Generation:INVALID
Size:1
hash:0 lmaster:0
Recovery mode:RECOVERY (1)
Recovery master:0

/GFS5 is a GFS2 filesystem
RHCS is running and quorate

/GFS5/ping_pong/ping_pong /GFS5/ping_pong/test.dta 3
22850 locks/sec

What's wrong?
Comment 1 ITec 2010-03-31 08:25:57 UTC
Created attachment 5576 [details]
Outputs from ps faxl; top; ctdb status when cpu is 96,5%
Comment 2 ITec 2010-03-31 08:27:31 UTC
Created attachment 5577 [details]
Outputs from ps faxl; top; ctdb status when cpu is on 0%
Comment 3 ITec 2010-03-31 08:28:40 UTC
Created attachment 5578 [details]
/var/log/log.ctdb on node 1
Comment 4 ITec 2010-03-31 08:29:15 UTC
Created attachment 5579 [details]
/var/log/log.ctdb on node 2
Comment 5 ITec 2010-03-31 08:30:26 UTC
Problem not solved yet! :-(

Now I have tried adding a second node, to see whether it is a problem with the single-node configuration. It is not.
Here are the changes:

/etc/ctdb/nodes:
192.168.224.221
192.168.224.222

/GFS5 is a GFS2 filesystem mounted on both nodes
RHCS is running and quorate

onnode all /etc/init.d/ctdb start

>> NODE: 192.168.224.221 <<
root@192.168.224.221's password:
 * Starting Clustered TDB  ctdb
[: 364: yes: unexpected operator
   ...done.

>> NODE: 192.168.224.222 <<
root@192.168.224.222's password:
 * Starting Clustered TDB  ctdb
[: 364: yes: unexpected operator
   ...done.

Any ideas?
Comment 6 Michael Adam 2010-03-31 08:42:07 UTC
Hi, thanks for your bug report!
Looking...
Comment 7 Michael Adam 2010-04-06 15:33:48 UTC
(In reply to comment #5)
> Problem not solved, yet! :-(
> 
> Now I tried to add a second node, to see if it is a problem
> of the single node configuration. But it is not.
> Here are the changes:
> 
> /etc/ctdb/nodes:
> 192.168.224.221
> 192.168.224.222
> 
> /GFS5 is a GFS2 filesystem mounted on both nodes
> RHCS is running and quorate
> 
> onnode all /etc/init.d/ctdb start
> 
> >> NODE: 192.168.224.221 <<
> root@192.168.224.221's password:
>  * Starting Clustered TDB  ctdb
> [: 364: yes: unexpected operator
>    ...done.
> 
> >> NODE: 192.168.224.222 <<
> root@192.168.224.222's password:
>  * Starting Clustered TDB  ctdb
> [: 364: yes: unexpected operator
>    ...done.
> 
> Any ideas?

I suspect there is some kind of misconfiguration.
Could you please post your complete ctdb configuration,
i.e. /etc/default/ctdb, /etc/ctdb/nodes, and /etc/ctdb/public_addresses?

Or is the content posted in the bug description all there is?

When adding a new node, the nodes file and the /etc/default/ctdb
file should basically be the same on all nodes.
Note that in order to activate the public addresses file, you
should add a line like "CTDB_PUBLIC_ADDRESSES=/etc/ctdb/public_addresses"
to your /etc/default/ctdb file.
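A minimal /etc/default/ctdb along those lines would combine the settings from the description with the suggested public-addresses line (a sketch, not a complete reference configuration):

```shell
# /etc/default/ctdb -- minimal sketch based on the settings in this report
CTDB_RECOVERY_LOCK="/GFS5/.ctdb_recovery_lock"
# The suggested addition: without it the public_addresses file
# is not read and no public IPs are assigned.
CTDB_PUBLIC_ADDRESSES=/etc/ctdb/public_addresses
CTDB_DEBUGLEVEL=ERR
```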

The init script seems to produce an error.
Could you please attach it to this bug report?
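The "[: 364: yes: unexpected operator" message is the classic symptom of an unquoted variable inside a POSIX `[` test when the script runs under dash (Ubuntu's /bin/sh) rather than bash: an empty variable collapses the test to `[ = yes ]`. A minimal sketch of the failure mode and the usual fix, with a made-up variable name:

```shell
#!/bin/sh
# Hypothetical illustration of the dash quoting failure behind
# "[: 364: yes: unexpected operator" -- the variable name is made up.

CTDB_MANAGES_SAMBA=""   # unset or empty in a minimal config

# Broken under dash when the variable is empty: the test expands to
# `[ = yes ]`, which dash reports as "[: yes: unexpected operator":
#   if [ $CTDB_MANAGES_SAMBA = yes ]; then ...

# Portable fix: quote the expansion so an empty value still yields
# a valid two-argument comparison.
if [ "$CTDB_MANAGES_SAMBA" = "yes" ]; then
    echo "managing samba"
else
    echo "not managing samba"
fi
```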

It is generally a good idea to set up password-less
ssh logins between the nodes.

In the log you can see that sending ctdb control 70
(the start recovery control) times out over and over again.
It would be useful to see the logs at a higher debug
level, say with "CTDB_DEBUGLEVEL=DEBUG".

It should suffice to start only one node at first
(even with a two-node config). This one node should
become healthy.

One other idea: you might try without the
CTDB_RECOVERY_LOCK setting in the config file.
Newer versions of ctdb can start without a recovery
lock file on clustered storage. They will warn about
running without split-brain prevention, but it should
work. This could eliminate the GFS2 filesystem as a
possible source of trouble.
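The suggested experiment amounts to commenting out the recovery lock in /etc/default/ctdb (a sketch; this disables split-brain prevention, so it is for diagnosis only):

```shell
# /etc/default/ctdb -- diagnostic sketch: run without the recovery lock
# to rule out GFS2 fcntl locking as the cause of the hangs.
# WARNING: no split-brain prevention while this line is commented out.
#CTDB_RECOVERY_LOCK="/GFS5/.ctdb_recovery_lock"
CTDB_DEBUGLEVEL=ERR
```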

You could also try to get hold of the newer version 1.0.112
of ctdb and try that instead of 1.0.108.
Both versions are rather stable, though.

Cheers - Michael
Comment 8 Martin Schwenke 2016-08-13 09:52:33 UTC
Impossible to tell what's happening here, especially without knowing configuration.  One interesting thing is that 01.reclock has never implemented the "startrecovery" event, so it should be a no-op that just loads the configuration and does nothing.
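CTDB event scripts receive the event name as their first argument and dispatch on it; an event a script does not handle simply falls through, which is why an unimplemented "startrecovery" should be a no-op. A minimal sketch of that dispatch pattern (hypothetical; not the actual 01.reclock source):

```shell
#!/bin/sh
# Hypothetical sketch of the ctdb events.d dispatch pattern; the real
# 01.reclock never implemented "startrecovery", so that event should
# fall through to the default case and do nothing.
handle_event() {
    case "$1" in
        init)
            echo "checking recovery lock" ;;
        *)
            # unimplemented events (e.g. startrecovery) are no-ops
            echo "no-op for event '$1'" ;;
    esac
}

handle_event startrecovery
```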

We'll never find out, after all these years.  Marking as invalid because we don't have enough information to determine if this was a real bug.