After starting ctdbd, the node stays unhealthy and CPU usage periodically rises up to 95%.

Versions: Ubuntu 10.04, ctdb 1.0.108-3ubuntu3, AMD64.

/etc/default/ctdb:
CTDB_RECOVERY_LOCK="/GFS5/.ctdb_recovery_lock"
CTDB_DEBUGLEVEL=ERR

/etc/ctdb/nodes:
192.168.224.221

/etc/ctdb/public_addresses:
192.168.6.245/32 eth1
192.168.1.245/32 eth2

ps faxl:
5 0 11068     1 -2 - 82804 1064 ep_pol Ss ? 2:50 /usr/sbin/ctdbd --reclock=/GFS5/.ctdb_recovery_lock -d ERR
1 0 11069 11068 -2 - 82804  236 hrtime S  ? 0:00  \_ /usr/sbin/ctdbd --reclock=/GFS5/.ctdb_recovery_lock -d ERR
1 0 11070 11068 -2 - 82804  236 hrtime S  ? 0:00  \_ /usr/sbin/ctdbd --reclock=/GFS5/.ctdb_recovery_lock -d ERR
1 0 11071 11068 -2 - 82804  236 hrtime S  ? 0:00  \_ /usr/sbin/ctdbd --reclock=/GFS5/.ctdb_recovery_lock -d ERR
5 0 11072 11068 20 0 82804  712 ep_pol S  ? 0:00  \_ /usr/sbin/ctdbd --reclock=/GFS5/.ctdb_recovery_lock -d ERR

/var/log/log.ctdb:
2010/03/26 15:53:59.134494 [11068]: Starting CTDBD as pid : 11068
2010/03/26 15:54:00.144763 [11068]: Freeze priority 1
2010/03/26 15:54:00.144898 [11068]: Freeze priority 2
2010/03/26 15:54:00.144992 [11068]: Freeze priority 3
2010/03/26 15:54:04.163466 [11072]: Taking out recovery lock from recovery daemon
2010/03/26 15:54:04.163668 [11072]: Take the recovery lock
2010/03/26 15:54:04.164430 [11072]: Recovery lock taken successfully
2010/03/26 15:54:04.164560 [11072]: Recovery lock taken successfully by recovery daemon
2010/03/26 15:54:04.164699 [11068]: Freeze priority 1
2010/03/26 15:54:04.164807 [11068]: Freeze priority 2
2010/03/26 15:54:04.164900 [11068]: Freeze priority 3
2010/03/26 15:54:24.553645 [11072]: client/ctdb_client.c:771 control timed out. reqid:46 opcode:70 dstnode:0
2010/03/26 15:54:24.553889 [11072]: client/ctdb_client.c:882 ctdb_control_recv failed
2010/03/26 15:54:24.553908 [11072]: Async operation failed with state 3, opcode:70
2010/03/26 15:54:24.553939 [11072]: server/ctdb_recoverd.c:178 Node 0 failed the startrecovery event. Setting it as recovery fail culprit
2010/03/26 15:54:24.553977 [11072]: Async wait failed - fail_count=1
2010/03/26 15:54:24.553992 [11072]: server/ctdb_recoverd.c:202 Unable to run the 'startrecovery' event. Recovery failed.
2010/03/26 15:54:24.554019 [11072]: server/ctdb_recoverd.c:1372 Unable to run the 'startrecovery' event on cluster
2010/03/26 15:54:34.166958 [11068]: Event script timed out : startrecovery count : 0 pid : 11083
2010/03/26 15:54:34.167166 [11068]: server/eventscript.c:508 Sending SIGTERM to child pid:11083
2010/03/26 15:54:34.167217 [11068]: server/ctdb_recover.c:997 startrecovery event script failed (status -62)
2010/03/26 15:54:34.170867 [11083]: Timed out running script '/etc/ctdb/events.d/01.reclock startrecovery ' after 30.0 seconds pid :11083
2010/03/26 15:54:34.207295 [11083]: Logged timedout eventscript : { pstree -p; cat /proc/locks; ls -li /var/ctdb/ /var/ctdb/persistent; } >/tmp/ctdb.event.20100326155434.11083
2010/03/26 15:54:34.208841 [11072]: Dropped orphaned reply control with reqid:46
2010/03/26 15:54:34.209332 [11072]: Taking out recovery lock from recovery daemon
2010/03/26 15:54:34.209353 [11072]: Take the recovery lock
2010/03/26 15:54:34.210009 [11072]: Recovery lock taken successfully
2010/03/26 15:54:34.210113 [11072]: Recovery lock taken successfully by recovery daemon
2010/03/26 15:54:34.210285 [11068]: Freeze priority 1
2010/03/26 15:54:34.210386 [11068]: Freeze priority 2
2010/03/26 15:54:34.210478 [11068]: Freeze priority 3
2010/03/26 15:54:54.554312 [11072]: client/ctdb_client.c:771 control timed out. reqid:67 opcode:70 dstnode:0
2010/03/26 15:54:54.554459 [11072]: client/ctdb_client.c:882 ctdb_control_recv failed
2010/03/26 15:54:54.554477 [11072]: Async operation failed with state 3, opcode:70
2010/03/26 15:54:54.554489 [11072]: server/ctdb_recoverd.c:178 Node 0 failed the startrecovery event. Setting it as recovery fail culprit
2010/03/26 15:54:54.554535 [11072]: Async wait failed - fail_count=1
2010/03/26 15:54:54.554550 [11072]: server/ctdb_recoverd.c:202 Unable to run the 'startrecovery' event. Recovery failed.
2010/03/26 15:54:54.554562 [11072]: server/ctdb_recoverd.c:1372 Unable to run the 'startrecovery' event on cluster
2010/03/26 15:55:04.211688 [11068]: Event script timed out : startrecovery count : 0 pid : 11099
2010/03/26 15:55:04.211833 [11068]: server/eventscript.c:508 Sending SIGTERM to child pid:11099
2010/03/26 15:55:04.211861 [11068]: server/ctdb_recover.c:997 startrecovery event script failed (status -62)
2010/03/26 15:55:04.215978 [11099]: Timed out running script '/etc/ctdb/events.d/01.reclock startrecovery ' after 30.0 seconds pid :11099
2010/03/26 15:55:04.251221 [11099]: Logged timedout eventscript : { pstree -p; cat /proc/locks; ls -li /var/ctdb/ /var/ctdb/persistent; } >/tmp/ctdb.event.20100326155504.11099
2010/03/26 15:55:04.254282 [11072]: Dropped orphaned reply control with reqid:67
2010/03/26 15:55:04.254851 [11068]: Banning this node for 300 seconds
2010/03/26 15:55:04.254929 [11072]: Taking out recovery lock from recovery daemon
2010/03/26 15:55:04.254983 [11072]: Take the recovery lock
2010/03/26 15:55:04.255687 [11072]: Recovery lock taken successfully
2010/03/26 15:55:04.255787 [11072]: Recovery lock taken successfully by recovery daemon
2010/03/26 15:55:04.255921 [11068]: Freeze priority 1
2010/03/26 15:55:04.256019 [11068]: Freeze priority 2
2010/03/26 15:55:04.256111 [11068]: Freeze priority 3
2010/03/26 15:55:24.554764 [11072]: client/ctdb_client.c:771 control timed out. reqid:89 opcode:70 dstnode:0
2010/03/26 15:55:24.554898 [11072]: client/ctdb_client.c:882 ctdb_control_recv failed
2010/03/26 15:55:24.554913 [11072]: Async operation failed with state 3, opcode:70
2010/03/26 15:55:24.554930 [11072]: server/ctdb_recoverd.c:178 Node 0 failed the startrecovery event. Setting it as recovery fail culprit
2010/03/26 15:55:24.555003 [11072]: Async wait failed - fail_count=1
2010/03/26 15:55:24.555019 [11072]: server/ctdb_recoverd.c:202 Unable to run the 'startrecovery' event. Recovery failed.
2010/03/26 15:55:24.555030 [11072]: server/ctdb_recoverd.c:1372 Unable to run the 'startrecovery' event on cluster
2010/03/26 15:55:34.257258 [11068]: Event script timed out : startrecovery count : 0 pid : 11118
2010/03/26 15:55:34.257403 [11068]: server/eventscript.c:508 Sending SIGTERM to child pid:11118
2010/03/26 15:55:34.257469 [11068]: server/ctdb_recover.c:997 startrecovery event script failed (status -62)
2010/03/26 15:55:34.261315 [11118]: Timed out running script '/etc/ctdb/events.d/01.reclock startrecovery ' after 30.0 seconds pid :11118
2010/03/26 15:55:34.296027 [11118]: Logged timedout eventscript : { pstree -p; cat /proc/locks; ls -li /var/ctdb/ /var/ctdb/persistent; } >/tmp/ctdb.event.20100326155534.11118
2010/03/26 15:55:34.297504 [11072]: Dropped orphaned reply control with reqid:89
2010/03/26 15:56:04.262626 [11068]: server/ctdb_recover.c:631 Been in recovery mode for too long. Dropping all IPS

After that, CPU usage goes back to 0%. After a while it rises to 95% again.

When the CPU is at 95%:
ctdb status
Number of nodes:1
pnn:0 192.168.224.221 UNHEALTHY (THIS NODE)
Generation:INVALID
Size:1
hash:0 lmaster:0
Recovery mode:RECOVERY (1)
Recovery master:0

When the CPU is at 0%:
ctdb status
Number of nodes:1
pnn:0 192.168.224.221 BANNED|UNHEALTHY (THIS NODE)
Generation:INVALID
Size:1
hash:0 lmaster:0
Recovery mode:RECOVERY (1)
Recovery master:0

/GFS5 is a GFS2 filesystem.
RHCS is running and quorate.

/GFS5/ping_pong/ping_pong /GFS5/ping_pong/test.dta 3
22850 locks/sec

What's wrong?
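One way to narrow this down (a diagnostic sketch, not taken from the report) is to run the event script that keeps timing out by hand and to look at the event-script timeout tunable:

# Run the script outside of ctdbd with tracing; CTDB_BASE may need to be set
# so the script can find its helper functions. If it also hangs here, the
# problem is in the script or the file system, not in ctdbd itself.
CTDB_BASE=/etc/ctdb sh -x /etc/ctdb/events.d/01.reclock startrecovery

# The 30-second limit in the log is ctdb's EventScriptTimeout tunable; it can
# be raised at runtime to see whether the script ever completes:
ctdb getvar EventScriptTimeout
ctdb setvar EventScriptTimeout 120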
Created attachment 5576 [details] Outputs from ps faxl; top; ctdb status when CPU is at 96.5%
Created attachment 5577 [details] Outputs from ps faxl; top; ctdb status when CPU is at 0%
Created attachment 5578 [details] /var/log/log.ctdb on node 1
Created attachment 5579 [details] /var/log/log.ctdb on node 2
Problem not solved yet! :-(

Now I tried to add a second node, to see if it is a problem of the single-node configuration. But it is not. Here are the changes:

/etc/ctdb/nodes:
192.168.224.221
192.168.224.222

/GFS5 is a GFS2 filesystem mounted on both nodes.
RHCS is running and quorate.

onnode all /etc/init.d/ctdb start

>> NODE: 192.168.224.221 <<
root@192.168.224.221's password:
 * Starting Clustered TDB ctdb
[: 364: yes: unexpected operator
   ...done.

>> NODE: 192.168.224.222 <<
root@192.168.224.222's password:
 * Starting Clustered TDB ctdb
[: 364: yes: unexpected operator
   ...done.

Any ideas?
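For what it's worth, the "[: 364: yes: unexpected operator" line looks like a shell portability problem in the init script rather than a ctdb problem. A minimal sketch of the likely cause, assuming the script runs under dash (Ubuntu's /bin/sh) and uses the bash-only "==" operator somewhere around line 364 (the variable name below is made up for illustration):

# dash's built-in "[" does not accept "==", so a test like this prints an
# error of the form "[: <line>: yes: unexpected operator", the test returns
# an error status, and the branch is not taken:
if [ "$CTDB_SOME_OPTION" == "yes" ]; then    # bashism; breaks under dash
    echo "option enabled"
fi

# The POSIX spelling with a single "=" works under both dash and bash:
if [ "$CTDB_SOME_OPTION" = "yes" ]; then
    echo "option enabled"
fi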
Hi, thanks for your bug report! Looking...
(In reply to comment #5)
> Problem not solved yet! :-(
>
> Now I tried to add a second node, to see if it is a problem
> of the single-node configuration. But it is not.
> Here are the changes:
>
> /etc/ctdb/nodes:
> 192.168.224.221
> 192.168.224.222
>
> /GFS5 is a GFS2 filesystem mounted on both nodes.
> RHCS is running and quorate.
>
> onnode all /etc/init.d/ctdb start
>
> >> NODE: 192.168.224.221 <<
> root@192.168.224.221's password:
>  * Starting Clustered TDB ctdb
> [: 364: yes: unexpected operator
>    ...done.
>
> >> NODE: 192.168.224.222 <<
> root@192.168.224.222's password:
>  * Starting Clustered TDB ctdb
> [: 364: yes: unexpected operator
>    ...done.
>
> Any ideas?

I suspect there is some kind of misconfiguration. Could you please post your complete ctdb configuration, i.e. /etc/default/ctdb, /etc/ctdb/nodes, and /etc/ctdb/public_addresses? Or is the content posted in the bug description all there is?

When adding a new node, the nodes file and the /etc/default/ctdb file should basically be the same on all nodes. Note that in order to activate the public addresses file, you should add a line like "CTDB_PUBLIC_ADDRESSES=/etc/ctdb/public_addresses" to your /etc/default/ctdb file.

The init script seems to produce an error. Could you please attach it to this bug report?

It is generally good to set up password-less ssh logins between the nodes.

In the log it can be seen that sending ctdb control 70 (the start recovery control) times out over and over again. It would be useful to see the logs at a higher debug level than ERR, e.g. CTDB_DEBUGLEVEL=NOTICE.

It should suffice to start only one node at first (even with a two-node config). This one node should become healthy.

One other idea: you might try without the CTDB_RECOVERY_LOCK setting in the config file. Newer versions of ctdb can start without a recovery lock file on clustered storage. ctdb will warn about running without split-brain prevention, but it should work. This could eliminate the GFS2 file system as a possible source of trouble.

You could also try to get hold of a newer version of ctdb, 1.0.112, and try that instead of 1.0.108. Both versions are rather stable, though.

Cheers - Michael
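To make that concrete, a minimal /etc/default/ctdb along the lines suggested above could look like this (a sketch only, not a tested configuration; the same file should be used on both nodes, and the reclock line is commented out to rule out GFS2):

# /etc/default/ctdb (sketch)

# Recovery lock on shared storage; leave it commented out at first to rule
# out the GFS2 file system (ctdb will warn about missing split-brain
# prevention):
#CTDB_RECOVERY_LOCK="/GFS5/.ctdb_recovery_lock"

# List of all cluster nodes (identical on every node):
CTDB_NODES=/etc/ctdb/nodes

# Required to activate the public addresses file:
CTDB_PUBLIC_ADDRESSES=/etc/ctdb/public_addresses

# More verbose logging while debugging:
CTDB_DEBUGLEVEL=NOTICE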
Impossible to tell what's happening here, especially without knowing the configuration. One interesting thing is that 01.reclock has never implemented the "startrecovery" event, so it should be a no-op that just loads the configuration and does nothing. We'll never find out, after all these years. Marking as invalid because we don't have enough information to determine whether this was a real bug.
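For context, a typical events.d script from that era has roughly the shape sketched below (illustrative only, not the exact 01.reclock shipped in 1.0.108): events the script does not handle fall through to a no-op, so a 30-second hang in "01.reclock startrecovery" would most plausibly come from the configuration-loading step or the shell environment, not from recovery-lock handling itself.

#!/bin/sh
# Illustrative skeleton of a ctdb events.d script; the helper names follow
# the usual pattern but are not copied from the 1.0.108 sources.
. "${CTDB_BASE:-/etc/ctdb}/functions"   # shared helper functions
loadconfig ctdb                         # reads /etc/default/ctdb on Ubuntu

case "$1" in
    monitor)
        # a real script would run its periodic checks here
        ;;
    *)
        # events such as "startrecovery" are not handled: do nothing
        ;;
esac

exit 0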