Bug 8080 - ctdb client requests fail while the system is under heavy load
Summary: ctdb client requests fail while the system is under heavy load
Status: RESOLVED FIXED
Alias: None
Product: CTDB 2.5.x or older
Classification: Unclassified
Component: ctdb (show other bugs)
Version: unspecified
Hardware: All All
: P5 normal
Target Milestone: ---
Assignee: Michael Adam
QA Contact: Samba QA Contact
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2011-04-12 12:19 UTC by David Disseldorp
Modified: 2011-06-23 13:59 UTC (History)
3 users (show)

See Also:


Attachments
client-handle-transient-connection-errors.patch (2.40 KB, patch)
2011-04-12 12:19 UTC, David Disseldorp
no flags Details
ctdb-client-connect-retry.patch (2.45 KB, patch)
2011-06-22 16:46 UTC, David Disseldorp
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description David Disseldorp 2011-04-12 12:19:56 UTC
Created attachment 6401 [details]
client-handle-transient-connection-errors.patch

The CTDB Linux-HA/Pacemaker resource agent uses the ctdb client utility
to monitor the status of ctdbd on a given node, it does this using the
"ctdb ping" command.

Under heavy load connections to the ctdbd unix domain socket may fail
with EAGAIN, subsequently the error is immediately returned to the
calling shell.

The following changes since commit b4cb9a7163251828a1fd69a6ebb3ee370d7643e8:

  vacuum: close fd leak (2010-08-05 10:43:48 +1000)

are available in the git repository at:
  git://oss.sgi.com/ddiss/ctdb 112b-client-conn-retry-rb1

David Disseldorp (1):
      client: handle transient connection errors

 client/ctdb_client.c |   32 ++++++++++++++++++++++++++++----
 1 files changed, 28 insertions(+), 4 deletions(-)
Comment 1 Michael Adam 2011-05-02 08:40:20 UTC
Hi Ronnie,

this looks reasonable to me. Could you review it?

Cheers - Michael
Comment 2 Michael Adam 2011-06-22 12:47:03 UTC
Hi David,

sorry for the long delay...

I am taking back the bug, since Ronnies seems to
be unaware of bugzilla.samba.org ... :-P

The patch looks reasonable, but a couple of comments after looking over it together with Metze (mostly stylistic / cosmetic):

* could you rename "nsecs" ? (I just sounds like nano secons ...)
  e.g. to seconds
* could you use  "seconds *= 2" or "seconds = seconds * 2" instead of 
  "seconds <<= 1" for clarity?...
* add braces:
  if ((ret == 0) || (errno != EAGAIN)) {
    break;
  }
* maybe spare an indentation by using
  if (seconds > CONNECT_MAXSLEE/2) {
    break; 
  }
* don't throw away the errno information from the debug message
  further down
  (this is not cosmetic!)
* close the socket _after_ the debug statement
  (to log the correct errno)
  (not cosmetic either!)

If you adapt the patch this way, I'll be happy to push it! :-)

Cheers - Michael
Comment 3 David Disseldorp 2011-06-22 15:53:14 UTC
(In reply to comment #2)
> Hi David,
> 
> sorry for the long delay...

No problem, thanks for the review :)

... 
> If you adapt the patch this way, I'll be happy to push it! :-)

I've pushed a commit with the above suggestions (all except the spared indentation) to:

The following changes since commit 14a1bf4fff4ce4a48704d09324e4757fbf043f92:

  Version 1.0.114.2: Start a new series of versions 1.0.114.X based on 1.0.114 (2011-05-20 16:21:04 +0200)

are available in the git repository at:
  git://oss.sgi.com/ddiss/ctdb 114-client-conn-retry-rb2

David Disseldorp (1):
      client: handle transient connection errors

 client/ctdb_client.c |   35 ++++++++++++++++++++++++++++++-----
 1 files changed, 30 insertions(+), 5 deletions(-)
Comment 4 David Disseldorp 2011-06-22 16:46:05 UTC
Created attachment 6609 [details]
ctdb-client-connect-retry.patch
Comment 5 Michael Adam 2011-06-23 13:59:07 UTC
Pushed to master, thanks!

I will add it to the 1.0.114 branch.
I think Ronnie should also add it to the 1.2 branch...