8319 – Intermittent "Invalid packet of length 0" errors in recoverd

Bug 8319 - Intermittent "Invalid packet of length 0" errors in recoverd

Summary: Intermittent "Invalid packet of length 0" errors in recoverd

Status:	RESOLVED FIXED

Alias:	None

Product:	CTDB 2.5.x or older
Classification:	Unclassified
Component:	ctdb (show other bugs)
Version:	unspecified
Hardware:	All All

Importance:	P5 normal
Target Milestone:	---
Assignee:	Michael Adam
QA Contact:	Samba QA Contact

URL:
Keywords:

Depends on:
Blocks:

Reported:	2011-07-26 13:43 UTC by David Disseldorp
Modified:	2011-08-05 08:39 UTC (History)
CC List:	2 users (show)

See Also:

Attachments
ctdb log with dumped buffers for bad packets (412.92 KB, text/plain) 2011-07-26 13:45 UTC, David Disseldorp	no flags	Details
Patch to make queue_io_read() safe for reentry (5.27 KB, patch) 2011-07-31 02:08 UTC, David Disseldorp	no flags	Details
Patch to make queue_io_read() safe for reentry (5.54 KB, patch) 2011-07-31 13:09 UTC, David Disseldorp	no flags	Details
smbclient workload used to trip the bug (1.13 KB, application/x-shellscript) 2011-08-01 12:09 UTC, David Disseldorp	no flags	Details
Patch to make queue_io_read() safe for reentry - rb2 (5.60 KB, patch) 2011-08-01 15:43 UTC, David Disseldorp	no flags	Details
Show Obsolete (2) View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description David Disseldorp 2011-07-26 13:43:19 UTC

ctdb-1.0.114.2
samba-3.4.3 (with 

On rare occasions while the cluster is under heavy load we are seeing the recovery daemon shutdown as a result of encountering "Invalid packet of length 0" errors.

Adding Rusty's "Report client for queue errors" patch and dissecting the packet buffers we observe the following:

Bad packet data block 1: line number 129
=========================================
20:37:19: client/ctdb_client.c:789 control timed out. reqid:52007 opcode:9 dstnode:4026531841
20:37:19: client/ctdb_client.c:900 ctdb_control_recv failed
20:37:19: client/ctdb_client.c:1214 ctdb_control for getdbmap failed ret:-1 res:-1
20:37:19: server/ctdb_recoverd.c:818 Unable to get dbids from local node
20:37:19: to-ctdbd: Invalid packet of length 0 (nread = 13800, totread = 21776, partlen = 15920)
# packet A, byte offset 0, char number 48
58 5d 00 00     length;         = 23896  /* bogus - next header is at offset 15872 */
42 44 54 43     ctdb_magic;     = CTDB
01 00 00 00     ctdb_version;   = 1
06 f8 3a 12     generation;     = 0x123af806
05 00 00 00     operation;      = CTDB_REQ_MESSAGE
00 00 00 00     destnode;       = 0
01 00 00 00     srcnode;        = 1
00 00 00 00     reqid;          = 0

# packet B, byte offset 15872 != (packetA + packetA.length), char number 47664
30 00 00 00     length;         = 48
42 44 54 43     ctdb_magic;     = CTDB
01 00 00 00     ctdb_version;   = 1
06 f8 3a 12     generation;     = 0x123af806
08 00 00 00     operation;      = CTDB_REPLY_CONTROL
00 00 00 00     destnode;       = 0
01 00 00 00     srcnode;        = 1
28 cb 00 00     reqid;          = 0xcb28

# packet C, byte offset 15920 == (packetB + packetB.length), char number 47808
10 55 00 00     length;         = 21776
42 44 54 43     ctdb_magic;     = CTDB
01 00 00 00     ctdb_version;   = 1
06 f8 3a 12     generation;     = 0x123af806
05 00 00 00     operation;      = CTDB_REQ_MESSAGE
00 00 00 00     destnode;       = 0
01 00 00 00     srcnode;        = 1
00 00 00 00     reqid;          = 0

# END, byte offset 37696 == (packetC + packetC.length), char number 113136


Bad packet data block 2: line number 259
=========================================
20:42:34: to-ctdbd: Invalid packet of length 0 (nread = 2192, totread = 2192, partlen = 0)
00 00 00 00 
00 00 49 44 
2f 30 3a 32 
34 31 36 33 
2f 31 30 30 
00 1f 00 00 
00 00 00 00 
00 0f 00 00
...
Buffer is Completely bogus, no CTDB magic to be seen


Bad packet data block 3: line number 375
=========================================
21:02:43: client/ctdb_client.c:789 control timed out. reqid:96097 opcode:9 dstnode:4026531841
21:02:43: client/ctdb_client.c:900 ctdb_control_recv failed
21:02:43: client/ctdb_client.c:1214 ctdb_control for getdbmap failed ret:-1 res:-1
21:02:43: server/ctdb_recoverd.c:818 Unable to get dbids from local node
21:02:43: to-ctdbd: Invalid packet of length 0 (nread = 13360, totread = 22696, partlen = 15920)
# packet A, byte offset 0, char number 48
a8 62 00 00     length;         = 25256
42 44 54 43     ctdb_magic;     = CTDB
01 00 00 00     ctdb_version;   = 1
a6 b7 2c 54     generation;     = 0x542cb7a6
05 00 00 00     operation;      = CTDB_REQ_MESSAGE
00 00 00 00     destnode;       = 0
01 00 00 00     srcnode;        = 1
00 00 00 00     reqid;          = 0

# packet B, byte offset 15872 != (packetA + packetA.length), char number 47664
30 00 00 00     length;         = 48
42 44 54 43     ctdb_magic;     = CTDB
01 00 00 00     ctdb_version;   = 1
a6 b7 2c 54     generation;     = 0x542cb7a6
08 00 00 00     operation;      = CTDB_REPLY_CONTROL
00 00 00 00     destnode;       = 0
01 00 00 00     srcnode;        = 1
62 77 01 00     reqid;          = 0x00017762

# packet C, byte offset 15920 == (packetB + packetB.length), char number 47808
a8 58 00 00     length;         = 22696
42 44 54 43     ctdb_magic;     = CTDB
01 00 00 00     ctdb_version;   = 1
a6 b7 2c 54     generation;     = 0x542cb7a6
05 00 00 00     operation;      = CTDB_REQ_MESSAGE
00 00 00 00     destnode;       = 0
01 00 00 00     srcnode;        = 1
00 00 00 00     reqid;          = 0

# END, byte offset 38616 == (packetC + packetC.length), char number 115896


Bad packet data block 4: line number 480
=========================================
03:47:26: to-ctdbd: Invalid packet of length 0 (nread = 2288, totread = 2288, partlen = 0)
00 00 00 00 
00 00 49 44 
2f 30 3a 32 
33 30 38 31 
2f 31 30 30 
00 1f 00 00 
00 00 00 00 
00 0f 00 00 
...
Buffer is Completely bogus, no CTDB magic to be seen

Comment 1 David Disseldorp 2011-07-26 13:45:22 UTC

Created attachment 6715 [details]
ctdb log with dumped buffers for bad packets

Comment 2 David Disseldorp 2011-07-26 15:31:28 UTC

Following discussion on IRC; Rusy and Ronnie have encountered the same issue however report it fixed with the vacuuming rewrite, more specifically commit 3930c7796b72bbf275bbca8aaeceec3e705a964b. This commit is included in ctdb-1.0.114.2.

Comment 3 David Disseldorp 2011-07-31 01:52:26 UTC

The underlying problem appears to be the core ctdb IO layer, and AFAICT it's been there for a long time - queue_io_read() is not safe for reentry via the queue callback.

Consider the following scenario:
- ctdbd sends 2 CTDB_SRVID_VACUUM_FETCH messages to the recovery
  daemon (recoverd), the payload of both messages is large, resulting
  in a multiple write() attempts in queue_io_write().

- 1.5 (not yet the full 2) packets worth of data arrives at the other
  end of the pipe for processing by recoverd.

- recoverd queue_io_read() pulls the full 1.5 packets worth of data
  off the pipe.

- the first packet is pulled into a buffer and handed to
  queue->callback().

- In the meantime the remaining 0.5 of the second
  CTDB_SRVID_VACUUM_FETCH message arrives.

- the recoverd vacuum_fetch_handler() calls ctdb_ctrl_getdbname()
  from within the queue->callback().

- ctdb_ctrl_getdbname() sends a CTDB_CONTROL_GET_DBNAME message and
  *reenters* queue_io_read() awaiting the response.

- On reentry to queue_io_read() the 0.5 CTDB_SRVID_VACUUM_FETCH packet
  already pulled off the wire is unaccounted for. The remaining 0.5
  packet available is processed as if it were a new header, with the
  first four bytes assumed to be the packet length.

- At this point the packet stream is wrongly aligned and IPC comes to
  a grinding halt with "Invalid packet of length 0" errors.

Comment 4 David Disseldorp 2011-07-31 02:08:36 UTC

Created attachment 6737 [details]
Patch to make queue_io_read() safe for reentry

The following changes since commit 14a1bf4fff4ce4a48704d09324e4757fbf043f92:

  Version 1.0.114.2: Start a new series of versions 1.0.114.X based on 1.0.114 (2011-05-20 16:21:04 +0200)

are available in the git repository at:
  git://oss.sgi.com/ddiss/ctdb 114.2-bso8319-queue_io_read

David Disseldorp (1):
      io: Make queue_io_read() safe for reentry

 common/ctdb_io.c |  121 +++++++++++++++++++++++------------------------------
 1 files changed, 53 insertions(+), 68 deletions(-)

Comment 5 David Disseldorp 2011-07-31 13:01:01 UTC

(In reply to comment #4)
> Created attachment 6737 [details]
> Patch to make queue_io_read() safe for reentry

Oops, looks like I got a bit carried away and forgot to add checks for packet buffer allocation failures. New patch to come...

Comment 6 David Disseldorp 2011-07-31 13:09:09 UTC

Created attachment 6738 [details]
Patch to make queue_io_read() safe for reentry

The following changes since commit 14a1bf4fff4ce4a48704d09324e4757fbf043f92:

  Version 1.0.114.2: Start a new series of versions 1.0.114.X based on 1.0.114 (2011-05-20 16:21:04 +0200)

are available in the git repository at:
  git://oss.sgi.com/ddiss/ctdb 114.2-bso8319-queue_io_read_rb1

David Disseldorp (1):
      io: Make queue_io_read() safe for reentry

 common/ctdb_io.c |  130 ++++++++++++++++++++++++++----------------------------
 1 files changed, 62 insertions(+), 68 deletions(-)

Comment 7 David Disseldorp 2011-08-01 12:09:25 UTC

Created attachment 6742 [details]
smbclient workload used to trip the bug

Comment 8 David Disseldorp 2011-08-01 15:43:13 UTC

Created attachment 6743 [details]
Patch to make queue_io_read() safe for reentry - rb2

The previous patch incorrectly allowed for a 0 byte read if only the packet length bytes were available - fixed in this patch version (rb2).

--
The following changes since commit 14a1bf4fff4ce4a48704d09324e4757fbf043f92:

  Version 1.0.114.2: Start a new series of versions 1.0.114.X based on 1.0.114 (2011-05-20 16:21:04 +0200)

are available in the git repository at:
  git://oss.sgi.com/ddiss/ctdb 114.2-bso8319-queue_io_read_rb2

David Disseldorp (1):
      io: Make queue_io_read() safe for reentry

 common/ctdb_io.c |  129 ++++++++++++++++++++++++++----------------------------
 1 files changed, 62 insertions(+), 67 deletions(-)

Comment 9 David Disseldorp 2011-08-04 14:15:46 UTC

Awaiting review, adding Rusty and Ronnie to cc.

Comment 10 Ronnie Sahlberg 2011-08-05 05:51:53 UTC

Thanks.

Patch has been merged to master and 1.0.114 as well as my 1.2 branch

Comment 11 David Disseldorp 2011-08-05 08:39:26 UTC

(In reply to comment #10)
> Thanks.
> 
> Patch has been merged to master and 1.0.114 as well as my 1.2 branch

Thanks Ronnie, closing.