Bug 1442 - rsync sender appears to hang when receiver encounters an error
Status: REOPENED
Product: rsync
Classification: Unclassified
Component: core
Version: 2.6.2
Hardware: All Linux
Importance: P3 normal
Target Milestone: ---
Assigned To: Wayne Davison
QA Contact: Rsync QA Contact
Depends on:
Blocks:
Reported: 2004-06-08 08:00 UTC by Tom Pinkl
Modified: 2007-10-18 11:38 UTC (History)

Attachments
strace of receiver process (1.32 KB, application/octet-stream)
2004-06-08 13:26 UTC, Tom Pinkl
server-side syscall trace (2.65 KB, text/plain)
2007-03-28 18:41 UTC, Ahmon Dancy
receiver-backtrace.txt (2.33 KB, text/plain)
2007-10-18 11:36 UTC, Matt Domsch
sender-backtrace.txt (1.73 KB, text/plain)
2007-10-18 11:37 UTC, Matt Domsch

Description Tom Pinkl 2004-06-08 08:00:27 UTC
As sent to the rsync mailing list on May 4, 2004:

(rsync 2.6.2 between two Linux systems) ...

What happens is that the sending rsync just appears to hang.  The
receiving rsync processes are no longer running when I go to look,
typically the next morning.

I finally managed to capture an strace of the receiving rsync
processes, which I've attached below.

The child receiver process gets an error return on a write(),
informs its parent by sending it two error messages, and calls
exit(11).  The parent receiver process reads the first of the
two error messages, but not the second.  It handles the SIGCHLD
signal after read()ing the first error message.  Thereafter, the
parent's select() no longer includes the file descriptor on which
(presumably) the second error message is waiting.  And it never
informs the sending rsync process of the error.

(strace output) ...
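
A minimal sketch of the pipe behavior this description relies on, written in Python rather than rsync's C and not taken from rsync's source: bytes written to a pipe stay queued after the writer exits, so a reader that keeps reading until EOF still sees both messages. The message texts and function names are made up for illustration, and closing the write end stands in for the child calling exit().

```python
import os


def read_messages_until_eof(fd):
    """Read newline-terminated messages from fd until every writer has closed it."""
    buf = b""
    while True:
        chunk = os.read(fd, 4096)
        if not chunk:  # empty read means EOF: no writers remain
            break
        buf += chunk
    return [line.decode() for line in buf.splitlines()]


def demo():
    read_end, write_end = os.pipe()
    # Hypothetical message texts; the real ones are in the attached strace.
    os.write(write_end, b"first error message\n")
    os.write(write_end, b"second error message\n")
    os.close(write_end)  # stands in for the child exiting
    try:
        return read_messages_until_eof(read_end)
    finally:
        os.close(read_end)


if __name__ == "__main__":
    for message in demo():
        print(message)
```

The second message is not lost when the writer goes away; it is only unread until the reader comes back to that descriptor.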
Comment 1 Tom Pinkl 2004-06-08 13:26:35 UTC
Created attachment 540
strace of receiver process
Comment 2 Wayne Davison 2004-06-08 14:37:41 UTC
It's not a bug that the select() no longer includes the file descriptor for
reading the next message because rsync is busy trying to write the first message
over the socket to the sending (client) side.  As soon as that were to succeed,
it would read the next message.

So, the real question is, Why can't rsync send its message over the socket?  Is
the remote-shell process hung?  What is happening on the sending side?
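
The behavior described in this comment can be sketched as follows. This is an illustration in Python, not rsync's actual I/O code; relay_once, MSG_SIZE, the helper names, and the fixed-size message framing are assumptions made for the example. While a message is waiting to be sent, only the outgoing socket is passed to select(); the message pipe is watched again once the send completes.

```python
import os
import select
import socket

MSG_SIZE = 8  # hypothetical fixed message size, for the example only


def relay_once(msg_fd, out_sock, pending, timeout=0.05):
    """Run one loop iteration and return the bytes still waiting to be sent."""
    if pending:
        # A message is in flight: watch only the outgoing socket, so the
        # message pipe is deliberately left out of the select() set.
        _, writable, _ = select.select([], [out_sock], [], timeout)
        if writable:
            sent = out_sock.send(pending)
            pending = pending[sent:]
        return pending
    # Nothing in flight: the message pipe is watched for the next message.
    readable, _, _ = select.select([msg_fd], [], [], timeout)
    if readable:
        pending = os.read(msg_fd, MSG_SIZE)
    return pending


def fill_send_buffer(sock):
    """Write until the kernel refuses more, as if the peer had stopped reading."""
    try:
        while True:
            sock.send(b"x" * 4096)
    except BlockingIOError:
        pass


def drain(sock):
    """Read everything currently queued on a non-blocking socket."""
    try:
        while sock.recv(65536):
            pass
    except BlockingIOError:
        pass


def demo():
    msg_r, msg_w = os.pipe()
    out_sock, peer = socket.socketpair()
    out_sock.setblocking(False)
    peer.setblocking(False)
    try:
        os.write(msg_w, b"MESSAGE1" + b"MESSAGE2")
        fill_send_buffer(out_sock)                    # peer is not reading
        pending = relay_once(msg_r, out_sock, b"")    # picks up MESSAGE1
        for _ in range(3):                            # cannot send: stalls
            pending = relay_once(msg_r, out_sock, pending)
        stalled = pending
        second_waiting = bool(select.select([msg_r], [], [], 0)[0])
        drain(peer)                                   # peer starts reading
        after_send = relay_once(msg_r, out_sock, pending)
        final = relay_once(msg_r, out_sock, after_send)
        return stalled, second_waiting, after_send, final
    finally:
        os.close(msg_r)
        os.close(msg_w)
        out_sock.close()
        peer.close()


if __name__ == "__main__":
    print(demo())
```

In this sketch the second message stays queued for as long as the peer refuses to read, which is why the state of the sending side is the interesting question.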
Comment 3 Carson Gaspar 2004-06-08 19:02:31 UTC
rsync will definitely hang in debug mode, because it tries to write more than
the TCP window before it is willing to read. The other side is doing the same
thing, so both block on writes in a deadly embrace. I doubt this is the cause
here, but the I/O loop needs to be fixed. A recent CVS commit may have improved
things, but I haven't looked at it.
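
One common way to avoid this kind of mutual write stall is to wait for readability and writability together, so that incoming data is always drained while outgoing data is still queued. The sketch below is illustrative Python, not the change that went into rsync; the exchange function, the 1 MiB payload, and the socketpair standing in for the TCP connection are all assumptions.

```python
import select
import socket
import threading


def exchange(sock, outgoing, expect_len, timeout=5.0):
    """Send `outgoing` and collect `expect_len` bytes, never waiting on writes alone."""
    sock.setblocking(False)
    view = memoryview(outgoing)
    received = bytearray()
    while len(view) > 0 or len(received) < expect_len:
        rlist = [sock] if len(received) < expect_len else []
        wlist = [sock] if len(view) > 0 else []
        readable, writable, _ = select.select(rlist, wlist, [], timeout)
        if not readable and not writable:
            raise TimeoutError("no progress within timeout")
        if readable:
            try:
                chunk = sock.recv(65536)
            except BlockingIOError:
                chunk = None  # spurious wakeup, try again next round
            if chunk == b"":
                break  # peer closed the connection
            if chunk:
                received += chunk
        if writable:
            try:
                sent = sock.send(view[:65536])
            except BlockingIOError:
                sent = 0
            view = view[sent:]
    return bytes(received)


def demo(size=1 << 20):
    """Both ends send `size` bytes at once, more than the socket buffers hold."""
    left, right = socket.socketpair()
    results = {}

    def run(name, sock, fill):
        results[name] = exchange(sock, fill * size, size)

    threads = [
        threading.Thread(target=run, args=("left", left, b"L")),
        threading.Thread(target=run, args=("right", right, b"R")),
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    left.close()
    right.close()
    return len(results["left"]), len(results["right"])


if __name__ == "__main__":
    print(demo())
```

If each side instead did a blocking send of its whole payload before any recv, both would block as soon as the buffers filled, which is the embrace described above.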
Comment 4 Marc.Herbert 2004-06-15 16:33:19 UTC
I met a similar issue with rsync 2.6.2 at both ends.

Because of a silly partitioning mistake of mine, it took me a very long time to realize that the destination disk was completely full. One thing is sure: rsync did not help me _at all_ to realize that. Instead of emitting some "disk error/full" message as one could hope, it just silently _hung_ (both sender and receiver sleeping on select()).

As soon as some space was freed on the destination, everything was fine: I am almost sure that small socket buffers and deadlocks were totally unrelated to my "hanging rsync" problem. It seems to be only an issue of error handling.
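
As a rough illustration of the error handling hoped for here (a Python sketch, not rsync's code), a writer can turn a failed write into an explicit report instead of going quiet. The names write_or_report and full_disk_write, and the message wording, are hypothetical; the only real interfaces used are OSError and errno.ENOSPC.

```python
import errno
import os


def write_or_report(write_fn, data, report_fn):
    """Write all of `data`; on failure, report the reason and return False."""
    view = memoryview(data)
    try:
        while len(view) > 0:
            written = write_fn(view)  # may be a partial write
            view = view[written:]
    except OSError as exc:
        if exc.errno == errno.ENOSPC:
            report_fn("write failed: destination is out of space")
        else:
            report_fn(f"write failed: {exc}")
        return False
    return True


def full_disk_write(_chunk):
    """Stand-in for a write to a full filesystem (assumption for the demo)."""
    raise OSError(errno.ENOSPC, os.strerror(errno.ENOSPC))


if __name__ == "__main__":
    write_or_report(full_disk_write, b"file data", print)
```

With a real descriptor, write_fn could be something like `lambda chunk: os.write(fd, chunk)`; the point is only that the failure reason reaches whoever is waiting on the other end.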

Comment 5 Wayne Davison 2004-08-03 11:07:39 UTC
The CVS version has been changed to better propagate fatal errors across the
socket.  I'd be interested to know if the CVS version still has this problem (as
I cannot duplicate it).
Comment 6 Wayne Davison 2004-09-20 23:37:01 UTC
Closing due to lack of response from bug reporter.
Comment 7 Ahmon Dancy 2007-03-05 12:35:20 UTC
This problem still persists in 2.6.9 (protocol version 29).  It is strangely hard to reproduce, but I do have a reproducible case at my disposal.  I would like to see this bug reopened.  I can do any experimentation required to move things along.

Comment 8 Wayne Davison 2007-03-27 17:46:51 UTC
If you can reproduce this bug, please let me know how I can help you figure out what is going on.  (Perhaps by looking over strace output, or whatever.)
Comment 9 Ahmon Dancy 2007-03-28 18:41:49 UTC
Created attachment 2352
server-side syscall trace 

Attached is the server-side strace during a filesystem full situation.  I've noted where the process hangs, at which point I hit control-C on the client side.
Comment 10 Ahmon Dancy 2007-03-28 18:42:38 UTC
p.s., I'm using rsync-2.6.9-2.fc5 (as distributed with Fedora Core 5).
Comment 11 Matt Domsch 2007-10-18 11:36:43 UTC
Created attachment 2945
receiver-backtrace.txt

This may be related - or not.  I see a hang while copying files between two systems in daemon mode, using 3.0.0.pre2.  I'll attach gdb backtraces of both the sender and receiver when they're hung.
Comment 12 Matt Domsch 2007-10-18 11:37:51 UTC
Created attachment 2946
sender-backtrace.txt