The Samba-Bugzilla – Bug 1442
rsync sender appears to hang when receiver encounters an error
Last modified: 2007-10-18 11:38:17 UTC
As sent to the rsync mailing list on May 4, 2004:
(rsync 2.6.2 between two Linux systems) ...
What happens is that the sending rsync just appears to hang. The
receiving rsync processes are no longer running when I go to look,
typically the next morning.
I finally managed to capture an strace of the receiving rsync
processes, which I've attached below.
The child receiver process gets an error return on a write(),
informs its parent by sending it two error messages, and calls
exit(11). The parent receiver process reads the first of the
two error messages, but not the second. It handles the SIGCHLD
signal after read()ing the first error message. Thereafter, the
parent's select() no longer includes the file descriptor on which
(presumably) the second error message is waiting. And it never
informs the sending rsync process of the error.
(strace output) ...
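For readers following the sequence above: data that a child writes to a pipe before exiting stays queued and remains readable until EOF, so a second message is not lost just because the writer has gone away. The following is a minimal Python sketch of that behaviour only, not rsync's actual code. The message strings and the function names drain_messages and simulate are hypothetical, and a single process stands in for the parent/child pair.

```python
import os

def drain_messages(read_fd):
    """Read newline-terminated messages from read_fd until EOF."""
    buf = b""
    while True:
        chunk = os.read(read_fd, 4096)
        if not chunk:  # EOF: every writer has closed its end of the pipe
            break
        buf += chunk
    return [line.decode() for line in buf.splitlines()]

def simulate():
    read_fd, write_fd = os.pipe()
    # Stand-in for the child receiver: queue two messages, then "exit"
    # (modelled here by closing the write end). Message text is hypothetical.
    os.write(write_fd, b"first error message\n")
    os.write(write_fd, b"second error message\n")
    os.close(write_fd)
    # Stand-in for the parent: both messages are still readable afterwards.
    messages = drain_messages(read_fd)
    os.close(read_fd)
    return messages
```

Calling simulate() returns both queued messages, which is why the question becomes what the parent is doing instead of reading the second one.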
Created attachment 540 [details]
strace of receiver process
It's not a bug that the select() no longer includes the file descriptor for
reading the next message because rsync is busy trying to write the first message
over the socket to the sending (client) side. As soon as that were to succeed,
it would read the next message.
So, the real question is, Why can't rsync send its message over the socket? Is
the remote-shell process hung? What is happening on the sending side?
rsync will definitely hang in debug mode, because it tries to write more than
the TCP window before it is willing to read. The other side is doing the same
thing, so both block on writes in a deadly embrace. I doubt this is the cause
here, but the I/O loop needs to be fixed. A recent CVS commit may have improved
things, but I haven't looked at it.
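To illustrate the "deadly embrace" described above, here is a minimal Python sketch (an assumption-laden illustration, not rsync's I/O loop) of the usual way to avoid it: never block in a write without also being willing to read. The names exchange and demo are hypothetical, and a local socket pair stands in for the connection between the two rsync processes.

```python
import select
import socket
import threading

def exchange(sock, outgoing, expected_in):
    """Send `outgoing` and collect `expected_in` bytes, never blocking on one
    direction while the other could make progress."""
    sock.setblocking(False)
    view = memoryview(outgoing)
    sent = 0
    received = bytearray()
    while sent < len(view) or len(received) < expected_in:
        want_read = [sock] if len(received) < expected_in else []
        want_write = [sock] if sent < len(view) else []
        readable, writable, _ = select.select(want_read, want_write, [])
        if readable:
            try:
                chunk = sock.recv(65536)
            except BlockingIOError:
                chunk = None  # spurious wakeup; try again on the next pass
            if chunk == b"":
                raise ConnectionError("peer closed early")
            if chunk:
                received += chunk
        if writable:
            try:
                sent += sock.send(view[sent:sent + 65536])
            except BlockingIOError:
                pass  # buffer filled up again; wait for the next wakeup
    return bytes(received)

def demo(size=1 << 20):
    # Both ends send far more than the socket buffer holds, at the same time.
    a, b = socket.socketpair()
    results = {}

    def run(name, sock, payload):
        results[name] = exchange(sock, payload, size)

    ta = threading.Thread(target=run, args=("a", a, b"A" * size))
    tb = threading.Thread(target=run, args=("b", b, b"B" * size))
    ta.start()
    tb.start()
    ta.join()
    tb.join()
    a.close()
    b.close()
    return results
```

If each side instead used a blocking sendall() of the full payload before reading anything, both would be expected to stall once the socket buffers filled, which is the situation described for debug mode.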
I met a similar issue with rsync 2.6.2 at both ends.
Because of a silly partitioning mistake of mine, it took me a very long time to realize that the destination disk was completely full. One thing is sure: rsync did not help me _at all_ to realize that. Instead of emitting some "disk error/full" message as one could hope, it just silently _hung_ (both sender and receiver sleeping on select()).
As soon as some space was freed on the destination, everything was fine: I am almost sure that small socket buffers and deadlocks were totally unrelated to my "hanging rsync" problem. It seems to be only an issue of error handling.
The CVS version has been changed to better propagate fatal errors across the
socket. I'd be interested to know if the CVS version still has this problem (as
I cannot duplicate it).
Closing due to lack of response from bug reporter.
This problem still persists in 2.6.9 (protocol version 29). It is strangely hard to reproduce, but I do have a reproducible case at my disposal. I would like to see this bug reopened. I can do any experimentation required to move things along.
If you can reproduce this bug, please let me know how I can help you figure out what is going on. (Perhaps by looking over strace output, or whatever.)
Created attachment 2352 [details]
server-side syscall trace
Attached is the server-side strace during a filesystem full situation. I've noted where the process hangs, at which point I hit control-C on the client side.
p.s., I'm using rsync-2.6.9-2.fc5 (as distributed with Fedora Core 5).
Created attachment 2945 [details]
This may be related - or not. I see a hang while copying files between two systems in daemon mode, using 3.0.0.pre2. I'll attach gdb backtraces of both the sender and receiver when they're hung.
Created attachment 2946 [details]