As sent to the rsync mailing list on May 4, 2004: (rsync 2.6.2 between two Linux systems) ... What happens is that the sending rsync just appears to hang. The receiving rsync processes are no longer running when I go to look, typically the next morning. I finally managed to capture an strace of the receiving rsync processes, which I've attached below. The child receiver process gets an error return on a write(), informs its parent by sending it two error messages, and calls exit(11). The parent receiver process reads the first of the two error messages, but not the second. It handles the SIGCHLD signal after read()ing the first error message. Thereafter, the parent's select() no longer includes the file descriptor on which (presumably) the second error message is waiting. And it never informs the sending rsync process of the error. (strace output) ...
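The sequence described above (a child that queues two error messages and then exits, and a parent that still has to read both) can be sketched as follows. This is a minimal Python sketch and not rsync's code: the pipe, the message texts, and the function name are assumptions made for illustration; only the exit status 11 comes from the report. The parent keeps reading until end-of-file instead of stopping when the child exits.

```python
import os


def collect_child_errors():
    """Fork a child that writes two messages and exits with status 11.

    The parent drains the pipe to EOF before reaping the child, so a
    message queued just before exit is not lost.
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: report two (hypothetical) errors, then exit.
        os.close(read_fd)
        os.write(write_fd, b"error 1\n")
        os.write(write_fd, b"error 2\n")
        os._exit(11)

    # Parent: close our copy of the write end so read() can return EOF.
    os.close(write_fd)
    chunks = []
    while True:
        data = os.read(read_fd, 4096)
        if not data:  # EOF: the child has closed its end
            break
        chunks.append(data)
    os.close(read_fd)

    _, status = os.waitpid(pid, 0)
    messages = b"".join(chunks).decode().splitlines()
    return messages, os.WEXITSTATUS(status)
```

Data written to a pipe stays readable after the writer exits, so the child's exit alone does not have to end the parent's reads.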
Created attachment 540 [details] strace of receiver process
It's not a bug that the select() no longer includes the file descriptor for reading the next message: rsync is busy trying to write the first message over the socket to the sending (client) side. As soon as that write succeeds, it will read the next message. So the real question is: why can't rsync send its message over the socket? Is the remote-shell process hung? What is happening on the sending side?
rsync will definitely hang in debug mode, because it tries to write more than the TCP window before it is willing to read. The other side is doing the same thing, so both block on writes in a deadly embrace. I doubt this is the cause here, but the I/O loop needs to be fixed. A recent CVS commit may have improved things, but I haven't looked at it.
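To illustrate the deadlock described above and one common way around it, here is a minimal Python sketch. It is not rsync's I/O loop: the function names, the 64 KiB chunk size, and the use of a local socket pair are assumptions for illustration. Each side asks select() about reading and writing at the same time, so neither side blocks in a write while its peer is also waiting to write.

```python
import select
import socket
import threading


def exchange(sock, outgoing, expected_in):
    """Send `outgoing` while receiving until `expected_in` bytes arrive."""
    sock.setblocking(False)
    sent = 0
    received = bytearray()
    while sent < len(outgoing) or len(received) < expected_in:
        want_read = [sock] if len(received) < expected_in else []
        want_write = [sock] if sent < len(outgoing) else []
        readable, writable, _ = select.select(want_read, want_write, [])
        if readable:
            chunk = sock.recv(65536)
            if not chunk:
                raise ConnectionError("peer closed early")
            received += chunk
        if writable:
            try:
                sent += sock.send(outgoing[sent:sent + 65536])
            except BlockingIOError:
                pass  # buffer filled up again; retry after the next select()
    return bytes(received)


def run_pair(size):
    """Run both ends at once, each sending `size` bytes to the other."""
    left, right = socket.socketpair()
    out_left = b"L" * size
    out_right = b"R" * size
    result = {}

    def right_side():
        result["right"] = exchange(right, out_right, size)

    worker = threading.Thread(target=right_side)
    worker.start()
    result["left"] = exchange(left, out_left, size)
    worker.join()
    left.close()
    right.close()
    return result["left"], result["right"]
```

Calling `run_pair(1 << 20)` pushes 1 MiB in each direction, which is normally more than the socket buffers hold. If both sides instead used a blocking `sendall()` before reading anything, they would be expected to stall in the way described above.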
I ran into a similar issue with rsync 2.6.2 at both ends. Because of a silly partitioning mistake of mine, it took me a very long time to realize that the destination disk was completely full. One thing is sure: rsync did not help me _at all_ to realize that. Instead of emitting some "disk error/full" message as one could hope, it just silently _hung_ (both sender and receiver sleeping on select()). As soon as some space was freed on the destination, everything was fine. I am almost sure that small socket buffers and deadlocks were totally unrelated to my "hanging rsync" problem. It seems to be only an issue of error handling.
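As an illustration of the kind of error reporting hoped for above, here is a minimal Python sketch that turns a failed write() into a readable message. It is not rsync's error path: the function names and the message wording are hypothetical; the only fact relied on is that a write to a full filesystem fails with errno ENOSPC.

```python
import errno
import os


def describe_write_failure(exc):
    """Turn an OSError from write() into a message a peer could display."""
    if exc.errno == errno.ENOSPC:
        return "write failed: destination is full (ENOSPC)"
    return "write failed: %s" % os.strerror(exc.errno)


def write_all(fd, data):
    """Write all of `data` to `fd`.

    Returns None on success, or an error message on failure, so the
    caller can forward the message instead of going quiet.
    """
    view = memoryview(data)
    try:
        while view:
            written = os.write(fd, view)
            view = view[written:]  # drop the bytes already written
    except OSError as exc:
        return describe_write_failure(exc)
    return None
```

The point of returning a message rather than exiting is that the process that hit the error still has something concrete to send to the other side before it stops.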
The CVS version has been changed to better propagate fatal errors across the socket. I'd be interested to know if the CVS version still has this problem (as I cannot duplicate it).
Closing due to lack of response from bug reporter.
This problem still persists in 2.6.9 (protocol version 29). It is strangely hard to reproduce, but I do have a reproducible case at my disposal. I would like to see this bug reopened. I can do any experimentation required to move things along.
If you can reproduce this bug, please let me know how I can help you figure out what is going on. (Perhaps by looking over strace output, or whatever.)
Created attachment 2352 [details] server-side syscall trace
Attached is the server-side strace during a filesystem full situation. I've noted where the process hangs, at which point I hit control-C on the client side.
p.s., I'm using rsync-2.6.9-2.fc5 (as distributed with Fedora Core 5).
Created attachment 2945 [details] receiver-backtrace.txt
This may be related - or not. I see a hang while copying files between two systems in daemon mode, using 3.0.0.pre2. I'll attach gdb backtraces of both the sender and receiver when they're hung.
Created attachment 2946 [details] sender-backtrace.txt
There are a lot of bug reports related to rsync hanging mysteriously, some of which may be duplicates of each other:
https://bugzilla.samba.org/show_bug.cgi?id=1442
https://bugzilla.samba.org/show_bug.cgi?id=2957
https://bugzilla.samba.org/show_bug.cgi?id=9164
https://bugzilla.samba.org/show_bug.cgi?id=10035
https://bugzilla.samba.org/show_bug.cgi?id=10092
https://bugzilla.samba.org/show_bug.cgi?id=10518
https://bugzilla.samba.org/show_bug.cgi?id=10950
https://bugzilla.samba.org/show_bug.cgi?id=11166
https://bugzilla.samba.org/show_bug.cgi?id=12732
https://bugzilla.samba.org/show_bug.cgi?id=13109