I have been using rsync for quite some time for automated backups; recently the backups have begun failing for no apparent reason. The appropriate environment variable is set for SSH, and the command line used is of the form:

  rsync --includes=includes --excludes=excludes --timeout=120 --partial --stats --temp-dir="$TARGET/tmp" -a --delete --link-dest="$TARGET/$NAME_OF_LAST" "$SOURCE" "$TARGET/InProgress"

where the variables are self-explanatory. With the -v option I get the following error:

  rsync error: timeout in data send/receive (code 30) at io.c(143)
  rsync: connection unexpectedly closed (4023759 bytes read so far)
  rsync: writefd_unbuffered failed to write 2248 bytes: phase "unknown": Broken pipe
  rsync error: error in rsync protocol data stream (code 12) at io.c(836)

Without the -v option I get a slightly different error (note the different lines of io.c):

  rsync error: timeout in data send/receive (code 30) at io.c(143)
  rsync: connection unexpectedly closed (3925349 bytes read so far)
  rsync error: error in rsync protocol data stream (code 12) at io.c(342)
  rsync: connection unexpectedly closed (3925349 bytes read so far)
  rsync error: error in rsync protocol data stream (code 12) at io.c(342)

It appears that all the files have been successfully transferred, but somehow rsync is having difficulty "hanging up" the connection. This problem first surfaced with earlier versions of rsync but seems to be endemic. Any ideas of what the cause could be, or what tests you would like me to carry out for a better clue?

I am making use of the --link-dest option, the destination disk is only 60% full, and the transfer is carried out from a remote machine to the local machine over ssh.
See the revised issues page for a script you can try that might help show what was happening right before the timeout: http://rsync.samba.org/issues.html Make sure that the side you want to trace is the side with the HOST: prefix (if there is no host in your command, use "localhost:").
Closing due to lack of response from bug reporter.
I appear to be seeing exactly the same issue, particularly on hosts using 2.6.3 (to the 2.6.3 server). However, using the rsync-debug script, I find that the issue goes away! It also seems potentially timing- or transport-induced (just a gut feeling, but load and using/not using the SSH daemon transport have made the problem come and go). This has happened more with more recent versions.
I'm glad that it was not just a product of my imagination! For some reason I haven't been able to reliably reproduce the problem, and I pretty much gave up. I am currently using rsync-2.6.3 on an FC3 machine which is connecting to rsync-2.6.3 on an FC2-based machine. It doesn't seem to display the problem any more, even with large file trees, but I am not convinced the problem has gone; rather, due to some mystic alignment of the stars, it is not manifesting itself.

Your experience seems to suggest the same thing as my guesswork: that there may be some timing problem or strange interaction with SSH. If you can reliably reproduce the problem, perhaps it's worth trying it with rsh as the transport mechanism.

The other possibility is that there is some issue with the timeout option. When I used to set the timeout option to (say) 10 minutes (far longer than the actual transfer took) it would occur. I have since set the timeout to 0 (indefinite) and have never seen this problem recur. These types of intermittent s/w problems always fox me! I will try to make it go wrong by setting a timeout of 10 minutes; if that breaks things again it may be informative to people "in the know". I suggest you try the same and see if setting the timeout option has any effect. Good luck.
I haven't specified the timeout option, and still get the issue. I also have double-checked, and have reliably succeeded on all the 'straced' backups so far this week.
(In reply to comment #5)
> I haven't specified the timeout option, and still get the issue.
>
> I also have double-checked, and have reliably succeeded on all the
> 'straced' backups so far this week.

Hmm. I've not managed to make it break. I will try "loading" the machines while doing the rsyncs and see if that has any effect. The previous incarnation was running on a machine with a slower processor and less RAM than the current one. Did you try rsh as the transport on the non-straced version?
> Hmm. I've not managed to make it break. I will try "loading" the machines
> while doing the rsyncs and see if that has any effect.

Well, it seems to have no effect (apart from slowing things down). It's strange; I really can't make it break. I tried thrashing the disk and CPU of the client machine and ran the one initiating the rsync request on a lower-powered, slower machine with less RAM. I'm afraid I am going to just have to shrug and say "don't know". I never could make it break consistently.

The only thing left at the back of my head is the ssh transport. I know the behaviour of ssh when disconnected changed a while back: now, when one connects to a server and some I/O could still come back, the connection remains open and you have to explicitly disconnect with the ~. escape sequence (see "man ssh", the section "Escape Characters"). Perhaps this is the interaction that's keeping the channel open and causing rsync to barf? I don't know enough about the process to really have an idea. Good luck.
I've had rsync problems intermittently in Gentoo. Gentoo uses rsync within the Portage package management system; typically users run "emerge sync" to update their Portage tree prior to building or maintaining any packages. After spending a whole day trying to make emerge sync run without a timeout in rsync/io.c, I finally found my solution: I killed the frozen X window and mozilla on the console, and then emerge sync worked reliably. I've come to the conclusion that the timeout error is completely bogus and the real issue is something that falsely looks like a timeout.

See my bug report at Gentoo: http://bugs.gentoo.org/show_bug.cgi?id=83254

Many Gentoo users suffer from this bug. When a system is borderline, one can keep trying and eventually the rsync works. Such users don't report bugs; they just plug along and live with it. Yesterday was the first day I've seen it fail very consistently.
(In reply to comment #8)
> See my bug report at Gentoo:
> http://bugs.gentoo.org/show_bug.cgi?id=83254

That may be a different timeout than the one this bug is talking about. In this bug the rsync gets done but doesn't exit, so it then times out. The Gentoo problem appears to be a timeout in the middle of the transfer.

I can easily imagine a scenario where this could occur: the generator is busily looking for files to download and sending the index of any file that needs to be updated (followed by checksum data, if the file exists). If enough time elapses between finding changed files, it is possible for a timeout to occur. I'll look into potential fixes for this (which should be pretty easy using the latest additions to protocol 29).
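That mid-transfer scenario can be illustrated with a toy model (a hedged sketch; the function and its parameters are hypothetical illustrations, not rsync source): if the gap between two consecutive generator writes exceeds --timeout, the peer gives up even though the transfer itself is healthy.

```python
# Toy model of the generator's write pattern during a scan.
# Hypothetical names; not the actual rsync implementation.

def generator_scan(write_times, timeout):
    """write_times: seconds into the scan at which the generator next
    writes to the socket (i.e. when it finds a changed file).
    Returns 'timeout' if any gap between writes exceeds `timeout`."""
    last_write = 0.0
    for t in write_times:
        if t - last_write > timeout:
            return "timeout"  # peer saw no traffic for too long
        last_write = t
    return "ok"

# Changed files found every 30s stay under a 120s timeout...
assert generator_scan([30, 60, 90], timeout=120) == "ok"
# ...but a 10-minute stretch of unchanged files trips it,
# even though nothing is actually wrong.
assert generator_scan([30, 630], timeout=120) == "timeout"
```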
The CVS code now handles the case where the generator is taking too long between writes -- it either flushes any buffered data in the output buffer after timeout/2 seconds have gone by, or it sends a new "no-op" packet that will keep the connection alive. This requires protocol 29 (i.e. 2.6.4 on both ends of the connection). The latest "nightly" tar file has this code in it, if you'd care to try it.
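My reading of that rule, as a sketch (the function and argument names are illustrative assumptions, not the actual CVS code): once timeout/2 seconds pass with no traffic, flush any pending buffered output if there is some, otherwise emit a no-op packet.

```python
# Sketch of the keep-alive decision described above.
# Hypothetical names; not the actual rsync CVS code.

def keepalive_action(now, last_write, timeout, buffered_bytes):
    """Decide what (if anything) to send to keep the peer's
    timeout from firing. Returns 'flush', 'noop', or None."""
    if timeout and now - last_write >= timeout / 2:
        # Prefer flushing real buffered data; fall back to a no-op.
        return "flush" if buffered_bytes else "noop"
    return None  # recent traffic; nothing needed yet

# 61s of silence against --timeout=120 triggers action:
assert keepalive_action(61, 0, timeout=120, buffered_bytes=512) == "flush"
assert keepalive_action(61, 0, timeout=120, buffered_bytes=0) == "noop"
# 30s of silence is still within timeout/2:
assert keepalive_action(30, 0, timeout=120, buffered_bytes=0) is None
# With no --timeout given (0 = indefinite), no keep-alives are sent:
assert keepalive_action(9999, 0, timeout=0, buffered_bytes=0) is None
```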
I believe that using this "no-op" packet effectively implements the "keep alive" option requested in bug 1889 without requiring an explicit option to be given. Nice. I'm glad you decided to do this, given that the protocol number is being bumped anyway due to other changes.

But just to be clear - this is not a fix for the bug originally reported, where the hang occurs at the end, correct? Also - is there a default for timeout/2 if no --timeout value has been given?
(In reply to comment #11)
> I believe that using this "no-op" packet effectively implements
> the "keep alive" option requested in bug 1889 without requiring
> an explicit option be given.

It did once I moved the calls to the file-deletion passes into the generator (so that things were set up properly for the new keep-alive packet to be used).

> But just to be clear - this is not a fix for the bug originally reported,
> where the hang occurs at the end, correct?

It might have been. I am now theorizing that the original bug's timeout could have occurred during the directory-time-touching loop at the end of the generator's run. If so, the latest CVS code now handles this for a protocol-29 transfer. On this assumption, I'm closing this bug. Please reopen it if there is still a problem with 2.6.4 (you can try out the latest "nightly" tar file to test it now).

> Also - is there a default for timeout/2 if no --timeout value has been given?

If no --timeout has been given, no keep-alive packets are currently sent.
I may be able to reproduce the original situation. I can have a browser and X running with 90% load on CPU 1 and 98% load on CPU 2. Once I have waited for the emerge timestamp to age, I can try the sync again and possibly reproduce the bug. The question is, once I'm there and can reproduce the timeout from io.c, what do we want to know? Is there any stat from lsof, for example, that would help?

If you want to know what I've done to reproduce the scenario: start X, launch 2 mozilla browsers (maybe one would work), and enter a continuous stream of some character into the address bar of mozilla. I can do this by double-clicking Scroll Lock and hitting some digit after; it is the same thing as putting an object on the keyboard (or just one key) so it autorepeats some character forever. I'm only using this weird scenario to reproduce what happened to me. I believe it will also simulate other types of system saturation with different causes that also lead to the io.c timeout error.

Please let me know any data you'd like to gather, or if there is something I can add to the source to help debug it. I'm working with 2.6.3 right now.
(In reply to comment #13) > I may be able to reproduce the original situation. The original situation is a timeout at the end of all transferred data, not in the middle. Also note that you'd need to be using the very latest 2.6.4 from CVS (or the upcoming 2.6.4pre2) on BOTH ends of the transfer for any reproduction to be noteworthy. Any other scenario does not have the keep-alive fix.