Bug 1476 - Unexplained timout at end of file transfer.
Summary: Unexplained timout at end of file transfer.
Status: CLOSED FIXED
Alias: None
Product: rsync
Classification: Unclassified
Component: core (show other bugs)
Version: 2.6.3
Hardware: All Linux
: P3 major (vote)
Target Milestone: ---
Assignee: Wayne Davison
QA Contact: Rsync QA Contact
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2004-06-24 05:09 UTC by Edward Grace
Modified: 2005-04-01 11:21 UTC (History)
1 user (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Edward Grace 2004-06-24 05:09:48 UTC
I have been using rsync for quite some time for automated backups, recently the
backups have begun failing for no apparent reason.

The appropriate environment variable is set for SSH and the command line used is
of the form:

rsync --includes=includes --excludes=excludes --timeout=120 --partial --stats
--temp-dir="$TARGET/tmp"  -a --delete --link-dest="$TARGET/$NAME_OF_LAST"
"$SOURCE" "$TARGET/InProgress"

Where the variables are self explanatory.

With the -v option I get the following error:

rsync error: timeout in data send/receive (code 30) at io.c(143)
rsync: connection unexpectedly closed (4023759 bytes read so far)
rsync: writefd_unbuffered failed to write 2248 bytes: phase "unknown": Broken pipe
rsync error: error in rsync protocol data stream (code 12) at io.c(836)


Without the -v option I get the slightly different error (note the different
lines of io.c):

rsync error: timeout in data send/receive (code 30) at io.c(143)
rsync: connection unexpectedly closed (3925349 bytes read so far)
rsync error: error in rsync protocol data stream (code 12) at io.c(342)
rsync: connection unexpectedly closed (3925349 bytes read so far)
rsync error: error in rsync protocol data stream (code 12) at io.c(342)

It appears that all the files have been sucessfully transfered but somehow rsync
is having difficulty "hanging up" the connection.  This problem first surfaced
with earlier versions of rsync but seems to be endemic. 

Any ideas of what the cause could be or what tests you would like me to carry
out for a better clue?  

I have been making use of the --link-dest option, the destination disk is only
60% full, the transfer is carried out from a remote machine to the local machine
over ssh.
Comment 1 Wayne Davison 2004-06-30 13:32:43 UTC
See the revised issues page for a script that you can try using that might help
you to see what was happening right before the timeout:

http://rsync.samba.org/issues.html

Make sure that the side you want to trace is the side with the HOST: prefix (if
there is no host in your command, use "localhost:").
Comment 2 Wayne Davison 2004-09-20 23:37:36 UTC
Closing due to lack of response from bug reporter.
Comment 3 Andrew Bartlett 2005-02-08 15:45:41 UTC
I appear to be seeing exactly the same issue, particularly on hosts using 2.6.3
(to the 2.6.3 server)

However, using the rsync-debug script, I find that the issue goes away!  It also
seem potentially timing or transport induced (just a gut feeling, that load and
using/not using SSH deamon transport have made the problem come and go).

This has happend more with more recent versions.
Comment 4 Edward Grace 2005-02-10 11:10:39 UTC
I'm glad that it was not just a product of my imagination!

For some reason I havn't been able to reliably reproduce the problem.  I pretty
much gave up.  I am currently using Using rsync-2.6.3 on a FC3 machine which is
connecting to rsync-2.6.3 on an FC2 based machine.  It doesn't seem to display
the problem any more, even with large file trees, but I am not convinced the
problem has gone, rather that due to some mystic alignment of the stars it is
not manifesting itself.

Your experience seems to suggest the same things as my guesswork, that there may
be some timing problem or strange interaction with SSH.  If you can reliably
reproduce the problem perhaps it's worth trying it with rsh as the transport
mechanism.

The other possibility is that there is some issue with the timeout option.  I
When I used to set the timeout option to (say) 10 minutes (far longer than the
actual transfer took) it would occur.  I have since set the timeout to 0
(indefinite) and have never seen this problem reoccur.

These type of intermittent s/w problem always fox me!

I will try and make it go wrong by setting a timeout of 10mins.  If that breaks
things again it may be informative to people "in the know".  I suggest you try
the same, see if setting the timeout option has any effect..

Good luck.
Comment 5 Andrew Bartlett 2005-02-10 12:20:55 UTC
I havn't specified the timeout option, and still get the issue.

I also have double-checked, and reliably suceeed on all the 'straced' backups so
far this week.
Comment 6 Edward Grace 2005-02-11 01:44:59 UTC
(In reply to comment #5)
> I havn't specified the timeout option, and still get the issue.
> 
> I also have double-checked, and reliably suceeed on all the 'straced' backups so
> far this week.

Hmm.  I've not managed to make it break.  I will try "loading" the machines
while doing the rsyncs and see if that has any effect.  The previous incarnation
was running on a machine with a slower processor and less RAM than the current
version.

Did you try rsh as the transport on the non straced version?


Comment 7 Edward Grace 2005-02-11 07:38:11 UTC
> Hmm.  I've not managed to make it break.  I will try "loading" the machines
> while doing the rsyncs and see if that has any effect. 

Well. It seems to have no effect (apart from slowing things down)...

It's strange, I really can't make it break.  I tried thrashing the disk and cpu
of the client machine and ran the one initiating the rsync request on a lower
powered slower machine with less RAM.

I'm afraid I am going to just have to shrug and say "don't know"....  I never
could make it break consistently... 

The only thing left at the back of my head is the ssh transport, I know the
behaviour of ssh when disconnected changed a while back.  Now, when one connects
to a server and some I/O could come back the connection remains open and you
have to explicitly disconnect with the ~. escape sequence...

  man ssh 

  See the section "Escape Characters"

Perhaps this is the interaction that's keeping the channel open and causing
rsync to barf?  I don't know enough about the process to really have an idea...

Good luck..
Comment 8 Labrador 2005-02-25 12:53:42 UTC
I've had rsync problems intermittantly in Gentoo.

Gentoo uses rsync within portage package management system.

Typically users run "emerge sync" to update their
portage tree prior to building or maintaining any
packages.

After spending a whole day trying to make emerge sync
run without a timeout in rsync/io.c I finally found my
solution.  I killed the frozen X window and mozilla
on the console and then emerge sync worked
reliably.  I've come to the conclusion that the timeout
error is completely bogus and the real issue is
something that falsely looks like a timeout.

See my bug report at Gentoo:
http://bugs.gentoo.org/show_bug.cgi?id=83254

Many Gentoo users suffer from this bug.  When
a system is borderline, one can keep trying and
eventually the rsync works.  Such users don't report
bugs, they just plug along and live with it.
Yesterday was the first day I've seen it fail
very consistantly.
Comment 9 Wayne Davison 2005-02-25 13:22:25 UTC
(In reply to comment #8)
> See my bug report at Gentoo:
> http://bugs.gentoo.org/show_bug.cgi?id=83254

That may be a different timeout than this bug is talking about.  In this bug the
rsync gets done, but doesn't exit, so it then times out.  The Gentoo problem
appears to be a timeout in the middle of the transfer.  I can easily imagine a
scenario where this could occur:

The generator is busily looking for files to download and sending the index of
any file that needs to be updated (followed by checksum data, if the file
exists).  If enough time elapses between finding changed files, it is possible
for a timeout to occur.  I'll look into potential fixes for this (which should
be pretty easy using the latest additions to protocol-29).
Comment 10 Wayne Davison 2005-02-25 16:07:44 UTC
The CVS code now handles the case where the generator is taking too long between
writes -- it either flushes any buffered data in the output buffer after
timeout/2 seconds have gone by, or it sends a new "no-op" packet that will keep
the connection alive.  This requires protocol 29 (i.e. 2.6.4 on both ends of the
connection).  The latest "nightly" tar file has this code in it, if you'd care
to try it.
Comment 11 John Van Essen 2005-02-26 01:26:07 UTC
I believe that using this "no-op" packet effectively implements the "keep alive"
option requested in bug 1889 without requiring an explicit option be given.  Nice.

I'm glad you decided to do this, given that the protocol number is being bumped
anyway due to other changes.

But just to be clear - this is not a fix for the bug originally reported, where
the hang occurs at the end, correct?

Also - is there a default for timeout/2 if no --timeout value has been given?
Comment 12 Wayne Davison 2005-02-26 13:45:40 UTC
(In reply to comment #11)
> I believe that using this "no-op" packet effectively implements
> the "keep alive" option requested in bug 1889 without requiring
> an explicit option be given.

It did once I moved the calls to the file-deletion passes into the generator (so
that things were setup properly for the new keep-alive packet to be used).

> But just to be clear - this is not a fix for the bug originally reported,
> where the hang occurs at the end, correct?

It might have.  I am now theorizing that the original bug's timeout could have
occurred during the directory-time-touching loop at the end of the generator's
run.  If so, the latest CVS code now handles this for a protocol-29 transfer.

On this assumption, I'm closing this bug.  Please reopen it if there is still a
problem with 2.6.4 (you can try out the latest "nightly" tar file to test it now).

> Also - is there a default for timeout/2 if no --timeout value has been given?

If no --timeout has been given, no keep-alive packets are currently sent.
Comment 13 Labrador 2005-02-26 14:12:54 UTC
I may be able to reproduce the original situation.

I can have a browser and X that are running with 90% load on
CPU 1 and 98% load on CPU 2.  Once I have waited for the emerge
timestamp to age, I can try the sync again and possibly reproduce the bug.

The question is, once I'm there and can reproduce the timeout from
io.c, what do we want to know?  Is there any stat from lsof, 
for example, that would help?

If you want to know what I've done to reproduce the scenario:
start X, launch 2 mozilla browsers (maybe one would work),
enter a continuous stream of some character into the address
bar of mozilla.  I can do this with double clicking scroll lock
and hitting some digit after.  It is the same thing as putting
an object on the keyboard (or just one key) so it autorepeats
some character forever.

I'm only using this weird scenario to reproduce what
happened to me.  I believe it will also simulate other
types of system saturation with different causes that also
lead to the io.c timeout error.

Please let me know any data you'd like to gather, or if there
is something I can add to the source to help debug it.  I'm
working with 2.6.3 right now.
Comment 14 Wayne Davison 2005-02-26 14:52:57 UTC
(In reply to comment #13)
> I may be able to reproduce the original situation.

The original situation is a timeout at the end of all transferred data, not in
the middle.  Also note that you'd need to be using the very latest 2.6.4 from
CVS (or the upcoming 2.6.4pre2) on BOTH ends of the transfer for any
reproduction to be noteworthy.  Any other scenario does not have the keep-alive fix.