Bug 2654 - timeout is always triggered with 2.6.4
Summary: timeout is always triggered with 2.6.4
Status: CLOSED FIXED
Alias: None
Product: rsync
Classification: Unclassified
Component: core
Version: 2.6.4
Hardware: All
OS: Windows XP
Importance: P3 normal
Target Milestone: ---
Assignee: Wayne Davison
QA Contact: Rsync QA Contact
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2005-04-26 15:07 UTC by Dr. Uwe Meyer-Gruhl
Modified: 2005-06-03 01:39 UTC

See Also:


Description Dr. Uwe Meyer-Gruhl 2005-04-26 15:07:34 UTC
When transferring between an rsync 2.6.4 client and daemon with the
"--timeout" option given, the timeout is triggered despite the fact that data
is still being transferred. This error did not occur with 2.6.3.

I assume this has to do with the check_timeout() routine:

        if (!io_timeout || ignore_timeout)
                return;

        if (!last_io) {
                last_io = time(NULL);
                return;
        }

The ignore_timeout variable (new as of 2.6.4) is changed in two places in
generator.c. I think this has the ill side effect that last_io is no longer
updated, so that when ignore_timeout is reset, the timeout is triggered.
Comment 1 Wayne Davison 2005-04-27 02:21:24 UTC
The generator's setting of the ignore_timeout variable is what allows the I/O
code to continue updating last_io (the old method of clearing io_timeout in the
generator used to leave last_io unchanged, which is no longer desirable).

How short is your timeout?  Are you using a delete option?  I tried some tests,
and couldn't duplicate the problem, so I'll need more details.
Comment 2 Dr. Uwe Meyer-Gruhl 2005-04-27 02:54:35 UTC
(In reply to comment #1)
> The generator's setting of the ignore_timeout variable is what allows the I/O
> code to continue updating last_io (the old method of clearing io_timeout in the
> generator used to leave last_io unchanged, which is no longer desirable).
> 

Yup, I was wrong with my first guess. I tried modifying the source to reset
last_io when ignore_timeout is set to 1 and the error did not go away. last_io
DOES get modified, but there is more than one last_io, as I found out.

> How short is your timeout?  Are you using a delete option?  I tried some tests,
> and couldn't duplicate the problem, so I'll need more details.

The timeout was big enough (3600), there was no delete, but a fuzzy option was
specified. The files transferred were huge (800 MByte over a 512 kbit/s link)
and had been changed only in the last portion (data was appended). The timeout
value did not matter, BTW: when I used 300 (and 1200) seconds, the timeout
occurred shortly after that time, too.

To isolate the problem, I added rprintf() calls to the statements where last_io
is being modified. Since there are always two rsync client processes, each
holding its own instance of last_io, it looks as though, with my kind of files,
there were long sequences of read_timeout() activity that were not interspersed
with writefd_unbuffered() calls.

I could see that last_io was set in read_timeout() over and over again and that
when check_timeout() was getting called, its instance of last_io (which
definitely had been set by writefd_unbuffered, as I could glean from the value)
was outdated.

I think this has to do with these statements, which I don't completely understand:

read_timeout():

                if (io_timeout && fd == sock_f_in)
                        last_io = time(NULL);

writefd_unbuffered():

                if (fd == sock_f_out) {
                        if (io_timeout)
                                last_io = time(NULL);
                        sleep_for_bwlimit(ret);
                }

Each "side" (i.e. process) seems to watch only one direction of the data flow.
It seems that, since both sides are separate, overly long pauses can arise in
special situations (such as mine), because data flows unidirectionally in the
direction that is "unwatched" by that particular process and its last_io
instance.
Comment 3 Dr. Uwe Meyer-Gruhl 2005-04-27 02:58:26 UTC
(In reply to comment #1)
Oh, and one more thing: I am no longer sure that the bug did not exist in
pre-2.6.4 versions. I did not see it before, but I had a mass change with the
side effect of the data being appended to my files.
Comment 4 Wayne Davison 2005-04-27 08:39:42 UTC
You mentioned using --fuzzy:  are there a lot of files in your destination
directories?  One thing I've noticed with --fuzzy is that if rsync puts a lot of
new files into a directory with thousands of old files, the name-scoring
algorithm can chew up a lot of CPU.  It might be possible for the connection to
time out if this slowed rsync down enough.  (The --fuzzy performance into large
dirs should be improved at some point.)

As for who counts what, the generator only sends data down the socket, so it
only notes that data in the last_io time.  The receiver only reads data from the
socket, so it only notes that data in the last_io time.  The sender both reads
and writes data from/to the socket, so it notes both.
Comment 5 Dr. Uwe Meyer-Gruhl 2005-04-27 09:26:57 UTC
(In reply to comment #4)

About --fuzzy: I doubt that this was the case. There are only up to 50 files in
each directory (and from what I understood, the fuzzy matching takes place only
in the file's own directory). Also, there was ongoing socket activity a few
seconds before the actual timeout occurred. I saw that last_io was modified
every second or so in read_timeout() while the received file got bigger. Then
at some point, check_timeout() failed with an older last_io time. That
timestamp matched that of the last modification in writefd_unbuffered().

> 
> As for who counts what, the generator only sends data down the socket, so it
> only notes that data in the last_io time.  The receiver only reads data from the
> socket, so it only notes that data in the last_io time.  The sender both reads
> and writes data from/to the socket, so it notes both.

Yes. The sender was an rsync daemon in my case, whereas the client was separated
into a generator and a receiver. Could that lead to a problem when only binary
data (i.e. no deltas) is being sent (or the opposite, where just copying on the
receiving side was necessary)? Is there a possibility that the receiver is busy
(updating its last_io all the time) while the generator sits and waits for the
next file to be processed, or a select() timeout occurs, and the generator then
runs into the timeout in check_timeout() because the last time that
writefd_unbuffered() updated its last_io was too long ago?

This would mean that, effectively, --timeout ought to be large enough to
accommodate not only pauses arising from directory scans, checksum calculations
or fuzzy matching, but even data transfer itself if no (or maybe only) deltas
are present. So --timeout must in fact be longer than the expected transfer
time of the largest file over a given connection in the worst case.

I would expect --timeout to be a kind of "inactivity" counter to recognize
stalled IP connections. But it seems that by separating the generator and
receiver, there can be LONG pauses in one direction which trigger the timeout,
although the connection is alive and well in the other direction (but that fact
goes unnoticed).

I guess that in order to enable such behaviour, the condition "io_timeout && fd
== sock_f_in" respectively "io_timeout && fd == sock_f_out" should read
something like "io_timeout && (fd == sock_f_in || fd == sock_f_out)" in order
to catch "ACK" packets? If the receiver and generator see only one direction of
the same socket, however, only a kind of shared-memory last_io variable could
do the trick.
Comment 6 Wayne Davison 2005-04-27 10:47:32 UTC
What may have happened is that the outgoing socket from the generator may have
filled up completely, and the timeout on outputting new data then got exceeded
waiting for the other two processes to clear out enough space to write more.  I
think my decision to partially enable timeouts in the generator now that we're
generating ACK packets was not the right choice.  Try changing this code:

        if (protocol_version < 29)
                ignore_timeout = 1;

to remove the "if" and I think the timeout will go away.
Comment 7 Dr. Uwe Meyer-Gruhl 2005-04-27 14:08:18 UTC
(In reply to comment #6)
> generating ACK packets was not the right choice.  Try changing this code:
> 
>         if (protocol_version < 29)
>                 ignore_timeout = 1;
> 
> to remove the "if" and I think the timeout will go away.

That did it for me. No timeouts any more. So the bug has in fact been present
only since 2.6.4. I think it may have gone unnoticed until now because few
people synchronize that much data over semi-permanent connections (i.e. dynamic
IPs) where line drops have to be detected by the --timeout option.

Also, the up- and downstreams differ greatly (ADSL) in my setup, so your
assumptions about queues seem logical.

Thanks for the prompt fixing!
Comment 8 Wayne Davison 2005-04-27 15:59:02 UTC
Excellent!  I've checked in a fix for this (and a big comment about why we like
turning off the timeouts in the generator).