When transferring between an rsync 2.6.4 client and daemon with the "--timeout" option given, the timeout is triggered even though data is still being transferred. This error did not occur with 2.6.3. I assume this has to do with the check_timeout() routine:

	if (!io_timeout || ignore_timeout)
		return;

	if (!last_io) {
		last_io = time(NULL);
		return;
	}

The variable ignore_timeout (new as of 2.6.4) is changed in generator.c in two places. I think this has the ill side effect that last_io is no longer modified, so that when ignore_timeout is reset, the timeout is triggered.
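For reference, the rest of the routine (paraphrased from the 2.6.4 io.c as I read it; the exact message text may differ) just compares last_io against io_timeout and bails out:

	t = time(NULL);

	if (t - last_io >= io_timeout) {
		if (!am_server && !am_daemon) {
			rprintf(FERROR, "io timeout after %d seconds -- exiting\n",
				(int)(t - last_io));
		}
		exit_cleanup(RERR_TIMEOUT);
	}

So whether the timeout fires depends entirely on how recently last_io was refreshed.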
The generator's setting of the ignore_timeout variable is what allows the I/O code to continue updating last_io (the old method of clearing io_timeout in the generator used to leave last_io unchanged, which is no longer desirable). How short is your timeout? Are you using a delete option? I tried some tests and couldn't duplicate the problem, so I'll need more details.
(In reply to comment #1)
> The generator's setting of the ignore_timeout variable is what allows the I/O
> code to continue updating last_io (the old method of clearing io_timeout in the
> generator used to leave last_io unchanged, which is no longer desirable).

Yup, I was wrong with my first guess. I tried modifying the source to reset last_io when ignore_timeout is set to 1, and the error did not go away. last_io DOES get modified, but there is more than one last_io, as I found out.

> How short is your timeout? Are you using a delete option? I tried some tests,
> and couldn't duplicate the problem, so I'll need more details.

The timeout was big enough (3600), there was no delete option, but the fuzzy option was specified. The files transferred were huge (800 MByte over a 512 KBit/s link) and had been changed only in the last portion (data was appended). The timeout value did not matter, BTW: when I used 300 (and 1200) seconds, the timeout occurred shortly after that time, too.

To isolate the problem, I added rprintf() calls to the statements where last_io is being modified. Since there are always two rsync client processes, each holding its own instance of last_io, it looks like there were long sequences of read_timeout() activity that were not interspersed with writefd_unbuffered() calls, given my kind of files. I could see that last_io was set in read_timeout() over and over again, and that when check_timeout() was called, its instance of last_io (which had definitely been set by writefd_unbuffered(), as I could tell from the value) was outdated. I think this has to do with these statements, which I don't completely understand:

read_timeout():

	if (io_timeout && fd == sock_f_in)
		last_io = time(NULL);

writefd_unbuffered():

	if (fd == sock_f_out) {
		if (io_timeout)
			last_io = time(NULL);
		sleep_for_bwlimit(ret);
	}

Each "side" (i.e. process) seems to watch only one direction of the data flow? Since both sides are separate processes, overly long pauses can arise in special situations (such as mine), because data flows unidirectionally in the direction that is "unwatched" by the particular process and its last_io instance.
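To illustrate what I think is happening, here is a toy standalone program (not rsync code; the names only mirror the ones above): the generator's copy of last_io can go stale while the receiver's copy stays fresh, so the generator's check_timeout() fires even though the link is busy.

	/* Toy illustration, not rsync source: each forked process keeps its
	 * own last_io, so one copy can go stale while the other stays fresh. */
	#include <stdio.h>
	#include <time.h>

	int main(void)
	{
		int io_timeout = 300;                 /* as with --timeout=300     */
		time_t now = time(NULL);

		time_t receiver_last_io  = now;       /* refreshed on every read   */
		time_t generator_last_io = now - 400; /* no writes for 400 seconds */

		/* essentially what check_timeout() does in each process */
		if (now - generator_last_io >= io_timeout)
			printf("generator: io timeout after %ld seconds\n",
			       (long)(now - generator_last_io));

		if (now - receiver_last_io >= io_timeout)
			printf("receiver: io timeout\n");
		else
			printf("receiver: fine, data keeps arriving\n");

		return 0;
	}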
(In reply to comment #1) Oh, and one more thing: I am no longer sure that the bug did not exist in pre-2.6.4 versions. I did not see it before, but I had just done a mass change with the side effect of data being appended to my files.
You mentioned using --fuzzy: are there a lot of files in your destination directories? One thing I've noticed with --fuzzy is that if rsync puts a lot of new files into a directory with thousands of old files, the name-scoring algorithm can chew up a lot of CPU. It might be possible for the connection to time out if this slowed rsync down enough. (The --fuzzy performance into large dirs should be improved at some point.)

As for who counts what: the generator only sends data down the socket, so it only notes that data in the last_io time. The receiver only reads data from the socket, so it only notes that data in the last_io time. The sender both reads and writes data from/to the socket, so it notes both.
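In other words (matching the routines quoted in comment #2):

	process     reads socket   writes socket   refreshes its last_io in
	generator   no             yes             writefd_unbuffered()
	receiver    yes            no              read_timeout()
	sender      yes            yes             both routines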
(In reply to comment #4)

About --fuzzy: I doubt that this was the case. There are only up to 50 files in each directory (and from what I understood, the fuzzy matching takes place only in the file's own directory). Also, there was ongoing socket activity a few seconds before the actual timeout occurred. I saw that last_io was modified every second or so in read_timeout() while the received file got bigger. Then at some point, check_timeout() failed with an older last_io time. That timestamp matched that of the last modification in writefd_unbuffered().

> As for who counts what, the generator only sends data down the socket, so it
> only notes that data in the last_io time. The receiver only reads data from the
> socket, so it only notes that data in the last_io time. The sender both reads
> and writes data from/to the socket, so it notes both.

Yes. The sender was an rsync daemon in my case, whereas the client was separated into a generator and a receiver. Could that lead to a problem when only binary data (i.e. no deltas) is being sent (or the opposite case, where only copying on the receiving side is necessary)? Is there a possibility that the receiver is busy (modifying its last_io all the time) while the generator sits and waits for the next file to get processed, or a select() timeout occurs, and the generator then runs into the timeout in check_timeout() because the last time that writefd_unbuffered() modified its last_io was too long ago?

This would mean that effectively, --timeout ought to be large enough to accommodate not only pauses arising from directory scans, checksum calculations or fuzzy matching, but even the data transfer itself if no (or maybe only) deltas are present. So --timeout must in fact be longer than the expected worst-case transfer time of the largest file over a given connection. I would expect --timeout to be a kind of "inactivity" counter to recognize stalled IP connections. But it seems that by separating the generator and receiver, there can be LONG pauses in one direction which trigger the timeout, although the connection is alive and well in the other direction (but that fact goes unnoticed). I guess in order to enable such behaviour, the conditions "io_timeout && fd == sock_f_in" and "io_timeout && fd == sock_f_out" should read something like "io_timeout && (fd == sock_f_in || fd == sock_f_out)" in order to catch "ACK" packets? If the receiver and generator each see only one direction of the same socket, however, only a kind of shared-memory last_io variable could do that trick.
What may have happened is that the outgoing socket from the generator filled up completely, and the timeout on outputting new data was then exceeded while waiting for the other two processes to clear out enough space to write more. I think my decision to partially enable timeouts in the generator now that we're generating ACK packets was not the right choice. Try changing this code:

	if (protocol_version < 29)
		ignore_timeout = 1;

to remove the "if", and I think the timeout will go away.
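In patch form, that change is simply this (against generator.c as shipped in 2.6.4; your context may differ slightly):

	-	if (protocol_version < 29)
	-		ignore_timeout = 1;
	+	ignore_timeout = 1;

With ignore_timeout always set, check_timeout() returns early in the generator, so that process can never trigger the timeout on its own.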
(In reply to comment #6)
> generating ACK packets was not the right choice. Try changing this code:
>
>     if (protocol_version < 29)
>         ignore_timeout = 1;
>
> to remove the "if" and I think the timeout will go away.

That did it for me. No timeouts any more. So the bug has in fact only been present since 2.6.4. I think it may have gone unnoticed until now because few people synchronize this much data over semi-permanent connections (i.e. dynamic IPs), where line drops have to be detected by the --timeout option. Also, the up- and downstream rates differ greatly (ADSL) in my setup, so your assumptions about the queues seem logical. Thanks for the prompt fix!
Excellent! I've checked in a fix for this (and a big comment about why we like turning off the timeouts in the generator).