hello, I'm using (or trying to use) rsync over an openvpn(8) tunnel over an ADSL link to sync 2 repositories. On one end there's a Slackware 10.1, on the other Debian 2.2 and 3.x. I was using distro-provided rsync versions, then grabbed 2.6.4, then latest CVS 20050418, but still get the problem. I thought it was related to the -v + long list bug mentioned in 2.6.4 rel. notes, but seems not so. What happens is, that rsync just hangs while - I think - transferring dir listing, till a cron process kills it. Seems it doesn't hang always on same point. I've tried rsync --demon --no-detach --bwlimit 20 -vvv (w/ and w/out bwlimit) but that issues no log at all, despite -v. So I run rsync by strace -f, and it just sit on a select(6, [5], ...) while the client, run with -vvv stops listing files on a line with generate_files phase=1 and hangs there forever; adding --timeout 90 it exit that lock with io timeout after 120 seconds -- exiting rsync error: timeout in data send/receive (code 30) at io.c(181) rsync: connection unexpectedly closed (11628 bytes received so far) [generator] rsync error: error in rsync protocol data stream (code 12) at io.c(420) I kept the demon 2.6.4 while the client is currently the CVS version mentioned. Any workaround? something I'm missing? thanks -- paolo
You didn't mention the client command you're running. As for the log-data output by a --daemon, did you look in the appropriate syslog file? or check the rsyncd.conf file to see if it is logging to a file? You can also use the "max verbosity" setting to increase the level of verbosity that the daemon will allow a client to request (since it affects the log size, it defaults to 1).
yup, sorry, here it is: rsync -rlptD -v -u --delete --timeout=90 \ --exclude-from=/etc/rsync/rsync-exc rsync://server/remotedir /localdir/ I'm using --timeout now to at least let it go on on its own. I don't have log options in rsyncd.conf. Anyway, I'd expect --no-detach to enable logging to console as well. In syslog I see only: rsync error: timeout in data send/receive (code 30) at io.c(181)
Q: Does changing the level of verbosity get rid of the hang? (You are right that this is not the same hang as the -vvv bug mention in the NEWS file because that would only occur when pushing files, not pulling.) Q: What does netstat show for the state of the connection on each host? Q: What system call is each of the 3 processes stuck on? (Use strace, truss, or whatever is appropriate for your OS.)
sorry 4 late reply, I had to revert back to rsync setup for doing the test. here's what netstat -tpn|grep 873 shows... tcp 0 2347 192.168.10.254:1739 192.168.0.2:873 ESTABLISHED 19437/rsync tcp 0 4053 192.168.10.254:1329 192.168.0.2:873 LAST_ACK - tcp 0 2588 192.168.10.254:1049 192.168.0.2:873 LAST_ACK - ... when client is stuck in ... [pid 19437] select(4, NULL, [3], NULL, {60, 0}) = 1 (out [3], left {60, 0}) [pid 19437] write(3, "\5\1\0\0\20\0\7\1\0\0\20\0000\1\0\0\20\0001\1\0\0\20\000". .., 3700) = 3700 [pid 19437] time(NULL) = 1115820917 [pid 19437] select(5, [4], [], NULL, {60, 0} <unfinished ...> [pid 19470] <... select resumed> ) = 1 (in [3], left {58, 270000}) [pid 19470] read(3, "\230\1\0\7", 4) = 4 [pid 19470] select(4, [3], [], NULL, {60, 0}) = 1 (in [3], left {60, 0}) [pid 19470] read(3, "\5\1\0\0\20\0\7\1\0\0\20\0000\1\0\0\20\0001\1\0\0\20\000".. ., 408) = 408 [pid 19470] select(4, [3], [], NULL, {60, 0} <unfinished ...> [pid 19437] <... select resumed> ) = 0 (Timeout) [pid 19437] select(5, [4], [], NULL, {60, 0} <unfinished ...> [pid 19470] <... select resumed> ) = 0 (Timeout) [pid 19470] select(4, [3], [], NULL, {60, 0} ... and correspondingly server is waiting on: [pid 23453] gettimeofday({1115820907, 675213}, NULL) = 0 [pid 23453] select(6, [5], [], NULL, {60, 0}) = 1 (in [5], left {54, 470000}) [pid 23453] read(5, "\2\0\0\0\20\0B\0\0\0\20\0d\0\0\0\20\0m\0\0\0\20\0n\0\0"..., 8184) = 210 [pid 23453] select(6, NULL, [5], NULL, {60, 0}) = 1 (out [5], left {60, 0}) [pid 23453] write(5, "\314\0\0\7\2\0\0\0\20\0B\0\0\0\20\0d\0\0\0\20\0m\0\0\0"..., 208) = 208 [pid 23453] gettimeofday({1115820913, 203228}, NULL) = 0 [pid 23453] select(6, NULL, [5], NULL, {60, 0}) = 1 (out [5], left {60, 0}) [pid 23453] write(5, "\6\0\0\7\263\0\0\0\20\0", 10) = 10 [pid 23453] gettimeofday({1115820913, 205305}, NULL) = 0 [pid 23453] select(6, [5], [], NULL, {60, 0}) = 1 (in [5], left {58, 120000}) [pid 23453] read(5, "\5\1\0\0\20\0\7\1\0\0\20\0000\1\0\0\20\0001\1\0\0\20\000"..., 8184) = 1353 [pid 23453] select(6, NULL, [5], NULL, {60, 0}) = 1 (out [5], left {60, 0}) [pid 23453] write(5, "\230\1\0\7\5\1\0\0\20\0\7\1\0\0\20\0000\1\0\0\20\0001\1"..., 412) = 412 [pid 23453] gettimeofday({1115820915, 84990}, NULL) = 0 [pid 23453] select(6, [5], [], NULL, {60, 0}) = 0 (Timeout) [pid 23453] select(6, [5], [], NULL, {60, 0} -- paolo
in prev. msg forgot to mention that the number of -v seems to have no effect - actually that test was run with not a single -v on either side; and, btw, as noted -v... seems to have no effect at all with --daemon. -- paolo
Thanks for the extra info. The system-call traces show that the receiver is waiting to read data from the sender, the sender is waiting to read data from the generator, and the generator is waiting for data from the receiver. Combined with your initial report, this would place the code in the generator at the spot where it had just finished sending all checksum data to the sender, sent a -1 (for the end of phase 0), and is now waiting around for redo numbers or the next end-of-phase message from the receiver. So, the question is where did some data get lost? The call to get_redo_num() should have flushed all the buffered output data that the generator had, so it would be good to verify that the last buffer of data (the 3700 bytes in that run) contained the 4 bytes of 0xFF at the end. If so, what is the sender waiting for? Did it not receive the data? Alternately, if the 3700 bytes of data did not have the -1, why didn't all the generator's data flush? A comment in your initial report said that you started out using distro-provided rsync versions -- that makes me wonder what versions of rsync were exhibiting the hang? Is there any chance that the openvpn tunnel is failing to flush? Or losing data?
The rsync link started with rsync as in Slackware 9.0, then I kept up with distro rels and later with cvs, locally compiling werever needed (one end runs Debian 2.2 - Potato). Yes, I first looked at openvpn, upgraded it to latest stable (2.0), but nothing changed. The puzzling thing was that other services are running fine over the link - http, imap, smtp. At first I didn't question the kernel on endpoint machines... to 'patch' the link keeping rsync I relaied the traffic through my home notebook: ==[A]-----------(I'net:openvpn:UDP broken for rsync)-----------B== \ / --(I'net:vtund:TCP rsync)--[C]--(I'net:vtund:TCP rsync)-- and that was just fine. So endpoints and openvpn seemed ok. But I had a suspect... kernel on endpoints was 2.4.27 - I seem to recall there was a bug (fixed in 2.4.28) in datagram generator (missing counter, something)... which perhaps over long periods screws up something in the kernel - whatever, fact is that relais VPN with vtund over UDP did not work while over TCP did work fine. Then eventually I upgraded kernel on B and rebooted, but rsync still hanged over openvpn:UDP. Finally upgraded/rebooted A as well, and now rsync over that direct openvpn:UDP works flawlessly, as well as the other services. So, my problem seems solved ... so far :]. I'm not 100% sure it was the kernel / that UDP bug to blame, as other services did run ok anyway, but I would not know were else to look. Indeed, endpoints with 2.4.31-pre1 + openvpn:UDP are up since only ~70h by now, while the problem started after nearly 7 months of continuous operation. (rsync sequences are fired once every 15') So it might be too early to say a definitve word, but if you did not spot anything suspicious in rsync code - or conversely, from above you can exclude for sure a glitch in rsync, I'd rather change status to LATER or REMIND (whatever that means, since they're not defined in Bug's Life - I'm just guessing here).
Since this problem appears to have been caused by a bug in the OS, I'm closing this bug report.
Hi, I have also been bitten by this deadlock bug. Some information on the systems in question: client side (receiving files): rsync 2.6.4 kernel 2.4.25 up-to-date Debian sarge server side (sending files): rsync 2.6.3 kernel 2.4.23 Debian sarge (slightly older) I am rsync'ing over an SSH tunnel from within a Python script. The bug is completely repeatable, appearing each time I ran the script in question. The rsync client command in the script is: rsync -avuz --timeout=300 --rsh="ssh -p 5555 -l root" localhost::aa/*.pt1 . When I add "-vvv", the process hangs at "generate_files phase=1". Some funny stuff I tried: - I created the SSH tunnel from the command line, and typed the above command (except I used a different target directory). It worked. - I reran the Python script (which sets up an identical SSH tunnel and rsyncs based on os.system()). It didn't work. - I could alternate between the above two scenarios ad infinitum. - I edited the script and replaced "*.pt1" with "2005-05-16.pt1" (a file in the same glob that didn't yet exist on the client side). It worked. - When I changed it back to "*.pt1", it didn't work again. - In the script I had another rsync line just below the one given above, with the only difference that it copied "*.wm1" files instead of "*.pt1". I swapped the two lines, and it worked (it copied the *.wm1 files). - When I switched the two lines back, the bug went away (Damn!). Now the "*.pt1" files were also copied. I couldn't yet make the bug reappear again, which is kinda frustrating :-) Hope this helps in some way. Ludwig P.S. I've seen similar reports on the mailing list. Just google for 'rsync hangs "generate_files phase=1"'.
The command-line you give contains "localhost", which conflicts with the description of two hosts being involved. I can't help unless the hang is reproducable, so you may wish to try tweaking some of the destination files and see if you can get it to happen again.
You're using 2.6.3 on one side, with -vvv. Perhaps you've been bitten by the very --verbose bug mentioned in BUGFIX in 2.6.4 rel.notes. Therefore, I'd rather upgrade to at least 2.6.4 and see what happens. BTW it's not that strange you don't see that bug anymore, after you succeded rsyncing once somehow - you've moved the balls, so the game is a bit different now...
Bitten here by what seems to be the same bug. Using rsync 3.0.2 in both Windows 2003 (cwRsync, sender, remote) and OS X 10.5.2 (receiver, local). Command line in OS X: rsync -avhiz --delete --stats -P -v -v winhost:/remotedir/ localdir But I have managed to work around the problem by specifying just one or two files in the remote filespec. That is, specifying for example winhost:/remotedir/file{1,2}.rar. If I made that *.rar it won't work either. (the remote dir also contains files not covered with the *.rar). Problem seems independent of using -v, -vv and -vvv. When using the -vvv, the last line *sometimes* is "generate_files phase=1", but that seems to depend on the number of files asked for. With 3 to say 10 files, the last line usually is something like "match at 16356128 last_match=16356128 j=1633 len=10016 n=0", or rsync's progress display. The larger the number of files, the more probable to get the "generate_files" the line. With 10 files it usually is the last one displayed. Some more context: --the connection uses cygwin's opensshd --it is running over openvpn --when rsync hangs/deadlocks and I ctl-C it, the rsync process in the Windows host remains there, doing nothing. It will not be killed by a normal cygwin kill $pid, but they do die with a kill -SIGKILL $pid. In contrast, when I ctl-C the local rsync while it is working OK, the Windows process goes away. --while rsync is hung/deadlocked, if I kill -SIGKILL the remote windows process, it disappears from Windows' task manager, but the local rsync progress display shows a surge of activity which lasts for about 20 seconds, together with about 2 MBytes worth of received network traffic (so maxing out my 100KBps line). After that, the local rsync complains about connection unexpectedly closed and exits. --I can see no differences between using rsync 3.0.2 and 2.6.9 (cygwin's official) in the Windows side.
Some more data. Right now, while using a remote filespec with 2 files, got rsync seemingly started but the progress display got half-overwritten with the dreaded "generate_files phase=1", and rsync hung. A new trial with the same command line hung showing just the progress line. So, going back to single-file rsync's...
Mijail, you're specifing a ssh-using transfer, unlike the others, who were using a daemon transfer. I have never heard that cygwin has fixed their pipe code, so the hang is undoubtedly on the cygwin side where the pipe between rsync and ssh isn't working right.