Bug 2628 - hangs indefinitely at start of phase 1
Summary: hangs indefinitely at start of phase 1
Status: RESOLVED WORKSFORME
Alias: None
Product: rsync
Classification: Unclassified
Component: core (show other bugs)
Version: 2.6.4
Hardware: x86 Linux
: P3 major (vote)
Target Milestone: ---
Assignee: Wayne Davison
QA Contact: Rsync QA Contact
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2005-04-19 14:15 UTC by Paolo
Modified: 2008-05-01 09:37 UTC (History)
1 user (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Paolo 2005-04-19 14:15:30 UTC
hello,

I'm using (or trying to use) rsync over an openvpn(8) tunnel over an ADSL link
to sync 2 repositories.
On one end there's a Slackware 10.1, on the other Debian 2.2 and 3.x.
I was using distro-provided rsync versions, then grabbed 2.6.4, then latest CVS
20050418, but still get the problem. I thought it was related to the 
-v + long list bug mentioned in 2.6.4 rel. notes, but seems not so.

What happens is, that rsync just hangs while - I think - transferring dir
listing, till a cron process kills it. 
Seems it doesn't hang always on same point.

I've tried rsync --demon --no-detach --bwlimit 20 -vvv (w/ and w/out bwlimit)
but that issues no log at all, despite -v.
So I run rsync by strace -f, and it just sit on a 

select(6, [5], ...) 

while the client, run with -vvv stops listing files on a line with

generate_files phase=1

and hangs there forever; adding --timeout 90 it exit that lock with

io timeout after 120 seconds -- exiting
rsync error: timeout in data send/receive (code 30) at io.c(181)
rsync: connection unexpectedly closed (11628 bytes received so far) [generator]
rsync error: error in rsync protocol data stream (code 12) at io.c(420)

I kept the demon 2.6.4 while the client is currently the CVS version mentioned.

Any workaround? something I'm missing?

thanks
-- paolo
Comment 1 Wayne Davison 2005-04-20 01:11:31 UTC
You didn't mention the client command you're running.

As for the log-data output by a --daemon, did you look in the appropriate syslog
file? or check the rsyncd.conf file to see if it is logging to a file?  You can
also use the "max verbosity" setting to increase the level of verbosity that the
daemon will allow a client to request (since it affects the log size, it
defaults to 1).
Comment 2 Paolo 2005-04-20 07:56:35 UTC
yup, sorry, here it is:

rsync -rlptD -v -u --delete --timeout=90 \
  --exclude-from=/etc/rsync/rsync-exc rsync://server/remotedir /localdir/

I'm using --timeout now to at least let it go on on its own.

I don't have log options in rsyncd.conf. Anyway, I'd expect --no-detach to
enable  logging to console as well. In syslog I see only:

rsync error: timeout in data send/receive (code 30) at io.c(181)

Comment 3 Wayne Davison 2005-04-30 03:17:21 UTC
Q: Does changing the level of verbosity get rid of the hang?  (You are right
that this is not the same hang as the -vvv bug mention in the NEWS file because
that would only occur when pushing files, not pulling.)

Q: What does netstat show for the state of the connection on each host?

Q: What system call is each of the 3 processes stuck on?  (Use strace, truss, or
whatever is appropriate for your OS.)
Comment 4 Paolo 2005-05-11 07:21:56 UTC
sorry 4 late reply, I had to revert back to rsync setup for doing the test.

here's what netstat -tpn|grep 873 shows...

tcp        0   2347 192.168.10.254:1739     192.168.0.2:873        ESTABLISHED 
19437/rsync         
tcp        0   4053 192.168.10.254:1329     192.168.0.2:873        LAST_ACK    
-                   
tcp        0   2588 192.168.10.254:1049     192.168.0.2:873        LAST_ACK    
-

... when client is stuck in ...


[pid 19437] select(4, NULL, [3], NULL, {60, 0}) = 1 (out [3], left {60, 0})
[pid 19437] write(3, "\5\1\0\0\20\0\7\1\0\0\20\0000\1\0\0\20\0001\1\0\0\20\000".
.., 3700) = 3700
[pid 19437] time(NULL)                  = 1115820917
[pid 19437] select(5, [4], [], NULL, {60, 0} <unfinished ...>
[pid 19470] <... select resumed> )      = 1 (in [3], left {58, 270000})
[pid 19470] read(3, "\230\1\0\7", 4)    = 4
[pid 19470] select(4, [3], [], NULL, {60, 0}) = 1 (in [3], left {60, 0})
[pid 19470] read(3, "\5\1\0\0\20\0\7\1\0\0\20\0000\1\0\0\20\0001\1\0\0\20\000"..
., 408) = 408
[pid 19470] select(4, [3], [], NULL, {60, 0} <unfinished ...>
[pid 19437] <... select resumed> )      = 0 (Timeout)
[pid 19437] select(5, [4], [], NULL, {60, 0} <unfinished ...>
[pid 19470] <... select resumed> )      = 0 (Timeout)
[pid 19470] select(4, [3], [], NULL, {60, 0}

... and correspondingly server is waiting on:


[pid 23453] gettimeofday({1115820907, 675213}, NULL) = 0
[pid 23453] select(6, [5], [], NULL, {60, 0}) = 1 (in [5], left {54, 470000})
[pid 23453] read(5, "\2\0\0\0\20\0B\0\0\0\20\0d\0\0\0\20\0m\0\0\0\20\0n\0\0"...,
8184) = 210
[pid 23453] select(6, NULL, [5], NULL, {60, 0}) = 1 (out [5], left {60, 0})
[pid 23453] write(5,
"\314\0\0\7\2\0\0\0\20\0B\0\0\0\20\0d\0\0\0\20\0m\0\0\0"..., 208) = 208
[pid 23453] gettimeofday({1115820913, 203228}, NULL) = 0
[pid 23453] select(6, NULL, [5], NULL, {60, 0}) = 1 (out [5], left {60, 0})
[pid 23453] write(5, "\6\0\0\7\263\0\0\0\20\0", 10) = 10
[pid 23453] gettimeofday({1115820913, 205305}, NULL) = 0
[pid 23453] select(6, [5], [], NULL, {60, 0}) = 1 (in [5], left {58, 120000})
[pid 23453] read(5,
"\5\1\0\0\20\0\7\1\0\0\20\0000\1\0\0\20\0001\1\0\0\20\000"..., 8184) = 1353
[pid 23453] select(6, NULL, [5], NULL, {60, 0}) = 1 (out [5], left {60, 0})
[pid 23453] write(5,
"\230\1\0\7\5\1\0\0\20\0\7\1\0\0\20\0000\1\0\0\20\0001\1"..., 412) = 412
[pid 23453] gettimeofday({1115820915, 84990}, NULL) = 0
[pid 23453] select(6, [5], [], NULL, {60, 0}) = 0 (Timeout)
[pid 23453] select(6, [5], [], NULL, {60, 0}


-- paolo
Comment 5 Paolo 2005-05-11 07:32:15 UTC
in prev. msg forgot to mention that the number of -v seems to have no effect 
- actually that test was run with not a single -v on either side; and, btw, as
noted -v... seems to have no effect at all with --daemon.

-- paolo
Comment 6 Wayne Davison 2005-05-11 11:41:48 UTC
Thanks for the extra info.  The system-call traces show that the receiver is
waiting to read data from the sender, the sender is waiting to read data from
the generator, and the generator is waiting for data from the receiver. 
Combined with your initial report, this would place the code in the generator at
the spot where it had just finished sending all checksum data to the sender,
sent a -1 (for the end of phase 0), and is now waiting around for redo numbers
or the next end-of-phase message from the receiver.

So, the question is where did some data get lost?  The call to get_redo_num()
should have flushed all the buffered output data that the generator had, so it
would be good to verify that the last buffer of data (the 3700 bytes in that
run) contained the 4 bytes of 0xFF at the end.  If so, what is the sender
waiting for?  Did it not receive the data?  Alternately, if the 3700 bytes of
data did not have the -1, why didn't all the generator's data flush?

A comment in your initial report said that you started out using distro-provided
rsync versions -- that makes me wonder what versions of rsync were exhibiting
the hang?  Is there any chance that the openvpn tunnel is failing to flush?  Or
losing data?
Comment 7 Paolo 2005-05-14 17:14:10 UTC
The rsync link started with rsync as in Slackware 9.0, then I kept up with
distro rels and later with cvs, locally compiling werever needed (one end runs
Debian 2.2 - Potato).

Yes, I first looked at openvpn, upgraded it to latest stable (2.0), but nothing
changed. The puzzling thing was that other services are running fine over the 
link - http, imap, smtp.

At first I didn't question the kernel on endpoint machines... to 'patch' the 
link keeping rsync I relaied the traffic through my home notebook:

==[A]-----------(I'net:openvpn:UDP broken for rsync)-----------B==
    \                                                         /
     --(I'net:vtund:TCP rsync)--[C]--(I'net:vtund:TCP rsync)--

and that was just fine. So endpoints and openvpn seemed ok.

But I had a suspect... kernel on endpoints was 2.4.27 - I seem to recall 
there was a bug (fixed in 2.4.28) in datagram generator (missing counter,
something)... which perhaps over long periods screws up something in the 
kernel - whatever, fact is that relais VPN with vtund over UDP did not work 
while over TCP did work fine.

Then eventually I upgraded kernel on B and rebooted, but rsync still hanged
over openvpn:UDP. Finally upgraded/rebooted A as well, and now rsync over
that direct openvpn:UDP works flawlessly, as well as the other services.

So, my problem seems solved ... so far :].

I'm not 100% sure it was the kernel / that UDP bug to blame, as other 
services did run ok anyway, but I would not know were else to look.

Indeed, endpoints with 2.4.31-pre1 + openvpn:UDP are up since only ~70h by 
now, while the problem started after nearly 7 months of continuous operation.
(rsync sequences are fired once every 15')

So it might be too early to say a definitve word, but if you did not spot
anything suspicious in rsync code - or conversely, from above you can exclude
for sure a glitch in rsync, I'd rather change status to LATER or REMIND
(whatever that means, since they're not defined in Bug's Life - I'm just 
guessing here). 
Comment 8 Wayne Davison 2005-05-16 09:33:53 UTC
Since this problem appears to have been caused by a bug in the OS, I'm closing
this bug report.
Comment 9 Ludwig Schwardt 2005-05-16 11:19:35 UTC
Hi,

I have also been bitten by this deadlock bug. Some information on the systems in
question:

client side (receiving files):
rsync 2.6.4
kernel 2.4.25
up-to-date Debian sarge

server side (sending files):
rsync 2.6.3
kernel 2.4.23
Debian sarge (slightly older)

I am rsync'ing over an SSH tunnel from within a Python script. The bug is
completely repeatable, appearing each time I ran the script in question. The
rsync client command in the script is:

rsync -avuz --timeout=300 --rsh="ssh -p 5555 -l root" localhost::aa/*.pt1 .

When I add "-vvv", the process hangs at "generate_files phase=1".

Some funny stuff I tried:

- I created the SSH tunnel from the command line, and typed the above command
(except I used a different target directory). It worked.

- I reran the Python script (which sets up an identical SSH tunnel and rsyncs
based on os.system()). It didn't work.

- I could alternate between the above two scenarios ad infinitum.

- I edited the script and replaced "*.pt1" with "2005-05-16.pt1" (a file in the
same glob that didn't yet exist on the client side). It worked.

- When I changed it back to "*.pt1", it didn't work again.

- In the script I had another rsync line just below the one given above, with
the only difference that it copied "*.wm1" files instead of "*.pt1". I swapped
the two lines, and it worked (it copied the *.wm1 files).

- When I switched the two lines back, the bug went away (Damn!). Now the "*.pt1"
files were also copied.

I couldn't yet make the bug reappear again, which is kinda frustrating :-)

Hope this helps in some way.
Ludwig

P.S. I've seen similar reports on the mailing list. Just google for 'rsync hangs
"generate_files phase=1"'.
Comment 10 Wayne Davison 2005-05-16 12:11:05 UTC
The command-line you give contains "localhost", which conflicts with the
description of two hosts being involved.

I can't help unless the hang is reproducable, so you may wish to try tweaking
some of the destination files and see if you can get it to happen again.
Comment 11 Paolo 2005-05-16 14:53:13 UTC
You're using 2.6.3 on one side, with -vvv. Perhaps you've been bitten by the
very --verbose bug mentioned in BUGFIX in 2.6.4 rel.notes.
Therefore, I'd rather upgrade to at least 2.6.4 and see what happens.
BTW it's not that strange you don't see that bug anymore, after you succeded 
rsyncing once somehow - you've moved the balls, so the game is a bit 
different now...
 
Comment 12 Mijail 2008-05-01 06:45:00 UTC
Bitten here by what seems to be the same bug.
Using rsync 3.0.2 in both Windows 2003 (cwRsync, sender, remote) and OS X 10.5.2 (receiver, local).
Command line in OS X: rsync -avhiz --delete --stats -P -v -v winhost:/remotedir/ localdir

But I have managed to work around the problem by specifying just one or two files in the remote filespec. That is, specifying for example winhost:/remotedir/file{1,2}.rar. If I made that *.rar it won't work either. (the remote dir also contains files not covered with the *.rar).

Problem seems independent of using -v, -vv and -vvv. When using the -vvv, the last line *sometimes* is "generate_files phase=1", but that seems to depend on the number of files asked for. With 3 to say 10 files, the last line usually is something like "match at 16356128 last_match=16356128 j=1633 len=10016 n=0", or rsync's progress display. The larger the number of files, the more probable to get the "generate_files" the line. With 10 files it usually is the last one displayed.

Some more context:
--the connection uses cygwin's opensshd
--it is running over openvpn
--when rsync hangs/deadlocks and I ctl-C it, the rsync process in the Windows host remains there, doing nothing. It will not be killed by a normal cygwin kill $pid, but they do die with a kill -SIGKILL $pid. In contrast, when I ctl-C the local rsync while it is working OK, the Windows process goes away.
--while rsync is hung/deadlocked, if I kill -SIGKILL the remote windows process, it disappears from Windows' task manager, but the local rsync progress display shows a surge of activity which lasts for about 20 seconds, together with about 2 MBytes worth of received network traffic (so maxing out my 100KBps line). After that, the local rsync complains about connection unexpectedly closed and exits.
--I can see no differences between using rsync 3.0.2 and 2.6.9 (cygwin's official) in the Windows side.
Comment 13 Mijail 2008-05-01 07:12:03 UTC
Some more data. Right now, while using a remote filespec with 2 files, got rsync seemingly started but the progress display got half-overwritten with the dreaded "generate_files phase=1", and rsync hung.
A new trial with the same command line hung showing just the progress line.

So, going back to single-file rsync's...
Comment 14 Wayne Davison 2008-05-01 09:37:29 UTC
Mijail, you're specifing a ssh-using transfer, unlike the others, who were using a daemon transfer.  I have never heard that cygwin has fixed their pipe code, so the hang is undoubtedly on the cygwin side where the pipe between rsync and ssh isn't working right.