Bug 2208 - cygwin version of rsync sometimes hangs
cygwin version of rsync sometimes hangs
Status: CLOSED LATER
Product: rsync
Classification: Unclassified
Component: core
2.6.3
x86 Windows XP
: P3 normal
: ---
Assigned To: Wayne Davison
Rsync QA Contact
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2005-01-02 17:59 UTC by Michal Mühlpachr
Modified: 2008-10-02 12:46 UTC (History)
5 users (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Michal Mühlpachr 2005-01-02 18:00:01 UTC
client MS Win XP SP2 Cygwin 1.5.12 -> server Linux 2.4.27
both using rsync  version 2.6.3  protocol version 28

clinet command:
rsync.exe -r -t -v "/cygdrive/w/MuhlpachrM/My Documents/" michalm@emil::test

client output:
Password:
building file list ... done
./
Bluetooth/
Default.rdp
... some directories and files ...
Doc/Eurotel Internet.txt
Doc/Exchange_2003_TechNet.ppt
rsync: writefd_unbuffered failed to write 4 bytes: phase "unknown" [sender]:
Connection reset by peer (104)
rsync: read error: Connection reset by peer (104)
rsync error: error in rsync protocol data stream (code 12) at
/home/lapo/packaging/tmp/rsync-2.6.3/io.c(515)

But server do not hangs connection, it waits for input and then timeout ocur.
Server log:
2005/01/03 00:48:03 [23119] rsyncd version 2.6.3 starting, listening on port 873
2005/01/03 00:48:19 [23121] rsync to test from michalm@muhlpachr-n (192.168.0.3)
2005/01/03 00:48:20 [23121] ./
2005/01/03 00:48:20 [23121] Bluetooth/
... some directories and files ...
2005/01/03 00:48:21 [23121] recv muhlpachr-n [192.168.0.3] test (michalm)
Doc/Eurotel Internet.txt 800
2005/01/03 00:58:21 [23121] rsync error: timeout in data send/receive (code 30)
at io.c(153)
2005/01/03 00:58:21 [23121] rsync: connection unexpectedly closed (25858 bytes
received so far) [generator]
2005/01/03 00:58:21 [23121] rsync error: error in rsync protocol data stream
(code 12) at io.c(359)

Server strace (diffrent session with same client hang):
[pid 23247] write(3, "\6\0\0\tWRAP/\n", 10) = 10
[pid 23247] lstat64("WRAP/book.pdf", 0xbfffd810) = -1 ENOENT (No such file or
directory)
[pid 23247] lstat64("WRAP/tb13.pdf", 0xbfffd810) = -1 ENOENT (No such file or
directory)
[pid 23247] lstat64("WRAP/wrap1c.pdf", 0xbfffd810) = -1 ENOENT (No such file or
directory)
[pid 23247] lstat64("assistance.pdf", 0xbfffd810) = -1 ENOENT (No such file or
directory)
[pid 23247] lstat64("desktop.ini", 0xbfffd810) = -1 ENOENT (No such file or
directory)
[pid 23247] lstat64("map.bat", 0xbfffd810) = -1 ENOENT (No such file or directory)
[pid 23247] lstat64("předání konsolidace ECH FUBAR.doc", 0xbfffd810) = -1 ENOENT
(No such file or directory)
[pid 23247] lstat64("refman-5.0.pdf", 0xbfffd810) = -1 ENOENT (No such file or
directory)
[pid 23247] lstat64("reg.reg", 0xbfffd810) = -1 ENOENT (No such file or directory)
[pid 23247] lstat64("reg.txt", 0xbfffd810) = -1 ENOENT (No such file or directory)
[pid 23247] lstat64("unison-2.10.2-manual.html", 0xbfffd810) = -1 ENOENT (No
such file or directory)
[pid 23247] lstat64("unison-2.10.2-manual.pdf", 0xbfffd810) = -1 ENOENT (No such
file or directory)
[pid 23247] lstat64("valc-4.asx", 0xbfffd810) = -1 ENOENT (No such file or
directory)
[pid 23247] select(4, NULL, [3], NULL, {60, 0}) = 1 (out [3], left {60, 0})
[pid 23247] write(3,
"\10\1\0\7x\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0y\3\0\0"..., 268) = 268
[pid 23247] select(7, [6], [], NULL, {60, 0} <unfinished ...>
[pid 23248] <... select resumed> )      = 0 (Timeout)
[pid 23248] time(NULL)                  = 1104732886
[pid 23248] select(4, [3], [], NULL, {60, 0} <unfinished ...>
[pid 23247] <... select resumed> )      = 0 (Timeout)
[pid 23247] select(7, [6], [], NULL, {60, 0} <unfinished ...>
[pid 23248] <... select resumed> )      = 0 (Timeout)
[pid 23248] time(NULL)                  = 1104732946
... same select/time calls repeated ...
[pid 23248] select(4, [3], [], NULL, {60, 0} <unfinished ...>
[pid 23247] <... select resumed> )      = 0 (Timeout)
[pid 23247] select(7, [6], [], NULL, {60, 0} <unfinished ...>
[pid 23248] <... select resumed> )      = 0 (Timeout)
[pid 23248] time(NULL)                  = 1104733426
[pid 23248] rt_sigaction(SIGUSR1, {SIG_IGN}, {0x8052a70, [USR1],
SA_RESTORER|SA_RESTART, 0x4006a678}, 8) = 0
[pid 23248] rt_sigaction(SIGUSR2, {SIG_IGN}, {0x8052aa0, [USR2],
SA_RESTORER|SA_RESTART, 0x4006a678}, 8) = 0
[pid 23248] unlink("Directory/.ISE Czech organigramm.ppt.KiG34J") = 0
[pid 23248] getpid()                    = 23248
[pid 23248] write(7, "A\0\0\10rsync error: timeout in data"..., 69) = 69

Client exit approx 10 seconds after last file synchronized, server drop
connection and exit after timeout (600 seconds).
Same behaviour (same client error) ocurs while I am using ssh transport.
The file at which transfer hangs is diffrent every time, but the error 
is same.

Thanks for help and fix.
Comment 1 Wayne Davison 2005-01-02 18:37:23 UTC
The error you cite just indicates that the other end of the connection went
away. Your trace tells us that it went away because it timed out, so this looks
like it just boils down to the old issue of the cygwin version of rsync
sometimes hanging (though this is the first I've seen of someone encoutering the
hang when using daemon mode).  I've never been able to duplicate this hang, so
I've never been able to investigate it.

This is the first mention of this hang in bugzilla (previous discussion has been
confined to the mailing list), so this will be the official spot to help out
with the rsync-on-cygwin hanging problem.
Comment 2 Michal Mühlpachr 2005-01-03 03:42:04 UTC
Server side (other end of connection) went away more than 9 minutes later
than client, so server do not close TCP connection (it remains open until
server timeout - tested by netstat). But client reports TCP connection closed
nearly immeditely after last file reported. There is problem at client side,
correct me if I am wrong please.

This client hang can be reproduced anytime on file set I use 
(cca 500 MB of files). I am not able to finish any single rsync transfer.

At diffrent client PC which has same configuration - at least both clients PCs
was istalled from same sourcs and I do not see any diffrences (Win XP SP2, same
version of Cygwin) everything is OK and no hangs happend.

How can I help to investigate bug ?
Comment 3 Brian Mueller 2005-01-18 13:36:49 UTC
I am having a very similar problem.  Possibly my comments can crack this issue.

Using several different servers.  Slackware with rsync 2.6.3 and FreeBSD 5.2.1 
with rsync 2.6.3 and clients (several) running cygwin 1.5.12(0.116/4/2) on 
various versions of windows.

My testbed systems are Windows XP Pro SP2 and a Windows 2003 Server System.  I 
am trying to backup these systems to any one of the rsync servers.

My command on the client is:

C:\rsync\rsync.exe --compress --delete --ignore-errors --recursive --verbose --
times --modify-window=5 --stats --
progress '/cygdrive/C/' 'username@host.server.com::share_name/test_server/test_
set/C/'

What seems to happen is that transfer will stop on the client at some point in 
time (say 2:00am) and then the client and the server will both get errors 
about an hour later.  The client gets

rsync: writefd_unbuffered failed to write 16385 bytes: phase "unknown" 
[sender]: Connection reset by peer (104)
rsync: read error: Connection reset by peer (104)
rsync error: error in rsync protocol data stream (code 12) 
at /home/lapo/packaging/tmp/rsync-2.6.3/io.c(515)

The server gets...

rsync: writefd_unbuffered failed to write 43 bytes: phase "unknown" 
[generator]: Broken pipe (32)
rsync error: error in rsync protocol data stream (code 12) at io.c(915)

Looking at these it would seem that at some point the client and server simply 
stop communicating even though they are still connected.  Possibly a network 
hiccup happened in the middle of a transfer, didn't see anything obvious in 
minitoring the nextowrk.  They then enter this "unknown" state and while they 
are there they hang out for 1 hour or so - during this time they appear to 
think they are still connected.  After this time they both die with the above 
errors.

This is not specific to any fileset or any specific file.  It might happen 
after running for several days, or after a few minutes.
Comment 4 BlackB1rd 2005-01-19 03:00:42 UTC
Well I'm not having exactly the same problem, since my win32 clients does rsync
well, but sometimes (not always) hangs after it's finished. So it won't get back
to console after the additional status information has been printed.

I'm using the same rsync batch script on 8 different clients, and only at 2 of
them this problem (sometimes) occur. Unfortunately it's not reproducible on demand.
Comment 5 Bryan White 2005-01-24 20:53:59 UTC
I too am having this problem.  I had been working fine for years.  I just
upgraded to all new hardware.  I went from Win2k to WinXP and from what ever I
had before to current version of cygwin/rsync.

I use this in my edit/compile/test cycle.  The source tree is on the WinXP box.
 I run a script to rsync(on ssh) the tree to the server and then compile with
ssh.  Since upgrading the rsync fails about 90% of the time with:
rsync error: error in rsync protocol data stream (code 12) at
/home/lapo/packaging/tmp/rsync-2.6.3/io.c(515)

The client dies after a few seconds and the server sits there for a while.  The
server is running Fedora Core 1 with rsync 2.5.7.
Comment 6 Bryan White 2005-01-26 05:58:48 UTC
I tried copying rsync.exe from an older but working installation (Win2k).  This
was rsync 2.6.0.  That did not fix the problem.  Tonight I will try copying in
an entire working c:\cygwin tree and see if that helps.

I have determined that the problem only seems to occur when pushing files up to
the server.  I have NOT seen problems doing:
- retrieving new files from the server
- dry runs to see what is different
- upload runs where no files need to be sent.
Comment 7 Nos 2005-02-24 19:09:03 UTC
Hi guys My rSync hangs too, it's a cywin-rsync..
It's a little different,
there is no error message

The following is all that is shown:

receiving file list ... done
zTransfer/html/thumbs/12592.jpg
zTransfer/html/thumbs/12593.jpg
zTransfer/html/thumbs/12594.jpg
<blah blah blah.. >
zTransfer/html/thumbs/12596.jpg
zTransfer/html/thumbs/12597.jpg


Then it just hangs like this, nothing else happens.
The rsync lasts around 10seconds

I don't know what is going on,
rsyncing to other hosts works fine, just this host is acting up.
The host is in a LAN with me and connection is fine.
Firewall is off, 

What else can be the problem?

Any reply will be appreciated
Thanks!
Comment 8 Wayne Davison 2005-02-25 16:58:49 UTC
Has anyone tried commenting out HAVE_SOCKETPAIR from config.h to see if that is
the problem?  I mentioned somewhere that this seems to be the common case of
failure -- it fails for a local transfer and for a transfer going through a
remote shell, both of which use a socketpair() to communicate (unless
HAVE_SOCKETPAIR is undefined, in which case a normal pipe() would be used).
Comment 9 Wayne Davison 2005-03-18 19:39:55 UTC
I'm marking the cygwin hang bugs as "LATER" because this is a bug is in the
cygwin pipe code, so it is outside rsync's control.  We'll revisit this issue
later after we hear that the cygwin code has been fixed.

I wonder if specifying a --bwlimit might work around the problem by ensuring
that the pipes can't fill up enough to deadlock.  While we're waiting for a
cygwin fix, give that a try.
Comment 10 Michal Trojnara 2006-06-29 03:12:02 UTC
(In reply to comment #9)
> I wonder if specifying a --bwlimit might work around the problem by ensuring
> that the pipes can't fill up enough to deadlock.  While we're waiting for a
> cygwin fix, give that a try.

--bwlimit workaround works like a charm!

You can reproduce this problem by sending large file (like 1GB) over a WAN link (like 1Mbps) using rsync on cygwin.
Comment 11 Matt McCutchen 2007-11-20 16:59:47 UTC
I am curious about whether a tweak to the way --bwlimit is implemented (e.g., moving the sleep_for_bwlimit call *before* the write in writefd_unbuffered, or perhaps lowering bwlimit_writemax) would make it a more effective workaround.  I encourage anyone who is running up against the hang to do the rest of us a favor by investigating this.
Comment 12 Tony Garland 2008-10-02 12:46:07 UTC
The --bwlimit work-around did not fix the problem for me. Instead, I found this page:

http://marc-abramowitz.com/archives/2007/10/14/solving-rsync-hangs-with-cygwin/ 

which described running rsync on the client against a TCP daemon service (rather than over ssh) which solved the problem for me. After setting up an rsyncd.conf file and installing the rsyncd daemon as a service on the remote machine:

cygrunsrv --install "rsyncd" --path /usr/bin/rsync --args "--daemon --no-detach"
  --desc "Starts a rsync daemon for accepting incoming rsync connections"
  --disp "Rsync Daemon" --type auto

it worked without a hitch and is no longer stalling on large files.