Bug 2957 - rsync hangs when client put files to server (checked in 2.6.6 too)
rsync hangs when client put files to server (checked in 2.6.6 too)
Status: ASSIGNED
Product: rsync
Classification: Unclassified
Component: core
2.6.5
x86 Linux
: P3 normal
: ---
Assigned To: Wayne Davison
Rsync QA Contact
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2005-08-05 04:22 UTC by szczur
Modified: 2008-09-07 23:06 UTC (History)
3 users (show)

See Also:


Attachments
strace -f /usr/local/rsync-2.6.6/bin/rsync (823.75 KB, text/plain)
2005-08-05 04:34 UTC, szczur
no flags Details
Backtrace and syscall of client+server (2.08 KB, text/plain)
2005-10-02 07:10 UTC, bugzilla.samba
no flags Details
Simple shellscript for reproducing the bug (410 bytes, text/plain)
2005-10-10 03:00 UTC, Peter Keitler
no flags Details
rsync server configuration (90 bytes, text/plain)
2005-10-11 07:02 UTC, Peter Keitler
no flags Details
rsync server log (333.77 KB, text/plain)
2005-10-11 07:03 UTC, Peter Keitler
no flags Details
Shellscript using rsync server (437 bytes, text/plain)
2005-10-11 07:10 UTC, Peter Keitler
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description szczur 2005-08-05 04:22:02 UTC
Sometime backup is done, but freqently not.

rsync daemon 2.6.6 is insnstaled @ linux (PLD) with kernel 2.4.31. rsync client
1.6.6 is instaled @ Linux (PLD) with kernel 2.4.28. client connected to server
by 100Mb lan.
Comment 1 szczur 2005-08-05 04:34:47 UTC
Created attachment 1357 [details]
strace -f /usr/local/rsync-2.6.6/bin/rsync
Comment 2 bugzilla.samba 2005-10-01 04:24:55 UTC
Think I am experiencing the same here.

@szczur: Did it work for you with another version of rsync? Are you using an
stunnel between the machines or anything? Is the rsync-receiver or the sender
the problem?
Comment 3 bugzilla.samba 2005-10-02 07:10:23 UTC
Created attachment 1464 [details]
Backtrace and syscall of client+server

This is the backtrace from rsync on client and server using rsync 2.6.6-2 as
currently shipped by Fedora-development - also using the separate
debuginfo-package they provide:
http://download.fedora.redhat.com/pub/fedora/linux/core/development/i386/debug/
Comment 4 bugzilla.samba 2005-10-02 07:14:27 UTC
See attached backtrace. Client+server are hung in a select-syscall. The sync is
currently runnign through an stunnel for security-reasons, but a colleague said
the same occured even without stunnel running. Killing the sync and restarting
it a few times, finally makes the sync go through step by step. It seems that 
it has got to do something with the number of files which need to be synced or
similar. The file which is synced when the hang occurs is not "locked" or
anything - it's a static image that was created hours ago in this case.

PS: Please update version-number of this bug to 2.6.6!
Comment 5 Peter Keitler 2005-10-07 07:57:31 UTC
Think, we observe the same problem here.
I implemented a compilerserver since our compiler is dongled.
Source files (mainly *.c, *.h and *.asm) are synced with the server (Win XP,
SP2, Cygwin, rsync 2.6.3) before the Makefile is invoked there via ssh. So far,
everything works fine. Afterwards, the generated files (mainly *.o, *.d, *.lst
and *.hex) shall be synced back to the client (Win XP SP2, Cygwin with rsync 2.6.6)

I tried different versions, ssh tunnel (single :) and rsync server (doble :),
the problem remains the same. The server process hangs. I have to kill it with
'kill -9 <pid> and after that, the client prints the following error message:

rsync: read error: Connection reset by peer (104)
rsync error: error in rsync protocol data stream (code 12) at
/home/lapo/packaging/tmp/rsync-2.6.6/io.c(584)
rsync: connection unexpectedly closed (8157 bytes received so far) [generator]
rsync error: error in rsync protocol data stream (code 12) at
/home/lapo/packaging/tmp/rsync-2.6.6/io.c(434)
Error: Communication channel to compilerserver crashed.

The sync finally goes through when I repeat the back-syncing sequence several
times, thus killing the server process and invoking the sync again.
Comment 6 Peter Keitler 2005-10-10 03:00:09 UTC
Created attachment 1495 [details]
Simple shellscript for reproducing the bug

This simple script reproduces the problem reliably (at least on our machines
here).
Once started, the script logs in via ssh to the server and generates there 500
files with a fixed size of 5000 bytes. After that, rsync is invoked to sync
these files with the client machine. The sync-process crashes when it is run
for the second time, thus when files with the same name already exist on the
client.
Strangly enough, the number of the file where the crash occurs is somewhere
between 100 and 200, if I reduce the amount of files to sync from 500 to 200,
however, everything works fine.
Comment 7 bugzilla.samba 2005-10-10 16:36:47 UTC
Tried the shell-script here: Creating files, syncing, creating files with other
content, syncing - all fine. Even with 2500 instead of 500 files. Too bad :-(

Any possibility too track it down further with the backtrace or so?
Comment 8 Peter Keitler 2005-10-11 07:02:38 UTC
Created attachment 1497 [details]
rsync server configuration
Comment 9 Peter Keitler 2005-10-11 07:03:14 UTC
Created attachment 1498 [details]
rsync server log
Comment 10 Peter Keitler 2005-10-11 07:10:01 UTC
Created attachment 1499 [details]
Shellscript using rsync server
Comment 11 Peter Keitler 2005-10-11 07:19:41 UTC
I tried the same script with our Linux server as rsync server (rsync 2.6.0). In
this constellation, everything is ok. So maybe, I am confronted with a pure
Cygwin problem that is different from the original bug 2957 that has been
reported on a Linux system. Do you have the possibility to reproduce it with
Cygwin or shall I report the problem in the cygwin list?
Anyway, the problem is not independent of timing. I wrote yesterday that the
script has to be invoked twice (assuming testdir and its contents don't exist),
thus the files have to be overwritten on the client. I just switched to rsync
server mode (::) and activated debugging messages with '-vvvvvvvvvv'. Now, it
gets stuck already in the first run.

> Any possibility too track it down further with the backtrace or so?

I'm not familiar with cygwin debugging. The cygwin 'ps' command shows an 'O'
which means that rsync gets stuck in an 'Output' operation. 

$ ps
      PID    PPID    PGID     WINPID  TTY  UID    STIME COMMAND
     3796    3196    3796       3364    1 1013 16:15:11 /usr/bin/bash
O    1588    2436    1588       2944    ? 1013 16:16:48 /usr/bin/rsync
     3408    3796    3408       3200    1 1013 16:17:24 /usr/bin/ps

Invoking 'strace' then just shows

$ strace -p 1588
Attached to pid 1588 (windows pid 2944)
   11      11 [unknown (0x108)] rsync 1588 _cygtls::remove: wait 0x0

Don't know if this is useful. Please tell me what to do, if necessary.
Comment 12 Wayne Davison 2005-10-11 09:36:11 UTC
(In reply to comment #1)
> Created an attachment (id=1357) [edit]
> strace -f /usr/local/rsync-2.6.6/bin/rsync 

This strace shows that the sender is just waiting to send some data down the
socket, which means that the receiver hasn't cleared the socket to allow more
data to be written.  Thus, we need to see what the receiver is/was doing.  If
the receiving side is the server, you should run the daemon like this:

  strace -f rsync --daemon --no-detach

If you run that inside a "script" session, all the strace output will be logged.
Comment 13 Wayne Davison 2005-10-11 09:39:44 UTC
(In reply to comment #3)
> Created an attachment (id=1464) [edit]
> Backtrace and syscall of client+server

There are 3 rsync processes involved in a transfer:  two on the receiving side,
and one on the sending side.  We'd need to see strace info for all 3 to get the
full information for what's happening.
Comment 14 Wayne Davison 2005-10-11 09:43:26 UTC
(In reply to comment #11)
> So maybe, I am confronted with a pure
> Cygwin problem that is different from the original bug 2957 that has been
> reported on a Linux system.

Anytime a remote-shell is used in a Cygwin scenario, you are subject to Cygwin's
pipe-data bug where data can potentially be lost.  As soon as they fix this, the
hangs should stop.  Until then, you can use daemon mode directly (not daemon
mode over ssh).
Comment 15 bugzilla.samba 2005-10-11 12:10:03 UTC
Over here no cygwin is involved. Only rsync over an stunnel using xinetd.
Colleague said he even tried that without stunnel, but with no better success.

I did try the test-script as well - but neither with 500 nor 2500 generated
random-files I was able to reproduce the problem "cleanly" unfortunately :-(

I did attach a backtrace and the syscall in which it hangs already. Is more
detailed information needed? Since the problem (at least here) is not cleanly
reproduceable with demo-data unfortunately the _full_ strace-log might include
sensitive data, that I wouldn't appreciate to be online. So in case you could
narrow the problem (strace-output) a bit down or provide some kind of
test-scripts that would really help!
Comment 16 Peter Keitler 2005-10-12 07:55:28 UTC
Seems as using a rsync daemon via TCP solves the problem for me. I have not
encountered any of the afore-mentioned problems so far with it. 

Thank you very much for this information.

Cheers,
Peter
Comment 17 Peter Keitler 2005-10-12 08:56:35 UTC
It really seems to be a cygwin bug. Sorry for not trying TCP before...
Thanx again
Comment 18 bugzilla.samba 2005-10-22 04:38:40 UTC
I'm using it via ssltunnel here - so I suspect iit does neither have to do with
tcp-or-not-tcp nor cygwin (running under Linux). Is there any test-script I can
help you with, or any further information?
Comment 19 Wayne Davison 2005-10-22 10:18:32 UTC
(In reply to comment #18)
> Is there any test-script I can help you with,
> or any further information?

As mentioned in comment #13 (which was a reply to you), we need to see what all
3 processes were doing right before the hang in order to narrow the problem
down.  The 2 processes you traced were both doing what they were supposed to be
doing, so we need to see what the 3rd process was doing.
Comment 20 Wayne Davison 2005-10-22 10:26:38 UTC
Also, don't forget the advice from the first entry on the issues-and-debugging
page about reporting the send/receive queue info from netstat:

http://rsync.samba.org/issues.html

That will also help.
Comment 21 Pablo Marenco 2005-11-02 18:28:18 UTC
Think I am having the same problem. Sometimes the backup is done, sometimes not.
Restarting rsync daemon seems to stops this error from appearing.

I am using cwrsync server and client.
- cwrsync version, 2.6.6 (also happens with 2.6.5).
- rsync daemon running without ssh. No tunnel.
- connected by 100Mb LAN
- rsyncd.conf, module readonly = false

Client script (.bat):
#rsync.exe -va --progress --del --max-delete=50 --link-dest=/%yesterday%/
%sourceunx% rsync://%server%/%module%/%today%

Output:
#rsync: read error: Connection reset by peer (104)
#rsync error: error in rsync protocol data stream code (code 12) at io.c(584)

Comment 22 Brian Essig 2006-10-09 20:16:23 UTC
Hello,

I was receiviong a similar error.  I have put the results of my debug at: http://www.awsolutions.info/rsync-26762.out

Any help would be greatly appreciated.

Thanks,
Brian
Comment 23 Wayne Davison 2006-10-10 19:33:32 UTC
Brian:

You're getting a syntax error.  This bug is about a hang.  I'd suggest posting to the mailing list to get help with the command you're running.
Comment 24 stared 2008-07-11 12:34:38 UTC
Created attachment 3404
Comment 25 John Beshir 2008-09-07 19:51:02 UTC
I believe I have this bug occurring between two particular Linux systems over SSH, pushing the backup, the client running 3.0.3, and the server running 2.6.9. The client performs much of the backup, then proceeds to hang completely.

An strace in the client ends with the following:

select(5, NULL, [4], [4], {60, 0})      = 1 (out [4], left {60, 0})
write(4, "\252{\5\204\361\300\371I?\202\25\322\370\355t#\301\375\222\303O\220\357\251\301<S?\360<\364\312"..., 2295) = 2295
select(5, NULL, [4], [4], {60, 0})      = 1 (out [4], left {60, 0})
write(4, "\377\377\377\377", 4)         = 4
select(6, [5], [], NULL, {60, 0})       = 0 (Timeout)
select(6, [5], [], NULL, {60, 0})       = 0 (Timeout)
select(6, [5], [], NULL, {60, 0})       = 0 (Timeout)
select(6, [5], [], NULL, {60, 0})       = 0 (Timeout)
select(6, [5], [], NULL, {60, 0} <unfinished ...>

Is there anything I can do to assist in tracking down this bug? It's quite the irritant as it breaks my backup scripts for that server.
Comment 26 Matt McCutchen 2008-09-07 20:14:32 UTC
John, it would be helpful to have a stack trace and strace for each of the three processes (sender, generator, and receiver).
Comment 27 John Beshir 2008-09-07 23:06:15 UTC
I wasn't able to get a stack trace, as none of the rsync processes were compiled with debugging information, except the client, which lacked gdb (or an easy method of installing it).

However, I have straces for each. The client is from startup, and the two server processes had strace attached as quickly as possible after startup, at roughly the same time.

Client: http://fudgeman.org/pub/strace.log
Parent process on server: http://fudgeman.org/pub/strace-parent.log
Child process on server: http://fudgeman.org/pub/strace-child.log

The server-side errors were after the client was terminated, after it hung.