We've recently noticed some issues with our servers (and clients alike) that cause either:

A) the server to suddenly have a very high CPU load that then diminishes, or
B) clients hitting connection errors and not completing a sync.

Here is an example from one of our mirror mailing lists:

www-apache/
www-apps/
www-apps/drupal/files/
www-apps/wordpress/files/
www-client/
www-servers/aolserver/files/
www-servers/jboss/files/
www-servers/resin/files/
www-servers/tomcat/files/
x11-base/
x11-libs/
x11-libs/ecore/files/
x11-libs/gtk+/files/
x11-misc/
x11-plugins/
x11-terms/
x11-themes/
x11-wm/
x11-wm/fluxbox/files/
rsync: connection unexpectedly closed (2761332 bytes received so far) [generator]
rsync error: error in rsync protocol data stream (code 12) at io.c(365)

From the server's point of view, our loads increase dramatically:

http://dev.gentoo.org/~ramereth/images/raptor-load.png
http://dev.gentoo.org/~ramereth/images/raptor-processes.png

While I know most of the spikes occur at the top of the hour (most likely cronned syncs from our users), I haven't had nagios alerts warning about high loads in the past. I suspect that something is going on between the client and server which either creates a timeout and uses a lot of CPU, or something else entirely. I should note that most of our users are probably still using 2.6.0, mainly because when they upgraded to anything higher we saw a lot of errors while syncing. Perhaps this is a related issue; I'm not sure. There's already a bug open about this in our bugzilla:

https://bugs.gentoo.org/show_bug.cgi?id=83254

I have upgraded one of our rsync servers to 2.6.4 to see if the load issue is still there. It seems to have reduced it a little, but it's still present. That could be because our users need to upgrade as well. Any ideas or recommendations?
Rsync uses a large amount of CPU on the sending side because the rsync algorithm trades CPU and disk I/O to reduce network I/O (and running an encryption algorithm on top of that drives the CPU load even higher, which is why a daemon connection is less CPU intensive). The only way to reduce this is the --whole-file option, which makes rsync retransfer each changed file in its entirety rather than spend CPU figuring out the differences.

The gentoo bug you cite seems to be primarily concerned with timeouts, and this is one of the things that 2.6.5 (and, to a lesser extent, 2.6.4) tries to fix. You do need to set things up correctly, though:

(1) both sides need to be running at least 2.6.4 for any timeout avoidance to occur (and the server needs to be running 2.6.5 for maximal timeout avoidance);

(2) both sides need to know about the timeout, so if the server is a daemon that has a timeout specified in its config file, the client needs to have the same (or a lower) timeout set via the --timeout command-line option, or the client will not know to send the keep-alive packets to the sender. A short sketch of a matching setup follows.
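For example (the module name and the timeout value here are just illustrative), if the daemon's rsyncd.conf contains:

timeout = 600

then the client should specify the same (or a lower) value:

rsync -av --timeout=600 rsync://server/module/ /local/path/

so that both sides agree on when keep-alive packets need to be sent.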
We at PlanetMirror see this error regularly from our upstream gentoo mirrors, as well as from a plethora of other upstream mirrors. While I hope the fix in 2.6.5 resolves the problem, what you're suggesting here amounts to a mammoth task for almost every mirror site on the planet. Would it be wise to include sane defaults for both client and server/daemon, such that simply upgrading to 2.6.5 dramatically decreases the problem?
We are seeing this error as well at my workplace. Both sides are using 2.6.5. We are invoking rsync with the following command line:

/usr/bin/rsync -e ssh --rsync-path=/usr/bin/rsync -av --recursive --timeout=0 user@host:/path/to/data /path/to/save

The error occurs on hosts where the server side is under high load from the application running on that host. We are not running an rsync daemon on the server side. The latest error is:

Read from remote host hostname: Connection reset by peer
rsync: connection unexpectedly closed (51838271 bytes received so far) [receiver]
rsync error: error in rsync protocol data stream (code 12) at io.c(434)
rsync: connection unexpectedly closed (34295 bytes received so far) [generator]
rsync error: error in rsync protocol data stream (code 12) at io.c(434)
> Would it be wise to include sane defaults for both client and server/daemon

Rsync has always had sane defaults for both client and server/daemon: no timeouts are enabled by default, which means that the connection will continue as long as both sides are present. If you set a timeout in the rsyncd.conf file, I'd recommend that you set it to be quite long -- maybe something like an hour -- just so that it cleans up malicious/buggy connections but does not interfere with slow transfers.
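Concretely (the value is just illustrative), that would be a line like this in rsyncd.conf:

timeout = 3600

which reaps connections that have been dead for an hour without penalizing slow-but-live transfers.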
> The latest error is:
> Read from remote host hostname: Connection reset by peer

All this tells you is that the connection closed. See the issues/debugging webpage for ways to diagnose what is happening to make the remote end of the connection go away (assuming that it is not a network issue).
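As a starting point (not a definitive recipe), running the client with extra verbosity and capturing stderr can help narrow down where the connection dies -- reusing your placeholder paths:

rsync -avvv -e ssh user@host:/path/to/data /path/to/save 2>rsync-debug.log

It is also worth checking the remote side's logs (e.g. the ssh daemon, or the heavily loaded application you mentioned) around the time of the failure.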
These bugs seem to be related (they have similar error messages: connection unexpectedly closed, broken pipe, timeout).

bug7757 - with a big file, rsync times out when it should not; the sender is still responsive

bug2783 - random high loads during syncs (server side) / client stream errors:
    rsync: connection unexpectedly closed (2761332 bytes received so far) [generator]
    rsync error: error in rsync protocol data stream (code 12) at io.c(365)

bug5478 - rsync: writefd_unbuffered failed to write 4092 bytes [sender]: Broken pipe (32):
    rsync: writefd_unbuffered failed to write 4092 bytes [sender]: Broken pipe (32)
    io timeout after 30 seconds -- exiting
    rsync error: timeout in data send/receive (code 30) at io.c(239) [sender=3.0.2]

bug5695 - improve keep-alive code to handle long-running directory scans
    ./io.c:void maybe_send_keepalive(void)

bug6175 - write last transfer status when timeout or other error happens:
    rsync: writefd_unbuffered failed to write 4 bytes [sender]: Broken pipe (32)
    rsync: connection unexpectedly closed (99113 bytes received so far) [sender]
    rsync error: unexplained error (code 255) at io.c(600) [sender=3.0.5]

bug7195 - timeout reached while sending checksums for very large files
I ran into a similar issue recently while transferring large files (>40GB). After a few tests, it seems - in my case at least - to be related to the delta-xfer algorithm: the bug no longer happens with the -W option. I don't know if this will resolve your issue, but you can also try looking into these options: --no-checksum, --no-compress, --blocking-io. They were not the source of my problems, but the functions they relate to might trigger a network timeout. I hope this helps; anyway, good luck solving your issue.
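In case it's useful, the workaround that made the problem disappear for me was simply adding -W (--whole-file); something like this, where the paths are placeholders for your own:

rsync -avW --progress user@host:/path/to/bigfile /local/dir/

This skips the delta-xfer algorithm entirely, at the cost of retransmitting each changed file in full.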