We've recently noticed some issues with our servers (and clients alike) that cause either:

A) the server to suddenly have a very high CPU load that then diminishes, or
B) clients hitting connection errors and not completing a sync.

Here is an example from one of our mirror mailing lists:

www-apache/
www-apps/
www-apps/drupal/files/
www-apps/wordpress/files/
www-client/
www-servers/aolserver/files/
www-servers/jboss/files/
www-servers/resin/files/
www-servers/tomcat/files/
x11-base/
x11-libs/
x11-libs/ecore/files/
x11-libs/gtk+/files/
x11-misc/
x11-plugins/
x11-terms/
x11-themes/
x11-wm/
x11-wm/fluxbox/files/
rsync: connection unexpectedly closed (2761332 bytes received so far) [generator]
rsync error: error in rsync protocol data stream (code 12) at io.c(365)

From the server's point of view, our loads increase dramatically:

http://dev.gentoo.org/~ramereth/images/raptor-load.png
http://dev.gentoo.org/~ramereth/images/raptor-processes.png

While I know most of the spikes occur at the top of the hour (most likely cronned syncs from our users), I haven't had nagios alerts warning about high loads in the past. I suspect that something is going on between the client and server which either creates a timeout and uses a lot of CPU, or something else entirely. I should note that most of our users are probably still using 2.6.0, mainly because when they upgraded to anything higher we saw a lot of errors while syncing. Perhaps this is a related issue; I'm not sure. There's already a bug open about this in our bugzilla:

https://bugs.gentoo.org/show_bug.cgi?id=83254

I have upgraded one of our rsync servers to 2.6.4 to see if the load issue is still there. It seems to have reduced it a little, but it's still present. That could be because our users need to upgrade as well. Any ideas or recommendations?
Rsync uses a large amount of CPU on the sending side because the rsync algorithm trades CPU and disk I/O to reduce network I/O (and running an encryption algorithm on top of that drives the CPU load even higher, which is why a daemon connection is less CPU intensive). The only way to reduce this is the --whole-file option, which makes rsync retransfer each changed file in its entirety rather than spend CPU figuring out the differences.

The gentoo bug you cite seems to be primarily concerned with timeouts, and this is one of the things that 2.6.5 (and, to a lesser extent, 2.6.4) tries to fix. You do need to set things up correctly, though:

(1) both sides need to be running at least 2.6.4 for any timeout avoidance to occur (and the server needs to be running 2.6.5 for maximal timeout avoidance);

(2) both sides need to know about the timeout, so if the server is a daemon that has a timeout specified in its config file, the client needs to have the same (or a lower) timeout set via the --timeout command-line option, or the client will not know to send the keep-alive packets to the sender. A short sketch of a matching setup follows.
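For example (the module name and the timeout value here are just illustrative), if the daemon's rsyncd.conf contains:

timeout = 600

then the client should specify the same (or a lower) value:

rsync -av --timeout=600 rsync://server/module/ /local/path/

so that both sides agree on when keep-alive packets need to be sent.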
We at PlanetMirror see this error regularly from our upstream gentoo mirrors, as well as from a plethora of other upstream mirrors. While I hope the fix in 2.6.5 resolves the problem, what you're suggesting here amounts to a mammoth task for almost every mirror site on the planet. Would it be wise to include sane defaults for both client and server/daemon, such that simply upgrading to 2.6.5 dramatically decreases the problem?
We are seeing this error as well at my workplace. Both sides are using 2.6.5. We are invoking rsync with the following command line:

/usr/bin/rsync -e ssh --rsync-path=/usr/bin/rsync -av --recursive --timeout=0 user@host:/path/to/data /path/to/save

The error occurs on hosts where the server side is under high load from the application running on that host. We are not running an rsync daemon on the server side. The latest error is:

Read from remote host hostname: Connection reset by peer
rsync: connection unexpectedly closed (51838271 bytes received so far) [receiver]
rsync error: error in rsync protocol data stream (code 12) at io.c(434)
rsync: connection unexpectedly closed (34295 bytes received so far) [generator]
rsync error: error in rsync protocol data stream (code 12) at io.c(434)
> Would it be wise to include sane defaults for both client and server/daemon

Rsync has always had sane defaults for both client and server/daemon: no timeouts are enabled by default, which means that the connection will continue as long as both sides are present. If you set a timeout in the rsyncd.conf file, I'd recommend that you set it to be quite long -- maybe something like an hour -- just so that it cleans up malicious/buggy connections but does not interfere with slow transfers.
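Concretely (the value is just illustrative), that would be a line like this in rsyncd.conf:

timeout = 3600

which reaps connections that have been dead for an hour without penalizing slow-but-live transfers.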
> The latest error is:
> Read from remote host hostname: Connection reset by peer

All this tells you is that the connection closed. See the issues/debugging webpage for ways to diagnose what is happening to make the remote end of the connection go away (assuming that it is not a network issue).
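As a starting point (not a definitive recipe), running the client with extra verbosity and capturing stderr can help narrow down where the connection dies -- reusing your placeholder paths:

rsync -avvv -e ssh user@host:/path/to/data /path/to/save 2>rsync-debug.log

It is also worth checking the remote side's logs (e.g. the ssh daemon, or the heavily loaded application you mentioned) around the time of the failure.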
These bugs seem to be related (they have similar error messages: connection unexpectedly closed, broken pipe, timeout).

bug7757 - with a big file, rsync times out when it should not; the sender is still responsive

bug2783 - random high loads during syncs (server side) / client stream errors:
    rsync: connection unexpectedly closed (2761332 bytes received so far) [generator]
    rsync error: error in rsync protocol data stream (code 12) at io.c(365)

bug5478 - rsync: writefd_unbuffered failed to write 4092 bytes [sender]: Broken pipe (32):
    rsync: writefd_unbuffered failed to write 4092 bytes [sender]: Broken pipe (32)
    io timeout after 30 seconds -- exiting
    rsync error: timeout in data send/receive (code 30) at io.c(239) [sender=3.0.2]

bug5695 - improve keep-alive code to handle long-running directory scans
    ./io.c:void maybe_send_keepalive(void)

bug6175 - write last transfer status when timeout or other error happens:
    rsync: writefd_unbuffered failed to write 4 bytes [sender]: Broken pipe (32)
    rsync: connection unexpectedly closed (99113 bytes received so far) [sender]
    rsync error: unexplained error (code 255) at io.c(600) [sender=3.0.5]

bug7195 - timeout reached while sending checksums for very large files
I ran into a similar issue recently while transferring large files (>40GB). After a few tests, it seems - in my case at least - to be related to the delta-xfer algorithm: the bug no longer happens with the -W option. I don't know if this will resolve your issue, but you can also try looking into these options: --no-checksum, --no-compress, --blocking-io. They were not the source of my problems, but the functions they relate to might trigger a network timeout. I hope this helps; anyway, good luck solving your issue.
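In case it's useful, the workaround that made the problem disappear for me was simply adding -W (--whole-file); something like this, where the paths are placeholders for your own:

rsync -avW --progress user@host:/path/to/bigfile /local/dir/

This skips the delta-xfer algorithm entirely, at the cost of retransmitting each changed file in full.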