Bug 5478 - rsync: writefd_unbuffered failed to write 4092 bytes [sender]: Broken pipe (32)

Status: REOPENED
Product: rsync
Classification: Unclassified
Component: core
Version: 3.0.3
Platform: Other Linux
Priority: P3 normal
Assigned To: Wayne Davison
QA Contact: Rsync QA Contact
URL: https://bugzilla.samba.org/show_bug.c...

Reported: 2008-05-21 03:21 UTC by Xuân Baldauf
Modified: 2013-07-10 19:26 UTC
Description Xuân Baldauf 2008-05-21 03:21:29 UTC
After years, bug 1959 still exists in rsync 3.0.2 :-(

rsync: writefd_unbuffered failed to write 4092 bytes [sender]: Broken pipe (32)
io timeout after 30 seconds -- exiting
rsync error: timeout in data send/receive (code 30) at io.c(239) [sender=3.0.2]

My rsync options are "--one-file-system --archive --xattrs --delete-before --update --timeout=0".

The job is to back up an entire filesystem onto a USB drive. Generating the directories-list takes nearly an hour. And some time after that, I receive this error.

My working hypothesis is that the receiving rsync cannot cope with a write() call that needs more than 30 seconds to complete, but I'm not sure.
Comment 1 Wayne Davison 2008-05-21 09:14:19 UTC
Bug 1959 has nothing whatsoever to do with an "io timeout".  That is a value specified either by the client or by the daemon.  Since the --timeout=0 value on the client didn't turn off timeouts, I have to imagine that you're using a daemon, and the daemon has a 30-second timeout specified (which the client is not allowed to override).  If so, you must specify the same (or lower, but not 0) timeout to the client for the rsync client to be able to help out with keep-alive activities.  A timeout of 30 seconds is also rather short, so I'd suggest increasing it in the daemon's config.
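For the daemon case Wayne describes, the relevant setting is the `timeout` parameter in the daemon's rsyncd.conf. An illustrative module stanza (the module name and path here are hypothetical):

```
[backup]
    path = /srv/backup
    read only = false
    # Disconnect clients that stall for more than this many seconds.
    # A client's --timeout cannot override a daemon-side timeout, so
    # give clients the same (or a lower, nonzero) --timeout so their
    # keep-alive handling can kick in.
    timeout = 300
```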
Comment 2 Xuân Baldauf 2008-05-21 09:24:54 UTC
(In reply to comment #1)
> Bug 1959 has nothing whatsoever to do with an "io timeout".  That is a value
> specified either by the client or by the daemon.  Since the --timeout=0 value
> on the client didn't turn off timeouts, I have to imagine that you're using a
> daemon,

No, I'm not using a daemon. I'm using a foreground rsync command, where both the source and the destination are directories in the local filesystem.

> and the daemon has a 30-second timeout specified (which the client is
> not allowed to override).  If so, you must specify the same (or lower, but not
> 0) timeout to the client for the rsync client to be able to help out with
> keep-alive activites.  A timeout of 30 seconds is also rather short, so I'd
> suggest increasing it in the daemon's config.
> 

Comment 3 Xuân Baldauf 2008-05-21 09:30:41 UTC
My intuition says that the bug lies more in "io.c" ( http://git.samba.org/?p=rsync.git;a=blob;f=io.c;hb=73cb6738b33846130c21f2903b2200fa3f1903ab ):


#define SELECT_TIMEOUT 30

...

void set_io_timeout(int secs)
{
	io_timeout = secs;

	if (!io_timeout || io_timeout > SELECT_TIMEOUT)
		select_timeout = SELECT_TIMEOUT;
	else
		select_timeout = io_timeout;

	allowed_lull = read_batch ? 0 : (io_timeout + 1) / 2;
}

"secs" only has an effect if (secs != 0) && (secs <= 30)

Thus, supplying a timeout greater than 30, or a timeout of 0, actually does not have any effect.
Comment 4 Wayne Davison 2008-05-21 09:55:13 UTC
Interesting.  The 30-second timeout is therefore the one turned on in the error handling to try to get a message back from the receiving side.  Usually the receiving side vanishes before the end of the 30 seconds, so it is unusual to see that error in this context.

You will need to diagnose why one of the receiving processes (presumably the receiver, not the generator) is vanishing.  I'd suggest attaching strace (strace -o FILENAME -p PID) and/or enabling core dumps (ulimit -c unlimited).
Comment 5 Wayne Davison 2008-07-28 20:43:25 UTC
Any more info on this?  You could try the latest 3.1.0dev version and try using the --msgs2stderr option to see if it reveals a remote error that was getting lost when things closed down.
Comment 6 Tim Taiwanese Liim 2010-10-27 00:28:40 UTC
These bugs seem to be related (having similar error messages:
connection unexpectedly closed, broken pipe, timeout).

bug7757
    with big files, rsync times out when it should not; the 
    sender is still responsive

bug2783
    Random high loads during syncs (server side) / client stream errors
    rsync: connection unexpectedly closed (2761332 bytes received so far)
    [generator]
    rsync error: error in rsync protocol data stream (code 12) at io.c(365)

bug5478
    rsync: writefd_unbuffered failed to write 4092 bytes [sender]: 
                Broken pipe (32)

    rsync: writefd_unbuffered failed to write 4092 bytes [sender]: 
                Broken pipe (32)
    io timeout after 30 seconds -- exiting
    rsync error: timeout in data send/receive (code 30) at io.c(239) 
                [sender=3.0.2]

bug5695
    improve keep-alive code to handle long-running directory scans

    ./io.c:void maybe_send_keepalive(void)

bug6175
    write last transfer status when timeout or other error happens
    rsync: writefd_unbuffered failed to write 4 bytes [sender]: 
           Broken pipe (32)
    rsync: connection unexpectedly closed (99113 bytes received so 
           far) [sender]
    rsync error: unexplained error (code 255) at io.c(600) [sender=3.0.5]

bug7195
    timeout reached while sending checksums for very large files
Comment 7 Eric Shubert 2011-09-20 01:16:05 UTC
I'd been running a 3.0.7 setup for months with no problem, until June 2011 when this bug showed up. I subsequently upgraded to 3.0.8, but the error persists.

It seems to be triggered by certain file content. The error happened most frequently with files that were individual email messages (Maildir format) containing attached pictures. When I remove the offending source file and restart, more files are processed OK until another offending file is reached. If I restart without removing the offending file, it always gets stuck on the same file until it's removed from the list to be processed.

Then it happened with a QuickBooks backup (.qbb) file as well. My guess is that there's something in the data stream that's causing something I/O-wise to seize up.

I have kept examples of files that cause this. I set up the rsync-debug script on the target server (not running daemon mode, just ssh), and the strace around the error at the end looks like this:
31031 select(1, [0], [], NULL, {60, 0}) = 1 (in [0], left {60, 0})
31031 read(0, "H\10\26}\23565\6\300RV\331\23\262M\357Q\210ngTQ\373\3523\326\6\362\246\"\1\375"..., 4092) = 4092
31031 select(1, [0], [], NULL, {60, 0}) = 1 (in [0], left {60, 0})
31031 read(0, "\374\17\0\7", 4)         = 4
31031 select(1, [0], [], NULL, {60, 0}) = 1 (in [0], left {60, 0})
31031 read(0, "+R~\344\374I\204\344_<\311\235\307\340iT|q\32z=\373(\243\320\332\32527x\336b"..., 4092) = 4092
31031 select(1, [0], [], NULL, {60, 0}) = 1 (in [0], left {60, 0})
31031 read(0, "\374\17\0\7", 4)         = 4
31031 select(1, [0], [], NULL, {60, 0}) = 1 (in [0], left {60, 0})
31031 read(0, "\206/\227-W\370+\337\367\251\\h\247\346\330\233\356s\353\254\276\356\365a\206%wS\357\234:\377"..., 4092) = 4092
31031 select(1, [0], [], NULL, {60, 0}) = 1 (in [0], left {60, 0})
31031 read(0, "\374\17\0\7", 4)         = 4
31031 select(1, [0], [], NULL, {60, 0}) = 1 (in [0], left {60, 0})
31031 read(0, "\262\231\222\225%\370\374\262.\262\21A\222V\300\336\0226\262\246\325=*\314\234\314\204Z[(\351w"..., 4092) = 4092
31031 select(1, [0], [], NULL, {60, 0} <unfinished ...>
31030 <... select resumed> )            = 0 (Timeout)
31030 select(4, [3], [], NULL, {60, 0} <unfinished ...>
31031 <... select resumed> )            = 0 (Timeout)
31031 select(1, [0], [], NULL, {60, 0} <unfinished ...>
31030 <... select resumed> )            = 0 (Timeout)
31030 select(4, [3], [], NULL, {60, 0} <unfinished ...>
31031 <... select resumed> )            = 0 (Timeout)
31031 select(1, [0], [], NULL, {60, 0} <unfinished ...>
31030 <... select resumed> )            = 0 (Timeout)
31030 select(4, [3], [], NULL, {60, 0} <unfinished ...>
31031 <... select resumed> )            = 0 (Timeout)
etc.

If there's anything I can do to help further to get this bug fixed, please let me know.

Thanks.
Comment 8 Tim Taiwanese Liim 2011-09-20 02:12:09 UTC
Re: Comment #7
> The error happened most frequently with files ...  containing
> attached pictures.
Eric,
By any chance do you consider these files "big" files?  I am asking
because I saw similar timeout when sending big files (bug7757,
bug7195).  Also in your next strace output, could you use "-tt" to
show timestamps?  Thanks.
Comment 9 Eric Shubert 2011-09-20 14:42:37 UTC
(In reply to comment #8)
> By any chance do you consider these files "big" files?  I am asking
> because I saw similar timeout when sending big files (bug7757,
> bug7195).

Not really. Most problem files appear to be 20-30MB, but a few are <10M and there's one which is only 25K. I'll keep this in mind though.

> Also in your next strace output, could you use "-tt" to
> show timestamps?  Thanks.

Here's one with timestamps:
1636  07:15:20.138388 select(1, [0], [], NULL, {60, 0}) = 1 (in [0], left {60, 0})
1636  07:15:20.138646 read(0, "\374\17\0\7", 4) = 4
1636  07:15:20.138883 select(1, [0], [], NULL, {60, 0}) = 1 (in [0], left {60, 0})
1636  07:15:20.139121 read(0, "\211\313Wq\361\315ogBR\371\346\351\327\335\r@\20\373\276\303\372<F\373\222\333\340\353\272\261\3"..., 4092) = 4092
1636  07:15:20.139466 select(1, [0], [], NULL, {60, 0}) = 1 (in [0], left {60, 0})
1636  07:15:20.139714 read(0, "\374\17\0\7", 4) = 4
1636  07:15:20.139956 select(1, [0], [], NULL, {60, 0}) = 1 (in [0], left {60, 0})
1636  07:15:20.140195 read(0, "\307\371e\245(K\17\351\201O\357\2139\221\322%\345)\201.\252G\364\221\326\256\306u(\2353\346"..., 4092) = 4092
1636  07:15:20.140475 select(1, [0], [], NULL, {60, 0}) = 1 (in [0], left {60, 0})
1636  07:15:20.140716 read(0, "\374\17\0\7", 4) = 4
1636  07:15:20.140943 select(1, [0], [], NULL, {60, 0}) = 1 (in [0], left {60, 0})
1636  07:15:20.141180 read(0, "\221\367\307\26yg0\31\t\314\250Gr\366e\301\231]\211~\253F-\320\240T\177T.\201\206\341"..., 4092) = 1223
1636  07:15:20.141502 select(1, [0], [], NULL, {60, 0} <unfinished ...>
1635  07:15:51.190750 <... select resumed> ) = 0 (Timeout)
1635  07:15:51.190975 select(4, [3], [], NULL, {60, 0} <unfinished ...>
1636  07:16:20.139240 <... select resumed> ) = 0 (Timeout)
1636  07:16:20.139459 select(1, [0], [], NULL, {60, 0} <unfinished ...>
1635  07:16:51.194224 <... select resumed> ) = 0 (Timeout)
1635  07:16:51.194447 select(4, [3], [], NULL, {60, 0} <unfinished ...>

Thanks for having a look at this Tim.

I'm going to put a problem/test file on the target machine's hdd and continue testing with a local2local transfer. I should have mentioned that the target in this case is a pair of USB2 HDDs in a raid-1 configuration. I seem to recall another posting of this error involved writing to a USB drive as well. I expect at this point that USB might be a contributing factor. I'm stumped, though, as to why it worked fine for over a year before coming across this problem.
Comment 10 Eric Shubert 2011-09-20 15:39:39 UTC
(In reply to comment #9)
> 
> I'm going to put a problem/test file on the target machine's hdd and continue
> testing with a local2local transfer.

I'm holding off on the local test for the moment. After a closer look at the strace, the problem appears to be with the transport/ssh mechanism (input side on the target).

Early on in the strace, after a series (10x) of 4092-byte reads, I see this:
1636  07:12:20.682841 select(1, [0], [], NULL, {60, 0}) = 1 (in [0], left {60, 0})
1636  07:12:20.683211 read(0, "\374\17\0\7", 4) = 4
1636  07:12:20.683964 select(1, [0], [], NULL, {60, 0}) = 1 (in [0], left {60, 0})
1636  07:12:20.684226 read(0, "\222E\23\272R\224<\256\0I`\365\177\247\204\207Jf\370\277=\\q\24q\357\202\"z\307\207H"..., 4092) = 4092
1636  07:12:20.684653 select(1, [0], [], NULL, {60, 0}) = 1 (in [0], left {60, 0})
1636  07:12:20.685065 read(0, "\374\17\0\7", 4) = 4
1636  07:12:20.685291 select(1, [0], [], NULL, {60, 0}) = 1 (in [0], left {60, 0})
1636  07:12:20.685627 read(0, "v5z\260\341\346\260\335w\22\311P\37|%:2N\2177\304\0264\215#\344\274P^\307u\313"..., 4092) = 1223
1636  07:12:20.685984 select(1, [0], [], NULL, {60, 0} <unfinished ...>
1635  07:12:20.686201 <... getdents64 resumed> /* 2 entries */, 32768) = 48
1635  07:12:20.687710 getdents64(0, /* 0 entries */, 32768) = 0
1635  07:12:20.687953 close(0)          = 0

This looks like the same situation as the timeout, except that there is "getdents64 resumed" instead of "select resumed" after the select statement. Don't know if it's related or not.

I'll keep looking to see what I can tell from the strace.
Comment 11 Eric Shubert 2011-09-22 16:54:14 UTC
(In reply to comment #9)
> (In reply to comment #8)
> > By any chance do you consider these files "big" files?  I am asking
> > because I saw similar timeout when sending big files (bug7757,
> > bug7195).
> 
> Not really. Most problem files appear to be 20-30MB, but a few are <10M and
> there's one which is only 25K. I'll keep this in mind though.
> 

I need to change my answer here. The files that cause it to choke are larger than most, so I would have to say that as a general observation the files that fail are large(r). Not huge, but large.

I also need to correct my earlier statement/thinking. While the process did fail on the same file on consecutive runs, leading me to believe that the content of the file was causing the problem, I've determined that this is not always the case. Subsequent runs have transferred these files successfully, so it's not the content of the data that's a problem.

I've been making many (10s of) runs with strace on the target, and have noticed a pattern, which I'll attempt to describe (not having looked at the source code).

When transmitting these large files, the maximum size read at once is 4092 bytes. In most cases of failure, there are 7 of these 4092-byte chunks, followed by 2 smaller chunks, which vary in size. In the sample below, they are 3699, then 393. Other value pairs I've seen are 3697 and 395, 2514 and 1578, 2758 and 1334, 1039 and 3053, 1085 and 3007, 4087 and 5, etc. (all pairs sum to 4092). These are followed by another 7 4092-byte chunks, then another 3699 and 393, and so on. When the process hangs, it always times out waiting for the first of the two smaller chunks (after 7 reads of 4092 bytes).

On one failure, all reads were 4092 bytes, with no smaller chunks. In this case, the failure occurred after 7 reads of 4092 bytes which followed a write.

I hope that this gives a clue as to where the problem might be. Please let me know if there's something else I can provide. I'll continue to investigate.

Thanks.
Comment 12 Eric Shubert 2011-09-22 18:35:06 UTC
(In reply to comment #11)
> When the process hangs, it always times out waiting for the
> first of the two smaller chunks (after 7 reads of 4092 bytes).
> 
Correction. It times out on the select function following the successful read of the first of the two smaller chunks:
13018 10:38:47.708267 read(0, "\374\17\0\7", 4) = 4
13018 10:38:47.708522 select(1, [0], [], NULL, {60, 0}) = 1 (in [0], left {60, 0})
13018 10:38:47.708757 read(0, "\6B#\377a\357\223\360\237\251J\2477q\216+\252s\\\363X\223\264\244pC\241y\350\321\240R"..., 4092) = 4092
13018 10:38:47.709034 select(1, [0], [], NULL, {60, 0}) = 1 (in [0], left {60, 0})
13018 10:38:47.709264 read(0, "\374\17\0\7", 4) = 4
13018 10:38:47.709619 select(1, [0], [], NULL, {60, 0}) = 1 (in [0], left {60, 0})
13018 10:38:47.709862 read(0, "`\337\24f\220&\0379\223\"y\340\10\247\240\3702\333\360\311e/\362$\333\215\216r\24sw\361"..., 4092) = 3699
13018 10:38:47.710137 select(1, [0], [], NULL, {60, 0} <unfinished ...>
13017 10:39:22.695271 <... select resumed> ) = 0 (Timeout)
13017 10:39:22.695484 select(4, [3], [], NULL, {60, 0} <unfinished ...>
13018 10:39:47.701076 <... select resumed> ) = 0 (Timeout)
13018 10:39:47.701266 select(1, [0], [], NULL, {60, 0} <unfinished ...>
13017 10:40:22.696550 <... select resumed> ) = 0 (Timeout)

This is what happens in the normal flow:
13018 10:38:47.498236 read(0, "\374\17\0\7", 4) = 4
13018 10:38:47.498460 select(1, [0], [], NULL, {60, 0}) = 1 (in [0], left {60, 0})
13018 10:38:47.498710 read(0, "\266\222\6\301\257q\224\0)\302\363R\273\rJ\264g\235\"0\230\311sq\277\243\260\277\322\32\313\213"..., 4092) = 4092
13018 10:38:47.499589 select(1, [0], [], NULL, {60, 0}) = 1 (in [0], left {60, 0})
13018 10:38:47.499852 read(0, "\374\17\0\7", 4) = 4
13018 10:38:47.500054 select(1, [0], [], NULL, {60, 0}) = 1 (in [0], left {60, 0})
13018 10:38:47.500278 read(0, "z\33\0\276\304\177o\246\274\225gH\30\271|\317\365(h\234\314)\223\306\260?\370\274\355\272@\21"..., 4092) = 3699
13018 10:38:47.500599 select(1, [0], [], NULL, {60, 0}) = 1 (in [0], left {59, 810000})
13018 10:38:47.698206 read(0, "\274&x\365\1\247\252Y\177\7e\17\227\325\207BL\177{\354_\317\2655\352~\20\303&\240\226\303"..., 393) = 393

I'm presently attempting to capture a corresponding strace on the source side to see what's going on there.
Comment 13 Eric Shubert 2011-09-22 19:25:19 UTC
(In reply to comment #12)
> I'm presently attempting to capture a corresponding strace on the source side
> to see what's going on there.

Here are matching straces from the sending and receiving sides.

First the receiving (time on this host is slightly ahead):
13265 12:01:56.418051 read(0, "\374\17\0\7", 4) = 4
13265 12:01:56.418251 select(1, [0], [], NULL, {60, 0}) = 1 (in [0], left {60, 0})
13265 12:01:56.418544 read(0, "\341\2373}P\226ZL\316n\273\220\344\246\r\n\215r\245I\241Z\351\225\273t\224\302p\375\375\316"..., 4092) = 4092
13265 12:01:56.418855 select(1, [0], [], NULL, {60, 0}) = 1 (in [0], left {60, 0})
13265 12:01:56.419119 read(0, "\374\17\0\7", 4) = 4
13265 12:01:56.419319 select(1, [0], [], NULL, {60, 0}) = 1 (in [0], left {60, 0})
13265 12:01:56.419564 read(0, "vmb/\216L\360\227\343\31&\240x\306[T\376Dx-\354\360v0\263\f0\266#\305b\363"..., 4092) = 4092
13265 12:01:56.421406 select(1, [0], [], NULL, {60, 0} <unfinished ...>
13264 12:02:55.255790 <... select resumed> ) = 0 (Timeout)
13264 12:02:55.256043 select(4, [3], [], NULL, {60, 0} <unfinished ...>
13265 12:02:56.421901 <... select resumed> ) = 0 (Timeout)
13265 12:02:56.422114 select(1, [0], [], NULL, {60, 0} <unfinished ...>
13264 12:03:55.257239 <... select resumed> ) = 0 (Timeout)
13264 12:03:55.257426 select(4, [3], [], NULL, {60, 0} <unfinished ...>
13265 12:03:56.420390 <... select resumed> ) = 0 (Timeout)

Now the sending:
2611  12:01:55.348752 write(4, "\374\17\0\7", 4) = 4
2611  12:01:55.348862 select(5, NULL, [4], [4], {60, 0}) = 1 (out [4], left {60, 0})
2611  12:01:55.348983 write(4, "\17m\304\365\313M\344:\26\253Ogu\311\200Q\2\303\254\373\233T<;\221\351\375\207\264\305\365z"..., 4092) = 4092
2611  12:01:55.349115 select(5, NULL, [4], [4], {60, 0}) = 1 (out [4], left {60, 0})
2611  12:01:55.349236 write(4, "\374\17\0\7", 4) = 4
2611  12:01:55.349344 select(5, NULL, [4], [4], {60, 0}) = 1 (out [4], left {60, 0})
2611  12:01:55.349465 write(4, "\10\2074\tC\36\336r\364\233\0_\336]\311\17\"N\352\22\241O\2450\250\271\305\204Q\202\316\344"..., 4092) = 4092
2611  12:01:55.353308 select(5, NULL, [4], [4], {60, 0}) = 1 (out [4], left {60, 0})
2611  12:01:55.353471 write(4, "\374\17\0\7", 4) = 4
2611  12:01:55.353586 select(5, NULL, [4], [4], {60, 0} <unfinished ...>
2612  12:01:56.089897 <... select resumed> ) = 1 (out [3])
2612  12:01:56.090021 write(3, "\326\n\324q\327dR\6\215\201?\227\234\217K:\221c\336\204|@\273\2431f9.\17KTr"..., 158192) = 97920
2612  12:01:56.091620 select(7, [3], [3], NULL, NULL <unfinished ...>
2611  12:02:55.349271 <... select resumed> ) = 0 (Timeout)
2611  12:02:55.349381 select(5, NULL, [4], [4], {60, 0}) = 0 (Timeout)
2611  12:03:55.347803 select(5, NULL, [4], [4], {60, 0}) = 0 (Timeout)
2611  12:04:55.346412 select(5, NULL, [4], [4], {60, 0} <unfinished ...>
2612  12:05:06.185778 <... select resumed> ) = ? ERESTARTNOHAND (To be restarted)
2611  12:05:06.185871 <... select resumed> ) = ? ERESTARTNOHAND (To be restarted)
2612  12:05:06.185926 --- SIGINT (Interrupt) @ 0 (0) ---
2611  12:05:06.186867 --- SIGINT (Interrupt) @ 0 (0) ---
2612  12:05:06.187011 sigreturn( <unfinished ...>

I don't know what's supposed to be happening here, so I'll be grateful for any pointers. I suspect at this point that the problem is with the sending side, but that's not saying much.

Thanks.
Comment 14 Wayne Davison 2011-09-23 02:19:04 UTC
It's a bit unclear what pid corresponds to what function, especially with both sides having 2 pids in the trace output.  What is the command-line used?

So, looking at your trace, it looks like the sender is waiting to write more data (its select lists the write and error FDs).  The receiving side's 2 pids are waiting to read more data.  So, something in between has either failed to deliver the data, or blocked, or something (it would appear).  Are there other processes in between the 2 sides (e.g. ssh) that could have blocked?  Or perhaps a failed pipe (or unix domain socket)?
Comment 15 Eric Shubert 2011-09-23 15:55:27 UTC
(In reply to comment #14)
> It's a bit unclear what pid corresponds to what function, especially with both
> sides having 2 pids in the trace output.  What is the command-line used?

  strace -tt -f -o /tmp/rsync-$$.out rsync \
        --verbose \
        --exclude "stor/hosts" \
        --exclude "var/run" \
        --exclude "var/spool" \
        --rsh="ssh -p $backoff_port" \
        --rsync-path="sudo /usr/local/bin/rsync-debug" \
        --recursive \
        --links \
        --perms \
        --times \
        --owner \
        --group \
        --numeric-ids \
        --xattrs \
        --delete \
        --compress \
        --human-readable \
        --stats \
        --exclude='udat/backup*/*' \
        --exclude='udat/migrate/*' \
        /mnt/stor \
        $userid@$backoff_url:$destdir 2>&1

As you can see, it's pushing from the local host via ssh to the remote host.

> So, looking at your trace, it looks like the sender is waiting to write more
> data (its select lists the write and error FDs).  The receiving side's 2 pids
> are waiting to read more data.  So, something in between has either failed to
> deliver the data, or blocked, or something (it would appear).  Are there other
> processes in between the 2 sides (e.g. ssh) that could have blocked?  Or
> perhaps a failed pipe (or unix domain socket)?

Here is another run, with processes. Both sides are rsync-3.0.8-1.el5.rfx.

First the sending side processes:
root      4921  4914  5 07:38 tty1     00:00:13 strace -tt -f -o /tmp/rsync-4914.out rsync --verbose --exclude stor/hosts --exclude var/run --exclude var/spool --rsh=ssh -p 17113 --rsync-path=sudo /usr/local/bin/rsync-debug --recursive --links --perms --times --owner --group --numeric-ids --xattrs --delete --compress --human-readable --stats --exclude=udat/backup*/* --exclude=udat/migrate/* /mnt/stor r3i@backoff.shubes.net:/home/r3i/backup/tacs-udat
root      4923  4921  7 07:38 tty1     00:00:17 rsync --verbose --exclude stor/hosts --exclude var/run --exclude var/spool --rsh=ssh -p 17113 --rsync-path=sudo /usr/local/bin/rsync-debug --recursive --links --perms --times --owner --group --numeric-ids --xattrs --delete --compress --human-readable --stats --exclude=udat/backup*/* --exclude=udat/migrate/* /mnt/stor r3i@backoff.shubes.net:/home/r3i/backup/tacs-udat
root      4924  4923  0 07:38 tty1     00:00:01 ssh -p 17113 -l r3i backoff.shubes.net sudo /usr/local/bin/rsync-debug --server -vlogtpXrze.isf --delete --numeric-ids . /home/r3i/backup/tacs-udat

The sending side strace:
4923  07:47:21.380396 select(5, NULL, [4], [4], {60, 0}) = 1 (out [4], left {60, 0})
4923  07:47:21.380537 write(4, "4\25\t\20\342Q\353\177A\0039rC\272\301\372\335\364\17?\313g\241\215\313\370%\222s\367F|"..., 4092) = 4092
4923  07:47:21.380683 select(5, NULL, [4], [4], {60, 0}) = 1 (out [4], left {60, 0})
4923  07:47:21.380806 write(4, "\374\17\0\7", 4) = 4
4923  07:47:21.380914 select(5, NULL, [4], [4], {60, 0}) = 1 (out [4], left {60, 0})
4923  07:47:21.381034 write(4, "\351\4n\317\346;f\321Pn>~[\333\337\200\214#\33\233\365\3\5\322P\1\221\364\255\230<\16"..., 4092) = 4092
4923  07:47:21.382583 select(5, NULL, [4], [4], {60, 0}) = 1 (out [4], left {60, 0})
4923  07:47:21.382777 write(4, "\374\17\0\7", 4) = 4
4923  07:47:21.382897 select(5, NULL, [4], [4], {60, 0} <unfinished ...>
4924  07:47:21.499524 <... select resumed> ) = 1 (out [3])
4924  07:47:21.499685 write(3, "D\3143?I\322+\362S\231\311Xe\302\264\1\235|R\231Wyw\214\302\211\206\365^\337;\24"..., 131552) = 66240
4924  07:47:21.500711 select(7, [3], [3], NULL, NULL) = 1 (out [3])
4924  07:47:22.518796 write(3, "\350l\250\255\361YB\241`__\302oM\333\346\247\223\tdcM\203O\217aA\260\t\377gm"..., 98128) = 54720
4924  07:47:22.519874 select(7, [3], [3], NULL, NULL <unfinished ...>
4923  07:48:21.377865 <... select resumed> ) = 0 (Timeout)
4923  07:48:21.377989 select(5, NULL, [4], [4], {60, 0}) = 0 (Timeout)
4923  07:49:21.379500 select(5, NULL, [4], [4], {60, 0}) = 0 (Timeout)
4923  07:50:21.378390 select(5, NULL, [4], [4], {60, 0}) = 0 (Timeout)


The receiving side processes:
root     16837 16836  0 07:39 ?        00:00:00 /bin/sh /usr/local/bin/rsync-debug --server -vlogtpXrze.i --delete --numeric-ids . /home/r3i/backup/tacs-udat
root     16848 16837 18 07:39 ?        00:01:03 strace -tt -f -o /tmp/rsync-16837.out rsync --server -vlogtpXrze.i --delete --numeric-ids . /home/r3i/backup/tacs-udat
root     16849 16848 15 07:39 ?        00:00:51 rsync --server -vlogtpXrze.i --delete --numeric-ids . /home/r3i/backup/tacs-udat
root     16850 16849  1 07:39 ?        00:00:04 rsync --server -vlogtpXrze.i --delete --numeric-ids . /home/r3i/backup/tacs-udat

The receiving side strace:
16850 07:47:22.332291 select(1, [0], [], NULL, {60, 0}) = 1 (in [0], left {60, 0})
16850 07:47:22.332556 read(0, "\374\17\0\7", 4) = 4
16850 07:47:22.332808 select(1, [0], [], NULL, {60, 0}) = 1 (in [0], left {60, 0})
16850 07:47:22.333036 read(0, "&\f]c\254\306\20\326mK\0\r\365\36k\236IQ\365l\23\337Ne\322p\361\207\237\320'\257"..., 4092) = 4092
16850 07:47:22.333328 select(1, [0], [], NULL, {60, 0}) = 1 (in [0], left {60, 0})
16850 07:47:22.333664 read(0, "\374\17\0\7", 4) = 4
16850 07:47:22.333867 select(1, [0], [], NULL, {60, 0}) = 1 (in [0], left {60, 0})
16850 07:47:22.334091 read(0, "\212N@\347\31\261\\\4\271g\216\nQ;\331C\34)\365\321\205\363CW\306\313;\322\0226\242\342"..., 4092) = 4092
16850 07:47:22.335051 select(1, [0], [], NULL, {60, 0} <unfinished ...>
16849 07:48:11.629178 <... select resumed> ) = 0 (Timeout)
16849 07:48:11.629374 select(4, [3], [], NULL, {60, 0} <unfinished ...>
16850 07:48:22.335875 <... select resumed> ) = 0 (Timeout)
16850 07:48:22.336060 select(1, [0], [], NULL, {60, 0} <unfinished ...>


Thanks for looking at this. Please let me know what I can do next to help nail this down.
Comment 16 Eric Shubert 2011-09-28 20:40:15 UTC
I've done some further testing and would like to report my findings.

I tried the rsync to a different target host on a different IP address via a different ISP, and lo and behold, it worked. I conclude from this that the problem is outside of the source host, either in the network transport somewhere or in the target host.

I have a quickbooks backup (.qbb) file that fails pretty regularly, so I took a closer look at the strace on the target host for this file.
On one failure, I get:
3060  19:43:19.737510 select(1, [0], [], NULL, {60, 0}) = 1 (in [0], left {60, 0})
3060  19:43:19.737768 read(0, "\252\220\346\2\f\216\t3\\\256(\t\202\3\222\211\16\0246\310H\1\342\202\203\2\374\346\0273l\317"..., 4092) = 4092
3060  19:43:19.738449 select(1, [0], [], NULL, {60, 0} <unfinished ...>
3059  19:43:33.847680 <... select resumed> ) = 0 (Timeout)
3059  19:43:33.847918 select(4, [3], [], NULL, {60, 0} <unfinished ...>
3060  19:44:19.730067 <... select resumed> ) = 0 (Timeout)

On a subsequent run a few minutes later, I also get a failure, but approximately 3 seconds later. Curiously, I see this in the 2nd run at the point of failure in the first run (notice the identical data):
3170  20:08:03.378981 select(1, [0], [], NULL, {60, 0}) = 1 (in [0], left {60, 0})
3170  20:08:03.379237 read(0, "\252\220\346\2\f\216\t3\\\256(\t\202\3\222\211\16\0246\310H\1\342\202\203\2\374\346\0273l\317"..., 4092) = 4092
3170  20:08:03.379744 select(1, [0], [], NULL, {60, 0}) = 1 (in [0], left {59, 640000})
3170  20:08:03.728257 read(0, "\374\17\0\7", 4) = 4

The 2nd run continues for 3 seconds, then times out like the previous run:
3170  20:08:07.624052 select(1, [0], [], NULL, {60, 0}) = 1 (in [0], left {60, 0})
3170  20:08:07.624424 read(0, "\3,\234\37J\341\374\240\v\347\227W\30\277\220\342\377\222\v\347\27W\30\277\214\342\377\202\v\347\247\366"..., 4092) = 4092
3170  20:08:07.625000 select(1, [0], [], NULL, {60, 0} <unfinished ...>
3169  20:08:48.205338 <... select resumed> ) = 0 (Timeout)
3169  20:08:48.205547 select(4, [3], [], NULL, {60, 0} <unfinished ...>
3170  20:09:07.627769 <... select resumed> ) = 0 (Timeout)

I thought perhaps the "left {59, 640000}" was a clue given that it happened at the exact same place as a previous failure, but I see the same sort of thing after every 8 reads, so I don't expect it's indicative of a problem.


So now I'm beginning to think that the problem is (indeed) outside of the rsync code. I've examined the 2 target hosts (one that worked and one that doesn't), and tried to determine "what's different?". These hosts are all VMs that are running under VMware Server2. The host that worked is running CentOS 5.6, while the one that's failing is running CentOS 5.4 (450 days uptime). I'm going to upgrade the failing host OS to CentOS 5.7 and see if that changes anything. I know that some versions of glibc (> 2.5-34, < 2.5.58) had mysterious problems with VMware Server 2 (guests terminating for no apparent reason). I'm wondering if there's something else in glibc that's perhaps having a problem with VMware Server 2. (I realize that Server 2 is essentially dead and will be migrating to KVM in the near future, although Server2 has been quite stable for me.)

I'll report my findings when the upgrade is complete.
Any pointers toward how to troubleshoot this thing would be more than welcome.
TIA.
Comment 17 Eric Shubert 2011-09-29 17:16:38 UTC
(In reply to comment #16)
> 
> So now I'm beginning to think that the problem is (indeed) outside of the rsync
> code. I've examined the 2 target hosts (one that worked and one that doesn't),
> and tried to determine "what's different?". These hosts are all VMs that are
> running under VMware Server2. The host that worked is running CentOS 5.6, while
> the one that's failing is running CentOS 5.4 (450 days uptime). I'm going to
> upgrade the failing host OS to CentOS 5.7 and see if that changes anything. I
> know that some versions of glibc (> 2.5-34, < 2.5.58) had mysterious problems
> with VMware Server 2 (guests terminating for no apparent reason). I'm wondering
> if there's something else in glibc that's perhaps having a problem with VMware
> Server 2. (I realize that Server 2 is essentially dead and will be migrating to
> KVM in the near future, although Server2 has been quite stable for me.)
> 

I've upgraded the target host, and still no dice. Same failure. It seems odd to me that the transmission fails at different points in the process, anywhere from 3.5 minutes to 14.75 minutes in.

I'm presently thinking that the problem is not with rsync itself, but with some aspect of the network connection. I'm seeing no corresponding ethernet errors on either end though, and the network seems to be fine otherwise.

I'd have to say at this point that the failures appear to be happening only with large files.

I could sure use some pointers with identifying what the cause of this failure is. I have a test file that fails regularly (if not consistently), and should be able to tell easily when it's fixed.
Comment 18 Eric Shubert 2011-09-29 20:39:54 UTC
(In reply to comment #17)
> 
> I'm presently thinking that the problem is not with rsync itself, but with some
> aspect of the network connection. I'm seeing no corresponding ethernet errors
> on either end though, and the network seems to be fine otherwise.
> 
> I'd have to say at this point that the failures appear to be happening only
> with large files.
> 

I've (re)read through all 5 bug reports that Tim identified. The general consensus is that this happens with relatively large files and relatively slow writes (i.e. loaded targets).

Upon closer examination of a pair of straces, I see this on the target side:
6814  11:34:25.018774 read(0, "\3037sE\303{\264\16\377\333\357\375\7\274}1c\21\306\377\215\256\202\7\376\245\26\24\302{\16c"..., 4092) = 4092
6814  11:34:25.020201 select(1, [0], [], NULL, {60, 0} <unfinished ...>
6814  11:35:25.009953 <... select resumed> ) = 0 (Timeout)

I found the corresponding block of data in the strace on the source side:
940   11:33:57.494543 write(4, "\3037sE\303{\264\16\377\333\357\375\7\274}1c\21\306\377\215\256\202\7\376\245\26\24\302{\16c"..., 4092 <unfinished ...>
940   11:33:57.494681 <... write resumed> ) = 4092

I had previously set both hosts to use ntp with the same time server, so their times should be very close.

In the source-side strace, I counted ~510 (successful) writes of data between the one listed above and the point at which the source host timed out:
940   11:34:25.073394 select(5, NULL, [4], [4], {60, 0}) = 1 (out [4], left {60, 0})
940   11:34:25.073514 write(4, "\374\17\0\7", 4) = 4
940   11:34:25.073620 select(5, NULL, [4], [4], {60, 0}) = 0 (Timeout)

Notice that the last successful read on the target side happened 0.06 seconds before the last write on the source side, which is pretty much at the same time. At this point, there were ~510 x 4K blocks of data (~2MB) "in the pipe" that had been written by the sender but not yet read.

So I'm wondering: is it possible that "the pipe" (the buffers) fills up, and that when it does, the select function is unable to handle it? I don't (yet) know how these things work, so I could be totally off base. At the same time, this would seem to explain things.

I'll continue to poke at this to see what I can come up with. In the meantime, any feedback from Wayne and Tim will be appreciated.
Comment 19 Eric Shubert 2011-10-06 15:28:32 UTC
(In reply to comment #14)
> So, looking at your trace, it looks like the sender is waiting to write more
> data (its select lists the write and error FDs).  The receiving side's 2 pids
> are waiting to read more data.  So, something in between has either failed to
> deliver the data, or blocked, or something (it would appear). Are there other
> processes in between the 2 sides (e.g. ssh) that could have blocked?  Or
> perhaps a failed pipe (or unix domain socket)?

After further testing, I'm tending to agree that the problem lies outside of rsync. Somehow, the select() is getting stuck (not receiving a signal from the kernel?). I'm not yet certain which side gets stuck first, but I would guess it's the target, given that in one test there was about 2MB of data "in the pipe" that had been written by the source but not yet read by the target.

Working on the idea that the failure happens when the pipe is saturated, I decided to try the --bwlimit option to slow things down a bit. The upload speed at the source is limited to about 100 KBps (DSL, shared with other users), so I tried --bwlimit=64. This worked noticeably better, although it also failed after a few hours. There may have been usage spikes that maxed out the available bandwidth at the time, but I can't tell whether that happened. So I reduced it to --bwlimit=32, and it's been running for 7 hours now without failure (it hasn't finished yet, but I'm optimistic).

So I'm inclined to think that this problem lies somewhere in the transport, and that it comes to light when the transport is saturated.

Does anyone have any ideas on how to go about tracking this down further, or where to go for further help? While this bug is definitely affecting rsync, I think the solution is going to end up being elsewhere, perhaps in glibc or the kernel.
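A minimal local sketch of the throttling idea described above, assuming GNU coreutils and (optionally) an rsync binary on the PATH; the /tmp/bwtest paths are placeholders invented for the sketch, not taken from this report:

```shell
# --bwlimit takes KiB/s in rsync 3.x, so 32 mirrors the rate Eric tried.
# Local source/dest paths make the sketch runnable without a network.
mkdir -p /tmp/bwtest/src /tmp/bwtest/dst
dd if=/dev/zero of=/tmp/bwtest/src/file bs=1024 count=64 2>/dev/null
if command -v rsync >/dev/null 2>&1; then
  rsync --archive --bwlimit=32 /tmp/bwtest/src/ /tmp/bwtest/dst/
else
  cp /tmp/bwtest/src/file /tmp/bwtest/dst/file  # fallback if rsync is absent
fi
cmp -s /tmp/bwtest/src/file /tmp/bwtest/dst/file && echo "copy verified"
```

Over a real link the same option would simply ride along on the normal invocation, e.g. `rsync --archive --bwlimit=32 /src/ user@host:/dest/`.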
Comment 20 Tim Taiwanese Liim 2011-10-07 21:57:54 UTC
I agree with Wayne and Eric that Eric's issue is outside of rsync,
somewhere in the transport.

Eric,
Have you tried to check the TCP buffers of the ssh process on both
ends?  For example,
    p=192.168.51.98:22
    while true; do date; netstat -tn | grep $p; sleep 1; done
        Proto Recv-Q Send-Q Local Address    Foreign Address     State
        tcp        0  19328 192.168.51.98:22 192.168.51.51:53010 ESTABLISHED
In this example, the sender has 19328 bytes in its TCP sending buffer.
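As an aside (my addition, not part of Tim's original suggestion): on current distributions the iproute2 tool ss(8) reports the same Recv-Q/Send-Q counters as netstat; port 22 is assumed here because the transfer runs over ssh:

```shell
# Recv-Q/Send-Q show bytes queued in the kernel; a Send-Q that grows and
# never drains on the sender points at a stalled transport, not at rsync.
if command -v ss >/dev/null 2>&1; then
  ss -tn '( sport = :22 or dport = :22 )'
else
  netstat -tn | grep ':22'      # fallback for systems without iproute2
fi
```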


You can also use tcpdump and wireshark to graph how well the tcp pipe
goes:
    # catch 100 bytes of each packet on eth0, write to t.pcap.  We
    # need only first 100 bytes because we care only about the TCP
    # sequence number, but not the actual file content.
    # need root access to sniff packets.
    tcpdump -i eth0 -s 100 -w t.pcap
then feed the trace to wireshark:
    wireshark t.pcap
    # then select menu "Statistics" --> TCP Stream Graph --> 
    # Time-Sequence Graph (Stevens)
With this you can visualize how the TCP flow goes (smooth, stalled,
fluctuating, or with excessive retries).  With tcpdump from both ends,
you can also check for lost packets.  (A few years ago I worked on a
case of stalled ssh; it turned out the NIC firmware was defective,
causing excessive packet loss.)  A comparison between the two targets
(the good one and the bad one) may show the difference.  Do your two
targets run on the same host machine?  Or host machines of the same
configuration (same NIC etc.)?  Could one have a bad NIC (e.g. working,
but with excessive packet loss in bursts)?

As Wayne pointed out, there is a pipe (or unix domain socket) between
rsync and ssh as well.  I don't know how to track the queue size of a
pipe yet, so let's track the known ones (TCP) first.

> Notice, the last successful read on the target side happened .06
> seconds before the last write on the source side, which is pretty
> much at the same time.
This is an important clue, although I don't know what to make of it
yet; maybe a few lost TCP ACKs in a row?

How did your "bwlimit=32" test go?


BTW, I am not an rsync developer; I just use rsync a lot and try to
help when possible, so please do not take my responses as official.
Comment 21 Martin Scherbaum 2012-07-13 19:14:03 UTC
Hello everyone,

Although the last posting was already a while ago, out of respect for the meticulous work of Eric (and Tim and Wayne), which was a pleasure to read, I'd like to add an observation I made for the very same strange rsync behaviour.

I tried to rsync some 225GB of image data (mainly 8 to 12 Mpix JPG files) from my notebook (openSUSE 11.4) over a WLAN (ca. 120 Mbit raw) to a RaidSonic NAS (connected to my router with a 100 Mbit cable; it officially has a 1 Gbit interface) which writes the data to a RAID1 configuration of 3.5" hard drives.
So, no really large files, but a not-so-fast NAS.
It hung again and again with the above-mentioned message "rsync: writefd_unbuffered failed to write 4092 bytes [sender..." for different files.

After all the discussion of network matters in this thread, I thought (since I'm working within my LAN) I could turn off openSUSE's firewall for a try. And since that moment everything has run fine, and fast at that: rsync doesn't stall and files arrive properly, independent of size (I also had some larger files).

Although I'm not able to do as deep an analysis as the writers before me, I hope this observation helps anyone who googles for the error message...
Comment 22 Gaute Hope 2012-10-01 10:20:42 UTC
(In reply to comment #21)
> After all the discussion of network matters that I saw from your thread I
> thought (as I'm working in my LAN) I could turn of openSUSE's firewall for a
> try. And: since that moment everything runs fine! And even fast. rsync doesn't
> stall and files arrive properly, independent from size (I also got some larger
> files).

I experienced a plethora of errors which all seemed to be related to this one. The server runs on a router. When I turned off the firewall and addressed the server on its LAN interface, everything worked nicely (I assume that using the outside address meant the rsync connection, via ssh, passed through the firewall). If I can keep addressing it on the LAN and turn the firewall back on, everything should be fine.

I noticed that when the error happens, the client-side rsync hangs until the timeout, and when checking the remote before the timeout, there is no rsync process running. Perhaps the connection is severed by the firewall. I assume rsync has no mechanism for retrying the connection without restarting the entire process?
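One hedged workaround for a stateful firewall that drops idle connections (my suggestion, not something tested in this report) is to make ssh itself send keepalives so the connection never looks idle; `ServerAliveInterval`/`ServerAliveCountMax` are standard OpenSSH client options, and the host below is a placeholder:

```shell
# ssh -G prints the effective client config without connecting, so this
# only verifies the options are accepted by the local ssh binary.
# In real use the options ride along on rsync's transport, e.g.:
#   rsync -a -e "ssh -o ServerAliveInterval=30 -o ServerAliveCountMax=4" \
#         /src/ user@host:/dest/
if command -v ssh >/dev/null 2>&1; then
  ssh -G -o ServerAliveInterval=30 example.invalid | grep -i '^serveraliveinterval'
else
  echo "serveraliveinterval 30"  # fallback so the sketch runs without ssh
fi
```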

- Gaute
Comment 23 Chip Schweiss 2012-10-22 16:23:10 UTC
I came here when I was experiencing this problem. I later found the cause.

If you are running Red Hat/CentOS 5/6, there is a long-standing bug in the kernel distributed by Red Hat that causes these broken pipes when rsyncing over a latent network.

There are ACK packets that frequently get rejected when they should be allowed by connection tracking. Insert a rule before the REJECT line in the INPUT chain:

iptables -I INPUT {line} -p tcp --tcp-flags SYN,RST,ACK,FIN ACK -j ACCEPT

Do an 'iptables -L INPUT -n -v --line-numbers' to figure out which line number to insert at.

This affects all TCP connections not just rsync over ssh.
Comment 24 Loïc Gomez 2012-10-23 11:24:52 UTC
I also had the same issue. (In reply to comment #11)
> I need to change my answer here. The files that cause it to choke are larger
> than most, so I would have to say that as a general observation the files that
> fail are large(r). Not huge, but large.
> 
> I also need to correct my earlier statement/thinking. While the process did
> fail on the same file on consecutive runs, leading me to believe that the
> content of the file was causing the problem, I've determined that this is not
> always the case. Subsequent runs have transferred these files successfully, so
> it's not the content of the data that's a problem

I ran into a similar issue recently while transferring large files (>40GB). After a few tests, it seems (in my case at least) to be related to the delta-xfer algorithm: the bug no longer happens with the -W (whole-file) option.

I don't know if this will resolve your issue, but you can also try looking into these options: --no-checksum --no-compress --blocking-io. These were not the source of my problems, but the functions they relate to might trigger a network timeout.

I hope this helps; anyway, good luck solving your issue.
Comment 25 Justin Patrin 2012-10-31 01:32:36 UTC
FWIW, this problem went away for me just by removing -z and adding --no-compress.
Comment 26 Justin Patrin 2012-11-01 19:40:37 UTC
Correction: this problem *mostly* went away by turning off compression. I still get this error, 100% reproducible, right around the same point in certain large file transfers. Adding --inplace --partial allowed me to keep retrying the rsync and have it eventually finish.
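A minimal sketch of that retry idea, with a stand-in function in place of the real rsync call; the `transfer` name and the three-attempt failure pattern are invented here purely to show the loop shape:

```shell
# transfer() is a stand-in for the real call, e.g.:
#   rsync -a --partial --inplace /src/ user@host:/dest/
# It fails twice and then succeeds, purely to exercise the loop.
n=0
transfer() {
  n=$((n+1))
  [ "$n" -ge 3 ]
}
until transfer; do
  echo "transfer interrupted, retrying..."
  sleep 1
done
echo "transfer complete after $n attempts"
```

With --partial and --inplace, each retry resumes where the previous attempt stopped instead of starting the file over, which is what makes the loop converge on very large files.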
Comment 27 Sebastian 2012-11-02 09:06:27 UTC
I have a number of setups running backups with rsync to either USB memory sticks or USB external hard drives. I have only encountered this problem on the sites that use USB memory sticks. Could this be related to the sender not being able to cope with the slower speed of the receiver, governed by the slower write speed of USB memory sticks?

Anyway, after trying all sorts of things, what has cured it for me is using "--bwlimit". So far, with a value of 1000 to 2000 (KBytes/second), it has stopped the errors completely. Of course, this slows down the backup dramatically, but I've got all night :-)

I haven't tested to see how far I can push the "--bwlimit" before the errors show up again.

I use the following options "--delete --recursive --timeout=60 --max-delete=200 --links --perms --owner --group"

Hope the above helps somebody.
Comment 28 Scott Wood 2013-05-22 03:43:41 UTC
I believe I am a recent victim of this bug.  I am doing an rsync via ssh of large files (from 4GB up to a couple hundred GB).  I use -W and have disabled compression for the file type that fails most often.

Here is a rough example of the rsync push command:

rsync -avW --timeout=0 --rsync-path='/bin/rsync_3.0.9' --skip-compress=ext -e "ssh -c arcfour -i /home/user/.ssh/CUSTOM_id_rsa" /path/to/bigfile.ext remoteuser@theserver:/path/to/destination

And the connection is terminated with an ACK,RST packet from the server resulting in the following on the client:

Read from remote host theserver: Connection reset by peer
rsync: writefd_unbuffered failed to write 4 bytes to socket [sender]: Broken pipe (32)
rsync: connection unexpectedly closed (28 bytes received so far) [sender]
rsync error: unexplained error (code 255) at io.c(605) [sender=3.0.9]

I have reproduced the result with a multitude of versions, including 3.0.4, 3.0.7 and 3.0.9 on the client side, and 3.0.4 and 3.0.9 on the server side.  I have the following questions:

1) Chip Schweiss, re: comment 23.  Do you have a RHEL bug reference, so I can check whether this is resolved, or pursue it further with them under our support contract?
2) What is the consensus on the security implications and effectiveness of Chip's iptables suggestion?
3) If it boils down to needing --bwlimit, does anyone have suggestions other than trial and error for calculating the maximum rate the link can handle? As we have many TB to move, this would not be the ideal solution.
Comment 29 Scott Wood 2013-05-22 06:05:43 UTC
As a follow-up to my second question: we added the firewall rules to allow SYN,RST,ACK,FIN and ACK traffic from the server in question, and it did not solve the problem.
Comment 30 Loïc Gomez 2013-05-22 07:02:09 UTC
Did you try with one of these options: --no-checksum --no-compress --blocking-io?

Just a check, since you have a 4GB limit: I suppose you're not rsync-ing to a FAT32 filesystem?
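A quick way to check the point Loïc raises, assuming a Linux system with GNU df; the /tmp path is a placeholder for the real destination:

```shell
# FAT32 shows up as "vfat" and caps single files at 4 GiB minus 1 byte,
# which would abort large transfers with exactly this kind of error.
dest=/tmp                      # placeholder for the real rsync destination
df -T "$dest" | awk 'NR==2 {print "filesystem type:", $2}'
```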
Comment 31 Scott Wood 2013-05-23 02:21:10 UTC
(In reply to comment #30)
> Did you try with one of these options : --no-checksum --no-compress
> --blocking-io ?
> 
> Just a check since you have a 4G limit : I suppose you're not rsync-ing to a
> FAT32 filesystem ?

Hey Loïc,

I haven't tried --no-checksum or --blocking-io yet.  I'll give them a whirl.  I have tried --no-compress, though, and that did not resolve the matter.  I'll let you know how it works with combinations of --no-checksum and --blocking-io.  Thanks!

Regards,
Scott
Comment 32 Scott Wood 2013-05-23 03:50:20 UTC
I tried --no-checksum and --blocking-io, both to no avail.  I now have an admin at the other end of the connection on board for troubleshooting.  If/when we find the cause, I'll update the list.
Comment 33 michaeljoya 2013-07-10 19:26:56 UTC
I got this error while creating a backup copy between two drives on the same machine. It seems that most people on this list are wrongly attributing the problem to differing rsync versions on client and server, but it also happens in the context of a single machine.

In my own case, it turned out that the destination filesystem was simply full.

I had used an absolute path instead of a relative path for the destination, and I was working in a system-rescue environment whose root fs was entirely in memory.
So when the destination fs decides it can't accept any more writes, it also produces this error. When I fixed the destination to the correct path, which had greater capacity, the problem immediately went away.
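A pre-flight check along those lines, assuming POSIX df and awk; the /tmp path is a placeholder for the real destination, and on a full filesystem the check flips to the FULL branch:

```shell
# Rule out a full destination before blaming the network: a 0%-free
# target produces exactly this kind of broken-pipe write failure.
dest=/tmp                                  # placeholder destination path
avail_kb=$(df -Pk "$dest" | awk 'NR==2 {print $4}')
echo "available on $dest: ${avail_kb} KiB"
if [ "$avail_kb" -gt 0 ]; then
  echo "destination has free space"
else
  echo "destination is FULL: rsync writes will fail here"
fi
```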