Bug 10290 - Regression since 3.0.9: send_xattr_request: Assertion `f_out >= 0' failed
Regression since 3.0.9: send_xattr_request: Assertion `f_out >= 0' failed
Status: RESOLVED FIXED
Product: rsync
Classification: Unclassified
Component: core
3.1.0
x64 Linux
: P5 regression
: ---
Assigned To: Wayne Davison
Rsync QA Contact
:
: 10364 (view as bug list)
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2013-11-24 10:27 UTC by Moritz Bunkus
Modified: 2014-06-22 17:32 UTC (History)
1 user (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Moritz Bunkus 2013-11-24 10:27:31 UTC
Hey,

I'm running daily backups with dirvish (which uses rsync with --link-dest, very similar to rsnapshot). This creates one directory per day for the machine I'm backing up. It works fine with both rsync 3.0.9 and 3.1.0.

Once a week I also sync the whole machine directory containing those daily directories to an extern hard drive connected via USB. This weekly sync has been working fine for years with rsync 3.0.9. With 3.1.0 I consistently receive the error message mentioned in $subject.

Here's a copy&paste from the rsync invocation including the error message:

[0 root@sweet-chili ~] date ; time ionice -c3 ~mosu/opt/rsync/bin/rsync -haxHAX --delete /backup/ /backup/ext/full1/backup/ ; date
Sa 23. Nov 14:52:35 CET 2013
rsync: xattrs.c:555: send_xattr_request: Assertion `f_out >= 0' failed.
rsync: [receiver] write error: Broken pipe (32)
rsync error: error in socket IO (code 10) at io.c(823) [receiver=3.1.0dev]
rsync: [sender] write error: Broken pipe (32)
rsync error: error in socket IO (code 10) at io.c(823) [sender=3.1.0dev]
ionice -c3 ~mosu/opt/rsync/bin/rsync -haxHAX --delete /backup/   1116,40s user 4586,98s system 18% cpu 8:37:24,14 total
Sa 23. Nov 23:29:59 CET 2013

I've also spent the last couple of weeks bisecting the exact commit which caused this breakage. That's why the above paste contains that version number. The first git revision to trigger the issue is 78286a03d196efeed754aa4c1faa571f5dbea342.

All following revisions that I've tried (even some from git after the 3.1.0 release) are consistent in that they abort with the aforementioned error.

Some more facts about my system:

* Arch Linux 64bit, updated pretty much weekly
* Source file system is btrfs mounted with »noatime,nodiratime« options
* Destination file system is btrfs mounted with »compress=lzo,noatime,nodiratime« options

I'm willing to provide any further information you may require including re-runs with different options, but please keep in mind that each run takes anywhere between 7 and 12 hours depending on when the previous run was done, how much data has changed in the meantime, and whether or not it aborts.
Comment 1 Wayne Davison 2013-11-25 21:22:10 UTC
Thanks for finding the affecting commit -- very useful.  I'll check into the issue.

You should be able to work around the issue for now by specifying --protocol=30.
Comment 2 Moritz Bunkus 2013-12-02 13:47:18 UTC
I can confirm that the error does indeed not occur with --protocol=30. Anything else I can test for you?
Comment 3 Wayne Davison 2013-12-25 22:30:11 UTC
If you can help with a test case that can reproduce this bug, it would help me to debug it.  Otherwise I may just turn off this hard-linked optimization so that the copying at least doesn't fail.
Comment 4 Moritz Bunkus 2013-12-26 08:41:53 UTC
Before I posted this bug I had already tried to narrow the test down to a smaller subset of files. I hadn't been successful back then, but as each run took quite a lot of hours I hadn't spent too much time on it either.

I'll try again in a couple of days.

Is there a way to let rsync output the file names of each file it works on? Not just the ones that need updating. I'd like to know which file rsync actually chokes on.
Comment 5 Moritz Bunkus 2013-12-31 12:26:56 UTC
Unfortunately I haven't been able to narrow down the test case. I've spent the last couple of days trying to make it as small as possible. The bug occurs when I rsync a whole dirvish host directory (beneath the host directory there is one directory per day when a backup was run, and beneath those is a tree of the whole backed up file system from that day hard-linked to the previous day's backup).

It also happens for the same file each and every time (I've run »rsync -vvhaxHAX --delete src/ dest/« and the last file name output before that assertion has been the same each and every time). Apart from the files having a moderate hard link count (~ 28) I cannot find anything special about it. As a matter of fact it happens with a file from Dropbox' cache directory (~/Dropbox/.dropbox.cache/…).

As soon as I try to narrow down the test case the bug disappears. What I've tried is to move all the .dropbox.cache directories out from their huge dirvish structure into a new directory structure that was laid out the same as the dirvish structure. For example, orignal file structure: /backup/dirvish/hostname/root/20131226/tree/home/mosu/Dropbox/.dropbox.cache, similar names for other days; temporary structure: /temp/backup/dirvish/hostname/root/20131226/tree/home/mosu/Dropbox/.dropbox.cache. I've only moved each and every .dropbox.cache directory there was in the original structure to its place in the temporary structure; no other directories were populated in the temp structure.

I reasoned that those Dropbox cache files should only have hard links to files in the other Dropbox cache directories but not to any normal file outside of those directories.

Unfortunately as soon as I rsync'ed that temporary structure the assertion wasn't triggered.

Now I'm out of ideas of how to narrow it down. If you have any please let me know and I'll give them a try.
Comment 6 Wayne Davison 2014-01-12 21:53:40 UTC
*** Bug 10364 has been marked as a duplicate of this bug. ***
Comment 7 Wayne Davison 2014-01-19 23:24:59 UTC
I've backed-out this change for now since I can't find the bug that is causing this w/o having a test case.  Anyone needing to interact with a buggy 3.1.0 version can make use of the --protocol=30 option to get things going (until it can be updated to the (upcoming) 3.1.1).
Comment 8 Adam Williamson 2014-05-29 18:56:14 UTC
So it looks like a Fedora user hit this bug, and reported it as https://bugzilla.samba.org/show_bug.cgi?id=10364 (which was closed as a dupe of this one) and https://bugzilla.redhat.com/show_bug.cgi?id=1050081 . In response, Fedora's rsync maintainer backported the reversion of 78286a03d196efeed754aa4c1faa571f5dbea342 to our Fedora package, in the build rsync-3.1.0-4 .

Unfortunately this seems to be causing *other* breakage. Fedora live installations now seem to fail consistently with an rsync error - https://bugzilla.redhat.com/show_bug.cgi?id=1101557 . Another reporter notes they are now seeing the same error in a local rsync, and it goes away if they downgrade to rsync 3.1.0-3 - see https://bugzilla.redhat.com/show_bug.cgi?id=1101557#c19 .

I don't know exactly what's going on, but it seems like just reverting the original commit is just exchanging one bug for another?
Comment 9 Wayne Davison 2014-06-08 17:47:51 UTC
I'm glad that change got some testing, because I had missed a hunk in the undo of the prior change (see hash 03bb593f).  I have gone on to make the change in a better fashion so that rsync can figure out how best it should deal with a protocol-31 rsync.  The only rsyncs that will have an issue with this are the 3.1.0pre* versions and any of the fedora-patched versions, but since they have a buggy implementation, that is the best we can do.
Comment 10 Michal Luscon 2014-06-12 14:33:37 UTC
(In reply to comment #9)
> The only rsyncs that will have an issue with this are the
> 3.1.0pre* versions and any of the fedora-patched versions, but since they have
> a buggy implementation, that is the best we can do.

Could you please clarify what you mean by an issue and fedora-patched versions?
Comment 11 Wayne Davison 2014-06-22 17:32:18 UTC
Any rsync with the buggy revert applied won't work right.