Bug 14315 - rsync hangs when many errors
Summary: rsync hangs when many errors
Alias: None
Product: rsync
Classification: Unclassified
Component: core
Version: 3.1.3
Hardware: All All
Importance: P5 normal
Target Milestone: ---
Assignee: Wayne Davison
QA Contact: Rsync QA Contact
Depends on:
Reported: 2020-03-05 22:31 UTC by Mark Vitale
Modified: 2020-06-08 15:34 UTC

test program to aid in reproducing the issue (659 bytes, text/x-csrc)
2020-03-05 22:31 UTC, Mark Vitale

Description Mark Vitale 2020-03-05 22:31:33 UTC
Created attachment 15843
test program to aid in reproducing the issue

When performing a local rsync of a large directory (over 10000 files), rsync hangs if a large number of errors occur on the target (destination) directory.

I am a support engineer for OpenAFS (openafs.org), and this issue was originally reported by a customer as a possible OpenAFS problem.  The customer observed a hang when rsyncing a large directory into AFS.  I was able to reproduce the problem and demonstrate that the hang is triggered when the chown calls that rsync issues to restore the group of the destination files fail.  In AFS they fail because of a security feature that prohibits the owner of a file from changing its group ownership.  The large number of resulting errors caused the three rsync processes to stall.

With the help of a colleague, I devised a way to reproduce this hang without an AFS filesystem.  To recreate the hang, we need a large number of errors while performing the rsync on a normal ext4 filesystem.  Our procedure simulates these errors with a small Linux seccomp program that blocks the chown family of syscalls (which chgrp also uses).

1. Log in to a Linux account that belongs to at least 2 groups.
$ id
uid=1000(mvitale) gid=1000(mvitale) groups=1000(mvitale),10(wheel)

2. Build a program to simulate chown/chgrp errors:
$ sudo yum install libseccomp libseccomp-devel
$ cc seccomp-chown.c -o sec-kill-chown -lseccomp

The source code for seccomp-chown.c is attached to this ticket.

3. Create a large source directory with over 10000 files. 
$ git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 

These files will all have the group ownership of the user's current group.
Any sufficiently large directory should work; it doesn't have to be a git repo.

4. Switch to the alternate group (starts a new shell)
$ newgrp wheel
$ id
uid=1000(mvitale) gid=10(wheel) groups=10(wheel),1000(mvitale)

5. Enable the error generator (this also starts a new shell)
$ ./sec-kill-chown
Running shell. chown() and friends are now unavailable.

6. Create a target directory and run rsync to reproduce the hang.
$ mkdir target
$ cd target
$ rsync -av --delete --log-file=/tmp/rlog.$$ /home/mvitale/linux ./

This should hang after a few seconds.

7. Exit the two shells (seccomp and newgrp)
$ exit
$ exit
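
The preconditions in steps 1 and 3 and the hang in step 6 can be checked mechanically.  The following is only a sketch: the function names are hypothetical and are not part of the attached test program, and it assumes the coreutils timeout(1) command is available.

```shell
#!/bin/sh
# Hypothetical helpers; none of these names come from the original report.

# Step 1 precondition: number of groups the current user belongs to.
count_groups() {
    id -G | wc -w | tr -d ' '
}

# Step 3 precondition: number of regular files below a directory.
count_files() {
    find "$1" -type f | wc -l | tr -d ' '
}

# Step 6: run a command under a time limit. timeout(1) exits with
# status 124 when the limit expires, which is treated here as a hang.
hang_check() {
    limit=$1
    shift
    timeout "$limit" "$@"
    status=$?
    if [ "$status" -eq 124 ]; then
        echo "hung (no exit within ${limit}s)"
    else
        echo "finished with status $status"
    fi
    return "$status"
}
```

For example, inside the seccomp shell from step 5, "hang_check 600 rsync -av --delete /home/mvitale/linux ./" would report a hang if rsync has not exited after 600 seconds.  The limit is an arbitrary assumption and should be well above the time a normal copy of the tree takes.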

I was able to perform a git bisect to isolate the commit that introduced this hang:

d8587b4 Change the msg pipe to use a real multiplexed IO mode for the data that goes from the receiver to the generator.	

The following releases show the problem:  master, 3.1.3, 3.1.2, 3.1.0
Releases 3.0.9 and older do not exhibit the problem.
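
The report does not say how the bisect was driven.  One possible way to automate it, shown here purely as an assumption, is a "git bisect run" script that runs the rsync built at each commit under timeout(1) and maps the result to a bisect verdict.  The helper name below is hypothetical.

```shell
# Hypothetical helper for a `git bisect run` script.
# git bisect run treats exit status 0 as good, 125 as "skip this
# commit", and any other status from 1 to 127 as bad.
bisect_verdict() {
    case "$1" in
        124)  return 1 ;;    # timeout(1) expired: rsync hung, commit is bad
        0|23) return 0 ;;    # rsync exited; 23 = partial transfer due to error
        *)    return 125 ;;  # anything else (e.g. a build failure): skip
    esac
}

# Possible use inside the run script, after building rsync at the
# commit under test (paths and the 600s limit are assumptions):
#   timeout 600 ./rsync -a --delete /home/mvitale/linux ./target/
#   bisect_verdict $?
#   exit $?
```

Status 23 is counted as good because an rsync that does not hang is still expected to finish with errors when every group change fails.  The whole bisect would have to run inside the newgrp and seccomp shells from steps 4 and 5.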

Each of the following workarounds was successful for my customer and in my testing:
- use an older version of rsync  (3.0.9 or older)
- specify rsync option --msgs2stderr
- perform the rsync under a userid with the same group as the source files
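
The --msgs2stderr workaround could be wrapped as follows.  This is a sketch with hypothetical function names; it assumes GNU sort (for -V) and that --msgs2stderr is only accepted by rsync 3.1.0 and newer, so the option is left out for older releases, which do not show the hang anyway.

```shell
# Hypothetical helpers; the names are illustrative only.

# True if version $1 is greater than or equal to version $2.
# Relies on the -V (version sort) option of GNU sort.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Print the installed rsync version, taken from the first line of
# `rsync --version` (e.g. "rsync  version 3.1.3  protocol version 31").
rsync_version() {
    rsync --version | awk 'NR == 1 { print $3 }'
}

# Run rsync with --msgs2stderr when the installed version is 3.1.0 or
# newer (assumption: older releases do not accept the option).
rsync_no_hang() {
    if version_ge "$(rsync_version)" 3.1.0; then
        rsync --msgs2stderr "$@"
    else
        rsync "$@"
    fi
}
```

With this in place, step 6 would become "rsync_no_hang -av --delete /home/mvitale/linux ./".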

Thanks for your consideration, and please let me know if there's anything else I can provide to help.

Mark Vitale
Comment 1 Mark Vitale 2020-03-05 22:37:55 UTC
Sorry, I gave the wrong commit in my report.  I bisected this hang to:

1a2704512a6f6c9bf267042ff8beb50a24e1d057 is the first bad commit
commit 1a2704512a6f6c9bf267042ff8beb50a24e1d057
Author: Wayne Davison <wayned@samba.org>
Date:   Wed Dec 21 08:30:07 2011 -0800

    Improve the handling of verbose/debug messages
Comment 2 Wayne Davison 2020-06-04 22:59:44 UTC
Should be fixed in the latest git version.
Comment 3 Mark Vitale 2020-06-08 15:34:39 UTC
Thank you very much!