10527 – Rsync Deadlock when copying files

Summary: Rsync Deadlock when copying files

Status:	RESOLVED WONTFIX

Alias:	None

Product:	rsync
Classification:	Unclassified
Component:	core
Version:	3.1.0
Hardware:	All Linux

Importance:	P5 normal
Target Milestone:	---
Assignee:	Wayne Davison
QA Contact:	Rsync QA Contact

Reported:	2014-03-30 09:35 UTC by Colin Rice
Modified:	2014-04-16 20:56 UTC
CC List:	2 users

Attachments
rsync strace output (57 bytes, text/plain) 2014-03-30 09:48 UTC, Colin Rice
sysrq blocked tasks (49 bytes, text/plain) 2014-04-16 19:24 UTC, Colin Rice

Description Colin Rice 2014-03-30 09:35:02 UTC

I'm getting a deadlock when I'm rsyncing between two local drives.

I've attached my strace output.

The rsync command is

rsync -aAXvvvvi / /mnt/backup --exclude={/dev/*,/proc/*,/sys/*,/tmp/*,/run/*,/mnt/*,/media/*,/lost+found,/var/lib/pacman/sync/*,/home/colin/data/*} --delete

Comment 1 Colin Rice 2014-03-30 09:48:36 UTC

Created attachment 9820
rsync strace output

Comment 2 Colin Rice 2014-04-03 03:40:24 UTC

Also happens with rsync 3.1.1pre1

Comment 3 roland 2014-04-08 21:34:43 UTC

does it always deadlock at the same file/position ?

please check with losf |grep rsync, to see at which file rsync got stuck...

Comment 4 roland 2014-04-08 21:42:19 UTC

pardon, typo - it should read   lsof|grep rsync

Comment 5 Colin Rice 2014-04-09 02:53:35 UTC

lsof | grep rsync is showing no open files

Comment 6 Colin Rice 2014-04-10 23:48:08 UTC

(In reply to comment #5)
> lsof | grep rsync is showing no open files
I was running lsof as a local user not root.

lsof | grep rsync is showing it stuck on the same file after repeated runs. When that file is deleted it ends up getting stuck on a different file. It is the file that the last match_sums was called on.

Comment 7 roland 2014-04-12 10:28:30 UTC

can you tell if there is some special type of mount involved?

what type of filesystem on src / destination ?

large files ?

for how long did you wait for finish ? ( https://bugzilla.samba.org/show_bug.cgi?id=8315 )

i would try rsync via localhost, i.e. make rsync use the tcpip-stack and perhaps also add bwlimit option, just to see if that makes a difference.

maybe we can see if this is an rsync issue or filesystem/disk issue.

i would also try another target path just to see how it behaves

Comment 8 Colin Rice 2014-04-12 16:51:44 UTC

(In reply to comment #7)
> can you tell if there is some special type of mount involved?
It is copying from an encrypted partition mounted with dm-crypt to another encrypted partition mounted with dm-crypt
> 
> what type of filesystem on src / destination ?
Both are btrfs
> large files ?
Some files are in the 100m range. I'm doing a full filesystem backup of a linux machine.
> 
> for how long did you wait for finish ? (
> https://bugzilla.samba.org/show_bug.cgi?id=8315 )
I've waited overnight.
> 
> i would try rsync via localhost, i.e. make rsync use the tcpip-stack and
> perhaps also add bwlimit option, just to see if that makes a difference.
Will try
> 
> maybe we can see if this is an rsync issue or filesystem/disk issue.
> 
> i would also try another target path just to see how it behaves
Like try and backup a portion of the file system?

Comment 9 roland 2014-04-12 21:26:37 UTC

>Like try and backup a portion of the file system?

no, retry with one of the filesystems being standard ext3/4 to see if source or target fs is the culprit.

btrfs is still considered experimental, so especially with rsync and dm-crypt i would not be surprised if you hit a bug....or some "interference".

Comment 10 Colin Rice 2014-04-13 01:59:27 UTC

It appears that the problem is writing to a partition mounted on top of dm-crypt. I tried btrfs and ext4 on top of dm-crypt and they both deadlocked.

Ext4 on top of a raw partition was fine.

Comment 11 Colin Rice 2014-04-13 05:47:05 UTC

I tried with btrfs on top of a raw partition as the destination and it deadlocked.

Comment 12 Colin Rice 2014-04-13 08:07:19 UTC

My apologies, I wrote to the wrong mount. btrfs on a raw partition works fine. So it appears the problem is related to dm-crypt.

Comment 13 roland 2014-04-13 08:20:50 UTC

can you reproduce the hang without using rsync, e.g. by doing an ordinary cp ?

Comment 14 Colin Rice 2014-04-13 08:43:26 UTC

If I use an ordinary cp -rx everything seems to work.

Comment 15 roland 2014-04-13 09:01:04 UTC

i have no idea on how to proceed further.

if you use an older distro, try reproducing that on a recent one, with recent kernel version.

maybe the dm-crypt people have an idea what to do to find the root cause:

dm-crypt@saout.de
http://www.saout.de/mailman/listinfo/dm-crypt
http://news.gmane.org/gmane.linux.kernel.device-mapper.dm-crypt

Comment 16 Colin Rice 2014-04-13 09:02:32 UTC

Hmm, I'm pretty sure the problem appeared sometime in 3.13 since before then rsync was working fine. I'll contact the dm-crypt people and ask for help.

Comment 17 roland 2014-04-13 09:24:09 UTC

apparently, a deadlock patch for dm-crypt has existed for about 5 months, but it seems it has not entered the mainline kernel yet, so your distro is probably missing it too:

https://github.com/pld-linux/kernel/blob/master/dm-crypt-fix-allocation-deadlock.patch

maybe this one is related, and the issue this patch fixes is not merely a theoretical one, but is now being seen in the wild :)

i'll give Mikulas Patocka a pointer to this report, maybe he can confirm that this patch is related.

Comment 18 Wayne Davison 2014-04-13 16:22:17 UTC

Having too much verbosity going is an easy way to cause rsync to hang.  If you need it, try using --msgs2stderr so that the protocol doesn't have to deal with all that verbosity.

The strace looks like everyone is trying to write to their pipe/socket file handle at the same time with nobody reading, so the above should get you unstuck.  While it would be good to try to fix such a high-verbosity deadlock, it is not something that is easy to do (since there are times that a process must write before doing more reading, and the huge quantity of messages clogs things up).

If there is some other hang you are experiencing (without the high verbosity), feel free to attach a strace of that run to this bug and re-open.
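
A minimal sketch of the stall described in this comment, not rsync's actual code: a pipe has a finite kernel buffer, so a process that keeps writing while nobody reads the other end eventually blocks. When every process in the chain is in that state at once, none of them can make progress. The function name `fill_pipe` and the chunk size are illustrative assumptions; the write end is set non-blocking only so the demonstration returns instead of hanging.

```python
import os


def fill_pipe(chunk_size=4096):
    """Write to a pipe that nobody reads until the kernel buffer is full.

    Returns the number of bytes the pipe accepted before a further
    write would have blocked.
    """
    read_fd, write_fd = os.pipe()
    # Non-blocking, so a full pipe raises BlockingIOError instead of hanging.
    os.set_blocking(write_fd, False)
    chunk = b"x" * chunk_size
    written = 0
    try:
        while True:
            written += os.write(write_fd, chunk)
    except BlockingIOError:
        # A blocking writer would sleep here until a reader drains the pipe.
        pass
    finally:
        os.close(read_fd)
        os.close(write_fd)
    return written


if __name__ == "__main__":
    print(f"pipe accepted {fill_pipe()} bytes before a write would block")
```

The exact capacity depends on the system, so no particular byte count should be expected. Sending messages to stderr with --msgs2stderr, as suggested above, keeps the verbose output out of these channels.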

Comment 19 Colin Rice 2014-04-13 20:13:55 UTC

The extra verbosity was added as an attempt to debug the deadlocks. However, as it is no longer hanging without the extra verbosity, it works for me.

Comment 20 roland 2014-04-15 21:37:37 UTC

what do you mean by "no longer hanging"? you mean you do not use dm-crypt for the target partition anymore and so the problem is solved for you?

it would be interesting to find out why the deadlock happens with dm-crypt, though.

Comment 21 roland 2014-04-16 18:54:51 UTC

> You should press alt-sysrq-w when the deadlock happens to see if there are
> any processes deadlocked in the kernel. If yes, send me the stacktrace of
> those processes.
>
> If there are not any processes deadlocked in the kernel, then it may be a
> userspace problem - a bug in rsync or something like that.
> 
> Mikulas

http://en.wikipedia.org/wiki/Magic_SysRq_key
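
On a machine without a convenient keyboard, the same blocked-task dump can be requested by writing `w` to /proc/sysrq-trigger. The following is a sketch under assumptions: it must run as root on Linux with /proc mounted, and the function name `request_blocked_task_dump` is hypothetical. The dump itself goes to the kernel log (readable with dmesg), not to the caller.

```python
def request_blocked_task_dump(trigger_path="/proc/sysrq-trigger"):
    """Ask the kernel to log all tasks in uninterruptible (blocked) state.

    Writes the SysRq command character 'w' to the trigger file and
    returns that character. Needs root when using the real trigger path.
    """
    command = "w"  # same command as pressing alt-sysrq-w
    with open(trigger_path, "w") as trigger:
        trigger.write(command)
    return command


if __name__ == "__main__":
    request_blocked_task_dump()
    print("blocked-task dump requested; see the kernel log")
```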

Comment 22 Colin Rice 2014-04-16 19:19:09 UTC

I removed the verbosity and it stopped hanging. If I drop back down to -vvvvi again, it does hang.

I still use dm-crypt for the partition.

Comment 23 Colin Rice 2014-04-16 19:24:39 UTC

Created attachment 9854
sysrq blocked tasks

Comment 24 Colin Rice 2014-04-16 20:31:10 UTC

Originally it was hanging even without the vvvvi, but since then I've wiped the backup drive completely, and upgraded my kernel from 3.13.6 to 3.14 and it is no longer hanging without the extra verbosity. I'm happy to keep providing traces, but I'm not confident that I'm reproducing the same failure as it no longer hangs without the vvvvi.

Comment 25 roland 2014-04-16 20:56:50 UTC

so, if the kernel update fixed it, we should consider this resolved