I'm getting a deadlock when I'm rsyncing between two local drives. I've attached my strace output. The rsync command is rsync -aAXvvvvi / /mnt/backup --exclude={/dev/*,/proc/*,/sys/*,/tmp/*,/run/*,/mnt/*,/media/*,/lost+found,/var/lib/pacman/sync/*,/home/colin/data/*} --delete
Created attachment 9820: rsync strace output
Also happens with rsync 3.1.1pre1
does it always deadlock at the same file/position ? please check with losf |grep rsync, to see at which file rsync got stuck...
pardon, typo - it should read lsof|grep rsync
lsof | grep rsync is showing no open files
(In reply to comment #5)
> lsof | grep rsync is showing no open files

I was running lsof as a local user, not root. lsof | grep rsync is showing it stuck on the same file after repeated runs. When that file is deleted, it ends up getting stuck on a different file. It is the file that the last match_sums was called on.
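For anyone repeating this check, here is a rough sketch of why the unprivileged lsof run came back empty. It assumes a Linux system with /proc mounted; the helper name open_files is made up for illustration and is not part of lsof or rsync. It reads the same per-process descriptor directory that lsof relies on, which only the process owner or root can list.

```python
import os

def open_files(pid):
    """Return the targets of a process's open file descriptors via /proc (Linux only)."""
    fd_dir = f"/proc/{pid}/fd"
    targets = []
    # listdir raises PermissionError for another user's process unless run as root,
    # which is why a non-root lsof shows nothing for an rsync started by root.
    for name in os.listdir(fd_dir):
        try:
            targets.append(os.readlink(os.path.join(fd_dir, name)))
        except OSError:
            continue  # descriptor was closed between listing and reading
    return targets

if __name__ == "__main__":
    # Inspect this script's own descriptors; pass an rsync PID (as root) instead.
    for target in open_files(os.getpid()):
        print(target)
```

Calling open_files with the PID of the stuck rsync, as root, should list the file it is working on, much like lsof | grep rsync does.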
can you tell if there is some special type of mount involved? what type of filesystem on src / destination ? large files ? for how long did you wait for finish ? ( https://bugzilla.samba.org/show_bug.cgi?id=8315 ) i would try rsync via localhost, i.e. make rsync use the tcpip-stack and perhaps also add bwlimit option, just to see if that makes a difference. maybe we can see if this is an rsync issue or filesystem/disk issue. i would also try another target path just to see how it behaves
(In reply to comment #7)
> can you tell if there is some special type of mount involved?

It is copying from an encrypted partition mounted with dm-crypt to another encrypted partition mounted with dm-crypt.

> what type of filesystem on src / destination ?

Both are btrfs.

> large files ?

Some files are in the 100m range. I'm doing a full filesystem backup of a Linux machine.

> for how long did you wait for finish ? (
> https://bugzilla.samba.org/show_bug.cgi?id=8315 )

I've waited overnight.

> i would try rsync via localhost, i.e. make rsync use the tcpip-stack and
> perhaps also add bwlimit option, just to see if that makes a difference.

Will try.

> maybe we can see if this is an rsync issue or filesystem/disk issue.
>
> i would also try another target path just to see how it behaves

Like try and back up a portion of the file system?
> Like try and backup a portion of the file system?

no, retry with one of the filesystems being standard ext3/4 to see if source or target fs is the culprit. btrfs is still considered experimental, so especially with rsync and dm-crypt i would not wonder if you hit a bug....or some "interference".
It appears that the problem is writing to a partition mounted on top of dm-crypt. I tried btrfs and ext4 on top of dm-crypt and they both deadlocked. Ext4 on top of a raw partition was fine.
I tried with btrfs on top of a raw partition as the destination and it deadlocked.
My apologies, I wrote to the wrong mount. btrfs on a raw partition works fine. So it appears the problem is related to dm-crypt.
can you reproduce the hang without using rsync, e.g. by doing an ordinary cp ?
If I use an ordinary cp -rx everything seems to work.
i have no idea on how to proceed further. if you use an older distro, try reproducing that on a recent one, with recent kernel version. maybe the dm-crypt people have an idea what to do to find the root cause: dm-crypt@saout.de http://www.saout.de/mailman/listinfo/dm-crypt http://news.gmane.org/gmane.linux.kernel.device-mapper.dm-crypt
Hmm, I'm pretty sure the problem appeared sometime in 3.13 since before then rsync was working fine. I'll contact the dm-crypt people and ask for help.
apparently, a deadlock patch for dm-crypt has existed for about 5 months, but it seems it has not entered the mainline kernel yet, so your distro is probably missing it too: https://github.com/pld-linux/kernel/blob/master/dm-crypt-fix-allocation-deadlock.patch maybe this one is related, and the issue this patch fixes is not merely a theoretical one but is now seen in the wild :) i have given Mikulas Patocka a pointer to this report; maybe he can confirm that this patch is related.
Having too much verbosity going is an easy way to cause rsync to hang. If you need it, try using --msgs2stderr so that the protocol doesn't have to deal with all that verbosity. The strace looks like everyone is trying to write to their pipe/socket file handle at the same time with nobody reading, so the above should get you unstuck. While it would be good to try to fix such a high-verbosity deadlock, it is not something that is easy to do (since there are times that a process must write before doing more reading, and the huge quantity of messages clogs things up). If there is some other hang you are experiencing (without the high verbosity), feel free to attach a strace of that run to this bug and re-open.
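The "everyone writing, nobody reading" situation comes down to pipes having a bounded kernel buffer: once it is full, a writer stalls until the other end reads. The following is only a minimal Python sketch of that one ingredient, not rsync's code. The function name bytes_until_pipe_full is made up for illustration, and the write end is set non-blocking purely so the demo reports the limit instead of hanging the way a blocking writer would.

```python
import os

def bytes_until_pipe_full(chunk_size=4096):
    """Write to a pipe nobody reads from; return the bytes accepted before it would block."""
    read_fd, write_fd = os.pipe()
    # Non-blocking mode makes a full pipe raise BlockingIOError instead of stalling.
    os.set_blocking(write_fd, False)
    chunk = b"x" * chunk_size
    written = 0
    try:
        while True:
            try:
                written += os.write(write_fd, chunk)
            except BlockingIOError:
                # Kernel buffer is full: a blocking writer would be stuck at this point.
                return written
    finally:
        os.close(read_fd)
        os.close(write_fd)

if __name__ == "__main__":
    capacity = bytes_until_pipe_full()
    print(f"pipe accepted {capacity} bytes before a writer would block")
```

If two processes each reach that point on the pipe feeding the other, neither gets back to reading and both wait forever. High verbosity makes this more likely simply because far more message data is pushed through those buffers.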
The extra verbosity was added as an attempt to debug the deadlocks. However, as it is no longer hanging without the extra verbosity, it works for me.
what do you mean by "no longer hanging"? you mean you do not use dm-crypt for the target partition anymore and so the problem is solved for you? it would be interesting to find out why the deadlock happens with dm-crypt, though.
> You should press alt-sysrq-w when the deadlock happen to see if there are
> any processes deadlocked in the kernel. If yes, send me the stacktrace of
> those processes.
>
> If there are not any processes deadlocked in the kernel, then it may be
> userspace problem - bug in rsync or something like that.
>
> Mikulas

http://en.wikipedia.org/wiki/Magic_SysRq_key
I removed the verbosity and it stopped hanging. If I drop back down to -vvvvi again, it does hang. I still use dm-crypt for the partition.
Created attachment 9854: sysrq blocked tasks
Originally it was hanging even without the vvvvi, but since then I've wiped the backup drive completely, and upgraded my kernel from 3.13.6 to 3.14 and it is no longer hanging without the extra verbosity. I'm happy to keep providing traces, but I'm not confident that I'm reproducing the same failure as it no longer hangs without the vvvvi.
so, if the kernel update fixed it, we should consider this resolved