Bug 13109 - rsync hangs during transfer of many small files
Status: RESOLVED FIXED
Alias: None
Product: rsync
Classification: Unclassified
Component: core
Version: 3.1.2
Hardware: x86 Linux
Importance: P5 normal
Target Milestone: ---
Assignee: Wayne Davison
QA Contact: Rsync QA Contact
Reported: 2017-10-29 15:21 UTC by Hansjoerg Lipp
Modified: 2020-06-04 22:05 UTC

Attachments
zip archive of the test case (182.54 KB, application/zip)
2017-11-05 14:21 UTC, Hansjoerg Lipp
Simplified test case (727 bytes, application/x-shellscript)
2017-11-06 21:06 UTC, Hansjoerg Lipp

Description Hansjoerg Lipp 2017-10-29 15:21:40 UTC
Overview
========

rsync hangs during the transfer of directories containing many small files. The blocked process must be interrupted or killed and the transfer restarted, sometimes many times in a row.

Unlike in my previous bug report, no hard links are required this time to make rsync hang; only regular files are involved.

How to reproduce
================

[ Using Linux on e.g. ext4, about 3 GiB disk space required ]
############################
mkdir rstest
cd rstest

wget 'http://www.hlipp.de/rs/mktest'
chmod u+x mktest

mkdir files
cd files
wget 'http://www.hlipp.de/rs/1518_0219.jpg_original'
wget 'http://www.hlipp.de/rs/1518_0219.jpg'

cd ..
./mktest 5000
############################

Background: This is based on a larger backup script (which explains the somewhat odd directory names etc.) that often hangs. The error often occurs when a user has many image files in a directory that get geotagged (possibly using exiftool), which renames the original files to *.jpg_original and creates new files with updated EXIF information. As I don't know whether it matters that only part of each file changes, I include an actual example image file in the test case.

The script creates a directory "dst" containing 5000 image files (representing an old backup made before geotagging) and a directory "src" containing the same files renamed to *.jpg_original plus 5000 image files with altered EXIF information. Finally,
 rsync -avvHAXSkK --backup --backup-dir="$PWD/X/bak" "$PWD/X/src/." "$PWD/X/dst/."
is executed.
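
The layout described above can be sketched roughly as follows. This is not the attached mktest script: the function name make_layout and the img_NNNNN.jpg file names are hypothetical, and the files consist of a few zero bytes instead of real images.

```shell
#!/bin/sh
# Sketch only, not the attached mktest script.
# make_layout and the img_NNNNN.jpg names are hypothetical.
make_layout() {
    root=$1     # e.g. "$PWD/X"
    count=$2    # number of files in the old backup, e.g. 5000
    mkdir -p "$root/src" "$root/dst" || return 1
    i=1
    while [ "$i" -le "$count" ]; do
        name=$(printf 'img_%05d.jpg' "$i")
        # Old backup: the file as it was before geotagging.
        head -c 16 /dev/zero > "$root/dst/$name"
        # Source: the same content, renamed to *.jpg_original ...
        head -c 16 /dev/zero > "$root/src/${name}_original"
        # ... plus a new file under the old name with altered content.
        head -c 24 /dev/zero > "$root/src/$name"
        i=$((i + 1))
    done
}
```

After calling make_layout "$PWD/X" 5000, the rsync command quoted above can be run against the resulting "src" and "dst" directories.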


Actual Results
==============

The transfer starts normally (files are backed up and transferred to the destination directory, and the log looks as expected) but then stops unexpectedly, without any message or other hint as to what is going on. The number of files transferred before the hang varies from system to system. Recent tests stopped after 1193, 1105, 972, and 1266 files.
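
Because the hang produces no message, a test run can only detect it by waiting. A minimal sketch of doing that automatically, assuming timeout(1) from GNU coreutils is available (it exits with status 124 when the deadline expires); the function name run_with_deadline is hypothetical:

```shell
#!/bin/sh
# Sketch: run a command under a deadline and report whether it hung.
# Assumes timeout(1) from GNU coreutils, which exits with status 124
# when the deadline expires. run_with_deadline is a hypothetical name.
run_with_deadline() {
    seconds=$1
    shift
    timeout "$seconds" "$@"
    status=$?
    if [ "$status" -eq 124 ]; then
        echo "HANG: command did not finish within ${seconds}s" >&2
    fi
    return "$status"
}
```

For example, run_with_deadline 600 rsync -avvHAXSkK --backup --backup-dir="$PWD/X/bak" "$PWD/X/src/." "$PWD/X/dst/." would flag a hang after ten minutes; the 600-second limit is an arbitrary choice and needs to be longer than a normal run on the system under test.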


Expected Results
================

rsync should not block but complete the transfer.


Further information
===================

This problem exists at least in rsync versions 3.1.0 and 3.1.2, on different Linux distributions (at least some OpenSUSE versions and Debian jessie) on x86_64, using various file systems (at least ext4 and xfs).
Comment 1 Hansjoerg Lipp 2017-11-05 14:21:55 UTC
Created attachment 13756
zip archive of the test case

Attached a zip archive of the test case to make reproducing the problem easier:

unzip rstest.zip  
cd rstest 
./mktest 5000
Comment 2 Hansjoerg Lipp 2017-11-06 21:06:46 UTC
Created attachment 13760
Simplified test case

I was able to simplify the test case even further: the attached test script no longer needs any example files. Simply execute
  ./mktest 5000
on a Linux system. This creates files consisting of a few zero bytes, laid out as described above, and runs rsync as described.

I can't really help debug this, as I'm not familiar with the code and the communication between the processes appears quite complex. I only got this far (current git):

Commenting out the line
  send_msg((enum msgcode)code, buf, len, 0)
in rwrite() in log.c makes the error go away.

When printing the values iobuf.msg.len, iobuf.msg.len + needed, and iobuf.msg.size in send_msg() in io.c, it can be seen that the hang occurs as soon as (iobuf.msg.len + needed) exceeds iobuf.msg.size (32768), i.e. when perform_io(needed, PIO_NEED_MSGROOM) has to be called.
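
The condition observed above can be written out as simple arithmetic. The following is only a model of that observation, not rsync's code; the function name needs_msgroom is hypothetical, and 32768 is the iobuf.msg.size value seen in the test.

```shell
#!/bin/sh
# Model of the observed condition only, not rsync's actual code.
MSG_BUF_SIZE=32768   # iobuf.msg.size as printed during the test

# Succeeds (exit status 0) when queuing the next message would exceed
# the buffer, i.e. the case in which the hang was observed.
needs_msgroom() {
    buffered=$1   # bytes already queued (iobuf.msg.len)
    needed=$2     # bytes required for the next message
    [ $((buffered + needed)) -gt "$MSG_BUF_SIZE" ]
}
```

For instance, needs_msgroom 32700 100 succeeds because 32800 exceeds 32768, whereas needs_msgroom 32668 100 does not, since the message still fits exactly.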
Comment 4 Wayne Davison 2020-06-04 22:05:58 UTC
This is fixed in the latest git version.