Bug 10581 - --fuzzy-delay and --fuzzy-limit for fuzzy match tuning
Status: NEW
Alias: None
Product: rsync
Classification: Unclassified
Component: core
Version: 3.1.0
Hardware: All All
Importance: P5 enhancement
Target Milestone: ---
Assignee: Wayne Davison
QA Contact: Rsync QA Contact
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-05-01 13:57 UTC by Haravikk
Modified: 2014-05-01 13:57 UTC
Description Haravikk 2014-05-01 13:57:09 UTC
It seems that when backing up folders with a very large number of files, --fuzzy behaves in a sub-optimal fashion, forcing rsync to build a file list for the entire folder if even a single new (sender only) file is encountered, which can completely halt a transfer until all of the folder's contents are known.

To give you a better idea: I have a backup command that I run, and one of the items included in the backup is a huge OS X sparse bundle disk image composed of some 32,000+ bands, all stored within a single folder inside the image bundle.

With --fuzzy disabled, rsync very quickly identifies files that are new or changed and starts sending them in a reasonable amount of time (given how many there are to check). However, with --fuzzy enabled, there is a huge (hours long) delay before a single file gets transferred.

I assume this is because rsync waits for the destination file list to be completed so that it can perform fuzzy matching for similar files. With such large folders, however, this can result in an enormous delay for little gain. Such large folders aren't uncommon: modern disk image formats and well-used mail folders are just two examples. While I currently just run with --fuzzy disabled, I would rather keep it enabled for other folders, where it can improve matching against relocated files.


So I'm proposing two new --fuzzy related options as follows:

--fuzzy-limit sets a limit on the size of a folder in which fuzzy matching is performed. By setting this to, say, 500, fuzzy matching can be disabled for any particularly large folders where the benefits would be far outweighed by the delays. This is the simpler of the two to implement, I think. Giving a plain value sets a limit on folder size at both ends, while prefixing the value with a plus (e.g. --fuzzy-limit +500) tests only the sender, and prefixing it with a minus tests only the destination.
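
To make the proposed value syntax concrete, here is a rough sketch of how the plain, plus and minus forms could be interpreted. It is not rsync code: the names FuzzyLimit, parse_fuzzy_limit and fuzzy_allowed are made up for illustration, and it assumes that "size" means the number of entries in the folder and that a folder exactly at the limit still gets fuzzy matching.

```python
from typing import NamedTuple, Optional


class FuzzyLimit(NamedTuple):
    """Hypothetical parsed form of the proposed --fuzzy-limit value."""
    sender_max: Optional[int]    # None means "do not test the sender side"
    receiver_max: Optional[int]  # None means "do not test the destination side"


def parse_fuzzy_limit(arg: str) -> FuzzyLimit:
    """Parse N (both ends), +N (sender only) or -N (destination only)."""
    if arg.startswith("+"):
        return FuzzyLimit(int(arg[1:]), None)
    if arg.startswith("-"):
        return FuzzyLimit(None, int(arg[1:]))
    n = int(arg)
    return FuzzyLimit(n, n)


def fuzzy_allowed(limit: FuzzyLimit, sender_count: int, receiver_count: int) -> bool:
    """Return False when a tested side holds more entries than the limit.

    Assumption: a folder with exactly `limit` entries is still matched.
    """
    if limit.sender_max is not None and sender_count > limit.sender_max:
        return False
    if limit.receiver_max is not None and receiver_count > limit.receiver_max:
        return False
    return True


if __name__ == "__main__":
    # A small sender folder against a 32,000-entry destination folder.
    print(fuzzy_allowed(parse_fuzzy_limit("500"), 100, 32000))   # False
    print(fuzzy_allowed(parse_fuzzy_limit("+500"), 100, 32000))  # True
    print(fuzzy_allowed(parse_fuzzy_limit("-500"), 100, 32000))  # False
```

With the plus form only the sender's folder is counted, so a huge destination folder would not switch fuzzy matching off; whether that is the desirable default is open to discussion.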

--fuzzy-delay changes the behaviour of --fuzzy so that fuzzy matching is deferred until the file list for the folder is complete. In the meantime, updates and deletion checks* continue normally; once the file list is complete, any pending fuzzy matches are performed and the updates/deletions carry on. *In the case of --delete-during this may result in even more missed potential matches than normal, which is why --fuzzy-delay may not be suitable as default behaviour.
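
As a toy model of the deferral described for --fuzzy-delay, the sketch below splits the work into files that can be handled straight away (same name already exists at the destination) and files that are parked until the folder's listing is complete. Everything here is an assumption for illustration: plan_transfers and best_fuzzy_basis are invented names, and the difflib name-similarity score is only a stand-in, not rsync's actual fuzzy matching.

```python
from difflib import SequenceMatcher


def best_fuzzy_basis(name, candidates, threshold=0.6):
    """Pick the most similar destination name, or None if nothing is close.

    Stand-in scoring only: plain name similarity with an arbitrary threshold.
    """
    best, best_score = None, threshold
    for cand in sorted(candidates):
        score = SequenceMatcher(None, name, cand).ratio()
        if score > best_score:
            best, best_score = cand, score
    return best


def plan_transfers(sender_files, dest_files):
    """Return (immediate, deferred) lists of (file, basis) pairs.

    immediate: files whose exact name exists at the destination, so they
               can be updated without waiting for any fuzzy matching.
    deferred:  sender-only files, resolved only once the complete
               destination listing for the folder is available.
    """
    dest = set(dest_files)
    immediate, pending = [], []
    for name in sender_files:
        if name in dest:
            immediate.append((name, name))
        else:
            pending.append(name)  # parked until the listing is complete
    # The listing is complete at this point, so pending matches can run.
    deferred = [(name, best_fuzzy_basis(name, dest)) for name in pending]
    return immediate, deferred


if __name__ == "__main__":
    sender = ["band_1", "band_2", "report-v2.txt"]
    dest = ["band_1", "band_2", "report-v1.txt"]
    immediate, deferred = plan_transfers(sender, dest)
    print(immediate)
    print(deferred)
```

The point of the split is that the two band files never wait on the fuzzy search; only the sender-only file does, and a sender-only file with no close match ends up with no basis at all.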


Either of these features should greatly improve the performance of --fuzzy, so that particularly large directories don't cause a significant slowdown when fuzzy matching is enabled, especially when there is a difference in speed between devices (e.g. a faster sender and a slower destination such as a NAS or shared remote host).