Bug 10263 - Extend Behaviour of the --fuzzy Parameter to Consider Directories
Summary: Extend Behaviour of the --fuzzy Parameter to Consider Directories
Status: NEW
Alias: None
Product: rsync
Classification: Unclassified
Component: core
Version: 3.1.0
Hardware: All All
Importance: P5 enhancement
Target Milestone: ---
Assignee: Wayne Davison
QA Contact: Rsync QA Contact
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2013-11-12 14:29 UTC by Haravikk
Modified: 2021-01-15 14:10 UTC

See Also:


Description Haravikk 2013-11-12 14:29:03 UTC
The --fuzzy parameter can be great for speeding up transfers that involve renamed files that haven't changed location. However, it isn't as effective if, for example, a directory was renamed, as the entire contents of the directory will end up being copied in full anyway.


What I'd like to propose is an extension of the --fuzzy parameter to also consider directories in the path, possibly with an optional depth to limit how far back it will go.

For example, say I have the following directory with three files:
/foo/bar/A
/foo/bar/B
/foo/bar/C

But in the source this is changed to:
/foo/barred/A
/foo/barred/B
/foo/barred/C

When /foo/barred/A is being considered, rsync will first need to check for an existing directory of the same name, and won't find one (causing it to create one instead). This means that fuzzy matching has no existing directory to compare within, so it should instead look one level up the path (within /foo) and quickly look at existing directories with the same (or similar) creation date. If such a directory is found (/foo/bar), then rsync can look inside it for matches for existing files, possibly using linking for speed.
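
To illustrate the directory lookup described above, here is a minimal Python sketch. It is not rsync code: the function name `find_fuzzy_dir`, the `max_time_diff` tolerance, and the use of name similarity as a tie-breaker are all assumptions made for illustration. The timestamps are plain numbers standing in for whatever date rsync would actually be able to compare.

```python
from difflib import SequenceMatcher


def find_fuzzy_dir(target_name, target_time, siblings, max_time_diff=2.0):
    """Pick an existing sibling directory that may be a renamed copy.

    target_name / target_time describe the missing directory (e.g. "barred").
    siblings is an iterable of (name, timestamp) pairs for the directories
    that already exist in the parent on the receiving side (hypothetical input).
    Returns the best candidate name, or None if nothing is close enough.
    """
    best_name = None
    best_key = None
    for name, timestamp in siblings:
        if name == target_name:
            continue  # an exact match is handled by the normal code path
        diff = abs(timestamp - target_time)
        if diff > max_time_diff:
            continue  # date too different to be the same directory
        # Assumption: prefer the closest date, then the most similar name.
        similarity = SequenceMatcher(None, name, target_name).ratio()
        key = (diff, -similarity)
        if best_key is None or key < best_key:
            best_name, best_key = name, key
    return best_name


if __name__ == "__main__":
    existing = [("bar", 1000.0), ("baz", 5000.0)]
    print(find_fuzzy_dir("barred", 1000.0, existing))
```

With the example directories above, "bar" would be chosen as the place to look for A, B and C, because its date matches the date of "barred".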

Linking is obviously preferred, but may depend a lot on whether rsync can detect in advance if the fuzzily matched directory is going to be deleted, as this would mean that linking or moving would be okay to do.


Anyway, this could help to solve or at least limit a common pitfall with synchronisation that arises when a directory is renamed or moved.

I recommend the use of an optional parameter, e.g. --fuzzy-depth, to limit how many levels of a path rsync will search for fuzzy matches, though if the behaviour is sensible enough then it may not be necessary. For example, let's extend the directory above:
/foo/bar/example/folder/A

Which is later renamed to:
/foo/barred/example/folder/A

When considering file A, rsync won't find a direct match or anywhere to look for fuzzy matches, so it will look inside /foo/barred/example for a fuzzy match for "folder", again finding nothing. It then tries /foo/barred for a match to "example" but again fails (no directory). Finally it tries inside /foo for a directory with a similar creation date to "barred", and will find /foo/bar. Inside this it will find "example", then "folder", then finally a match for file "A".
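
The walk back up the path can be sketched roughly as follows. This is a minimal Python illustration, not rsync code: `resolve_fuzzy_basis`, the `existing_dirs` set, and the `match_dir` callback are hypothetical names, and `max_depth` stands in for the proposed (not existing) --fuzzy-depth option.

```python
import posixpath


def resolve_fuzzy_basis(dest_dir, existing_dirs, match_dir, max_depth=None):
    """Find an existing directory to search for fuzzy basis files.

    dest_dir: directory of the file being transferred.
    existing_dirs: set of directory paths present on the receiving side.
    match_dir(parent, name): returns the name of a similar existing
        directory inside parent, or None (hypothetical callback).
    max_depth: how many levels to climb; None means no limit.
    """
    if dest_dir in existing_dirs:
        return dest_dir  # ordinary --fuzzy case: the directory exists
    tail = []  # components below the level being tested, innermost first
    current = dest_dir
    depth = 0
    while current not in ("/", ""):
        if max_depth is not None and depth >= max_depth:
            return None  # depth limit reached
        parent, name = posixpath.split(current)
        depth += 1
        if parent in existing_dirs:
            # First ancestor that exists: look for a renamed sibling here.
            candidate = match_dir(parent, name)
            if candidate is None:
                return None
            basis = posixpath.join(parent, candidate, *reversed(tail))
            return basis if basis in existing_dirs else None
        tail.append(name)
        current = parent
    return None


if __name__ == "__main__":
    existing = {"/", "/foo", "/foo/bar", "/foo/bar/example",
                "/foo/bar/example/folder"}

    def match(parent, name):
        # Stand-in for a date/name comparison of sibling directories.
        return "bar" if (parent, name) == ("/foo", "barred") else None

    print(resolve_fuzzy_basis("/foo/barred/example/folder", existing, match))
```

In this sketch a depth of 3 is needed to get from "folder" back to /foo, so a smaller limit would skip the match and fall back to a full copy.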

The issue is how likely it is that a search of /foo will produce multiple fuzzy matches for "barred", but I think the likelihood is low enough that it shouldn't add too much overhead, even with extremely complex hierarchies. The end result should be an improvement in transfer speed for directories that were renamed. It won't help much for directories that were moved by any significant amount, but I think renaming is more common.
Comment 1 Claudius Ellsel 2021-01-15 13:40:13 UTC
This is probably related, maybe even a duplicate of https://bugzilla.samba.org/show_bug.cgi?id=2294.
Comment 2 Haravikk 2021-01-15 14:04:53 UTC
It's certainly similar but I wouldn't say a direct duplicate; 2294 is requesting detection of move/rename *somehow* which is a tricky proposition (especially with rsync defaulting towards incremental send rather than processing everything in advance).

I posted this issue with a specific intention of how to extend the existing --fuzzy parameter (hopefully it's clear-ish what I mean, as I definitely should have re-read it a couple more times before submitting).

If the new behaviour proposed here were implemented it *could* be considered enough of a fix to satisfy 2294 as well though, or at least to satisfy it in cases of incremental sending, as this solution wouldn't require the whole file tree to be established for comparisons to be made for detecting move/renames.
Comment 3 Claudius Ellsel 2021-01-15 14:10:16 UTC
I have to admit that I only skimmed through the description. There is definitely some more to this than in the other bug.

Have a look at https://bugzilla.samba.org/show_bug.cgi?id=2294#c14 though. The patches linked there seem to already implement much of what you suggest (but not things like fine-grained control over depth, for example). Make sure to click them, as they contain a detailed description of what they do.