Bug 10263 - Extend Behaviour of the --fuzzy Parameter to Consider Directories
Status: NEW
Product: rsync
Classification: Unclassified
Component: core
Version: 3.1.0
Hardware: All All
Importance: P5 enhancement
Target Milestone: ---
Assigned To: Wayne Davison
QA Contact: Rsync QA Contact
Depends on:
Blocks:
Reported: 2013-11-12 14:29 UTC by Haravikk
Modified: 2013-11-12 14:29 UTC (History)
CC List: 0 users

Description Haravikk 2013-11-12 14:29:03 UTC
The --fuzzy parameter can be great for speeding up transfers that involve renamed files that haven't changed location. However, it isn't as effective if, for example, a directory was renamed, as the entire contents of the directory will end up being copied in full anyway.


What I'd like to propose is an extension of the --fuzzy parameter to also consider directories in the path, possibly with an optional depth to limit how far back it will go.

For example, say I have the following directory with three files:
/foo/bar/A
/foo/bar/B
/foo/bar/C

But in the source this is changed to:
/foo/barred/A
/foo/barred/B
/foo/barred/C

When /foo/barred/A is being considered, rsync will first check for an existing directory of the same name and won't find one (causing it to create one instead). This means that fuzzy matching has no existing directory to compare within, so it should instead look one level up the path (within /foo) and quickly look for existing directories with the same (or a similar) creation date. If such a directory is found (/foo/bar), then rsync can look inside it for matches for existing files, possibly using linking for speed.
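
A rough sketch of that lookup, in Python purely for illustration (rsync itself is written in C and has no such function). The name find_fuzzy_dir, the tolerance parameter and the use of name similarity as a tie-breaker are all assumptions of mine; st_mtime stands in for the creation date, since a creation time isn't portably available.

```python
import difflib
import os


def find_fuzzy_dir(parent, wanted_name, wanted_time, tolerance=1.0):
    """Return the path of the best fuzzy match for a missing directory, or None.

    A candidate is any other directory in `parent` whose timestamp lies within
    `tolerance` seconds of `wanted_time`; name similarity breaks ties.
    """
    best_path, best_score = None, -1.0
    with os.scandir(parent) as entries:
        for entry in entries:
            if entry.name == wanted_name:
                continue
            if not entry.is_dir(follow_symlinks=False):
                continue
            # Assumption: st_mtime is used in place of the "creation date".
            stamp = entry.stat(follow_symlinks=False).st_mtime
            if abs(stamp - wanted_time) > tolerance:
                continue
            score = difflib.SequenceMatcher(None, wanted_name, entry.name).ratio()
            if score > best_score:
                best_path, best_score = entry.path, score
    return best_path
```

Called as find_fuzzy_dir("/foo", "barred", timestamp_from_source), this would return "/foo/bar" if that directory carries a matching timestamp, and None if nothing in /foo qualifies.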

Linking is obviously preferred, but may depend a lot on whether rsync can detect in advance if the fuzzily matched directory is going to be deleted, as this would mean that linking or moving would be okay to do.
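
To illustrate the link-or-copy decision, here is a minimal Python sketch. It assumes the matched file has already been verified as identical, that both paths are on the same filesystem, and that the caller knows whether the old directory still exists in the source and whether --delete is in effect. The function reuse_match and its parameters are hypothetical names, not anything rsync provides.

```python
import os
import shutil


def reuse_match(matched_file, target_file, old_dir_in_source, delete_enabled):
    """Reuse a fuzzily matched file at its new path; return "linked" or "copied"."""
    os.makedirs(os.path.dirname(target_file), exist_ok=True)
    if delete_enabled and not old_dir_in_source:
        # The old directory is going to be deleted anyway, so a hard link is
        # safe: only the new name will survive the delete pass.
        os.link(matched_file, target_file)
        return "linked"
    # The old directory stays, so keep the two files independent.
    shutil.copy2(matched_file, target_file)
    return "copied"
```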


Anyway, this could help to solve or at least limit a common pitfall with synchronisation that arises when a directory is renamed or moved.

I recommend the use of an optional parameter, e.g. --fuzzy-depth, to limit how many levels of a path rsync will search for fuzzy matches, though if the behaviour is sensible enough then it may not be necessary. For example, let's extend the directory above:
/foo/bar/example/folder/A

Which is later renamed to:
/foo/barred/example/folder/A

When considering file A, rsync won't find a direct match or anywhere to look for fuzzy matches, so it will look inside /foo/barred/example for a fuzzy match for "folder", again finding nothing. It then tries /foo/barred for a match to "example", but again fails (no directory). Finally, it tries inside /foo for a directory with a similar creation date to "barred" and finds /foo/bar. Inside this it will find "example", then "folder", then finally a match for file "A".
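
The climb described above could be sketched like this in Python, again only as an illustration under stated assumptions. find_fuzzy_basis is a hypothetical name, max_depth plays the role of the proposed --fuzzy-depth, and match_dir is a caller-supplied function taking (parent, name) and returning the path of a fuzzily matched sibling directory or None.

```python
import os


def find_fuzzy_basis(dest_root, rel_path, match_dir, max_depth=None):
    """Return an existing file to use as a basis for rel_path, or None."""
    parts = rel_path.split("/")
    levels_tried = 0
    # i is the index of the directory component we try to replace fuzzily,
    # starting with the file's own directory and climbing towards the root.
    for i in range(len(parts) - 2, -1, -1):
        if max_depth is not None and levels_tried >= max_depth:
            break
        levels_tried += 1
        parent = os.path.join(dest_root, *parts[:i])
        if not os.path.isdir(parent):
            continue  # nothing to search at this level, climb further
        matched = match_dir(parent, parts[i])
        if matched is None:
            continue
        # Descend again through the unchanged remainder of the path.
        candidate = os.path.join(matched, *parts[i + 1:])
        if os.path.isfile(candidate):
            return candidate
    return None
```

For "foo/barred/example/folder/A" the match is only found on the third level tried, so a depth limit of 2 would give up before reaching /foo.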

The issue is how likely it is that a search of /foo will produce multiple fuzzy matches for "barred", but I think the likelihood is low enough that it shouldn't add too much overhead, even with extremely complex hierarchies. The end result should be an improvement in transfer speed for directories that were renamed. It won't help much for directories that were moved by any significant amount, but I think renaming is more common.