The Samba-Bugzilla – Bug 2294
Detect renamed files and handle by renaming instead of delete/re-send
Last modified: 2017-01-22 15:19:34 UTC
It would be nice if rsync could detect identical files with differing names and
just copy/rename the files instead of sending the data all over again.
As I understand it, rsync creates a single long array listing of the filenames
and associated hashes. If it's possible to index on the hashes, cross-checked
with file size, this should be fairly straight-forward, requiring no major
redesign to implement.
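To make the suggestion concrete, here is a minimal sketch of indexing files by (size, hash) and looking up renamed candidates. It is not how rsync works internally: as far as I know, the rsync file list only carries whole-file checksums when --checksum is used. It also assumes both trees are reachable as local paths. The function names (`file_digest`, `index_by_size_and_hash`, `find_renames`) are hypothetical and chosen only for illustration.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def index_by_size_and_hash(root):
    """Map (size, digest) -> list of paths relative to root."""
    index = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            key = (os.path.getsize(full), file_digest(full))
            index[key].append(os.path.relpath(full, root))
    return index


def find_renames(src_root, dst_root):
    """Return {new_path: existing_dst_path} for source files that are
    missing on the destination by name but present there by content."""
    src_index = index_by_size_and_hash(src_root)
    dst_index = index_by_size_and_hash(dst_root)
    dst_paths = {p for paths in dst_index.values() for p in paths}
    renames = {}
    for key, src_paths in src_index.items():
        candidates = dst_index.get(key, [])
        if not candidates:
            continue
        for path in src_paths:
            if path not in dst_paths:
                # Any destination file with the same size and digest
                # can be copied or renamed locally instead of re-sent.
                renames[path] = candidates[0]
    return renames
```

Hashing every file on both sides is expensive, so a real implementation would presumably compare sizes first and hash only files whose sizes collide.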
The enhancement is easily motivated if you think about what happens when rsync
is used to keep two large servers in sync, and a maintainer renames a top-level
directory on the source machine.
I totally agree with this one. With this enhancement there would no longer be
unnecessary traffic when a user has moved or copied a large directory (which
is really annoying).
This is the basic idea behind fuzzy.diff in the patches dir. It does not
currently try to find a basis-file match based on size and mtime (just
similarity of names), but I plan to extend it with that functionality when I fix
some of the patch's other minor problems (see the patch for a list of them).
Note that the --fuzzy patch has made it into the CVS version. It only looks for
renamed files in the same directory as the file being created, though, so it is
not a full solution to files being moved around in the hierarchy, or directory
names changing (that will require a pre-scan on the receiving side, which is not
currently done unless --delete was specified).
I'll leave this open for now as a suggestion for a more extensive rename detector.
There is now a patch named detect_renamed.diff in the patches dir that implements the basics of finding renamed files. This will probably go onto the trunk for the release after 2.6.7.
Thanks. This will be especially useful for log directories where logrotate is incrementing the filename number at each rotation period (httpd.10.gz -> httpd.11.gz).
I'm using rsync 2.6.9 to archive rotated log files to another machine, like Bill wrote. I tried
rsync -avzh --partial --fuzzy src dest
rsync -avzh --partial --delete --fuzzy --delete-after src dest
but both calls always copy all renamed/rotated log files. And of course the files are still in the same directory after being rotated! The logs are very large (several gigs) so it takes too long to be a valuable solution.
Is the patch not included in 2.6.9 or did I miss something?
Add-on question: does rsync switch off -z for .gz files in the affected directory? I think that would be a good idea.
(In reply to comment #6)
> Is the patch not included in 2.6.9 or did I miss something?
Correct, --detect-renamed still exists as a patch; it is not in the main version of rsync.
> Add-on question: does rsync switch off -z for .gz files in the affected
Yes, by default, rsync exempts files with a number of suffixes (including .gz) from -z. Since rsync 3.0.0, you can customize the list of suffixes with --skip-compress=LIST.
(In reply to comment #5)
> Thanks. This will be especially useful for log directories where logrotate is
> incrementing the filename number at each rotation period (httpd.10.gz ->
Since I mentioned this specific use case, I should comment that I recently discovered the 'dateext' option to logrotate which provides a complete workaround in this scenario (which rsync handles perfectly) and might be the better solution for this case in general.
Back on topic, there's still great utility in detecting other rename cases, of course (I often see big .iso's get renamed). I have to admit to having tried the patch, had trouble with short backups, and backed it out without making a good note of specifics. What would be generally useful here for reporting problems against the patch?
I'm interested in this feature so this is a reminder to whoever is involved in this and particularly to Wayne.
Also, I've found the name of the program "Unison" in the context of this issue twice on the mailing list.
*** Bug 6996 has been marked as a duplicate of this bug. ***
Here are some related discussions about this:
Hi, I was about to enter a similar suggestion to this. My very frequent use case is moving files from one directory to another. In that situation the file name does not change--just the directory path leading to it. These are often quite large files (0.2 to several GB) so avoiding re-copying them would speed things up a lot.
As mentioned previously, two patches have been developed (detect-renamed.diff and detect-renamed-lax.diff).
If you want to use these patches on Mac OS X, you will find hints here: http://samba.2283325.n4.nabble.com/detect-renamed-for-mac-users-proposition-of-a-modification-td3209591.html
How to apply those 2 detect-renamed* patches? I did
git clone git://git.samba.org/rsync.git
and tried to
patch -p1 <patches/detect-renamed.diff
but that doesn't succeed. Which version would I need to check out to get the patches applied? Sorry, I don't know git.
You don't need git to get the sources: http://samba.anu.edu.au/ftp/rsync/
Choose "rsync-3.0.7.tar.gz" and "rsync-patches-3.0.7.tar.gz".
Damn, that was too easy ;-) Thanks a lot. I'll test the new detect-renamed* patches now.
Has this issue been abandoned? It's been a "while"...
Hey, as far as I could find out, there are two patches which still have not made it into the latest official release?
Are they still buggy?
Why didn't they make it into an official release?
9 years is quite a long time for a possible solution...
I've been playing with the --detect-renamed patch.
I can't seem to get it to work. Does it rely on other patches?
Anyway, in a simple test using -vv -a --detect-renamed I get messages about "found renamed", etc., but in a real test, after renaming large directories, there is no speed-up. I can only surmise it's not actually renaming.
I have several applications where this would be a very handy feature to have. I don't mind using the patch, if I could just get it to work...
Btw, I'm on Mac OS 10.9.2.
There is a bug #8847 in the patchset when partial-dir cannot be created. The fix is described there.
Wow 10 years.
Maybe one reason this has not been implemented is there are other options.
For example, I have been using a shell script as a wrapper to reduce the irritation of this bug. Here is how it works:
1) Create two lists of files, one for the destination and one for the source, with file sizes and paths.
2) For each file that is in the destination but not in the source, do steps 3 to 5.
3) Create a subset of the source list containing the files of the same size.
4) If the subset is not empty, hash the destination file and each file in the source subset until a match is found.
5) Ensure the directory exists on the destination and move/rename the file.
6) On some systems hashing can be as expensive as re-transferring the file, so I added one option to move the file without hashing if there is exactly one size match (hashing only when there are several), and another to skip the file if there is more than one match (never hashing).
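The steps above can be sketched roughly as follows. This is not the original shell script, only an illustration in Python under a few assumptions: both trees are reachable as local paths, the `mode` values ("always_hash", "hash_if_ambiguous", "never_hash") are names I made up for the two options described in step 6, and source files that already exist on the destination under the same name are excluded from the candidates so that nothing gets overwritten.

```python
import hashlib
import os
import shutil


def list_sizes(root):
    """Step 1: return {relative_path: size} for every file under root."""
    sizes = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            sizes[os.path.relpath(full, root)] = os.path.getsize(full)
    return sizes


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def plan_moves(src_root, dst_root, mode="always_hash"):
    """Return a list of (old_dst_path, new_dst_path) renames to perform."""
    src = list_sizes(src_root)
    dst = list_sizes(dst_root)
    # Group source files missing on the destination by size (assumption:
    # names already present on the destination are never move targets).
    by_size = {}
    for rel, size in sorted(src.items()):
        if rel not in dst:
            by_size.setdefault(size, []).append(rel)
    moves, claimed = [], set()
    for old_rel, size in sorted(dst.items()):
        if old_rel in src:
            continue  # step 2: only files gone from the source
        candidates = [p for p in by_size.get(size, []) if p not in claimed]
        if not candidates:
            continue  # step 3 produced an empty subset
        if len(candidates) == 1 and mode != "always_hash":
            match = candidates[0]  # step 6: trust a unique size match
        elif mode == "never_hash":
            continue  # step 6: ambiguous, skip rather than hash
        else:
            # Step 4: hash until a source candidate matches.
            old_hash = sha256_of(os.path.join(dst_root, old_rel))
            match = next(
                (p for p in candidates
                 if sha256_of(os.path.join(src_root, p)) == old_hash),
                None,
            )
        if match is None:
            continue
        claimed.add(match)
        moves.append((old_rel, match))
    return moves


def apply_moves(dst_root, moves):
    """Step 5: create target directories and rename files on the destination."""
    for old_rel, new_rel in moves:
        target = os.path.join(dst_root, new_rel)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        shutil.move(os.path.join(dst_root, old_rel), target)
```

Running `apply_moves(dst, plan_moves(src, dst))` before the normal rsync run would leave only genuinely new data to transfer. Note that this does not help with numeric log rotation, where every old name still exists on the source.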
Though as I am re-evaluating my backup strategy I am looking into git-annex and other solutions.
Looking for this capability prior to entering it as an enhancement request myself, I found everything here and basically have the same use case. My version is that I am creating a regular backup of logs from many servers' services onto a single box, and doing so with rsync. Some of those services still do the .1, .2, .3 file rotation, which makes for a lot of needless work, especially when these are 100+ MiB files. It would be great if rsync could detect this to just transfer the new file and rename the old ninety-nine (or however-many).
On Sun, 06 Mar 2016 22:20:16 +0000
> --- Comment #23 from email@example.com ---
> Looking for this capability prior to entering it as an enhancement
> request myself, I found everything here and basically have the same
> use case. My version is that I am creating a regular backup of logs
> from many servers' services onto a single box, and doing so with
> rsync. Some of those services still do the .1, .2, .3 file rotation,
> which makes for a lot of needless work, especially when these are
> 100+ MiB files. It would be great if rsync could detect this to just
> transfer the new file and rename the old ninety-nine (or
It is not so hard to add the following to your logrotate.conf.
# Add a date extension instead of just a number for rsync hardlinked
Maybe unison could handle such renames better?
### What's the difference between --fuzzy and --detect-renamed?
If I understand correctly, --fuzzy looks only in the destination folder, for either a file that has an identical size and modified time or a similarly named file, and uses it as a basis file.
Whereas --detect-renamed looks everywhere for files that match either in size and modified time, or in size and checksum (when --checksum is used), and uses each match as a basis file.
So the main difference is destination folder only vs. everywhere, am I right?
### Some questions:
# --fuzzy can be used twice to look in --link-dest folders, which is useful when backing up to an empty directory. What about --detect-renamed?
# Don't these 2 options use a lot of memory when backing up a very large number of files (even more so when also looking in --link-dest folders)? Don't they maintain an in-memory list of files?
# Will these options only do their job when needed (when a basis file has to be found), or every time?
# Do these options impact destination performance, or do they benefit from already-done scans? For example, will -yy scan all --link-dest dirs (disk IO intensive) even if perhaps it's not needed?
# About --detect-renamed, let's imagine foo/A has been found in bar/. Will it be smart enough to directly search for foo/B in bar/, instead of restarting a whole lookup?
# These 2 options use the found file as a basis file. Let's imagine the found file matches completely, and we are using --link-dest. Could we think about linking the file instead of copying it?
# Last, do you have plans to bring --detect-renamed onto the trunk?
Thank you very much for this in-depth analysis!
I recently ran into the problem that a large file set got renamed and then re-sent. I tried to fix it after the fact, so I went the obvious way of comparing sizes and modtimes on the destination and calculating checksums for potential matches. I would have preferred to use a list of inode numbers and files for the old and new file sets instead...
So I wonder whether a different approach to the problem could make sense:
a) The filelist contains inode numbers, and after a successful rsync, a file is generated in the target dir listing inodes and names of all files transferred.
b) When receiving to the same dir, if a target file does not exist, the inode in the filelist is used to look up the previous filename. If it exists and matches in size and modtime, it could be hardlinked. Deleting files from the target that are no longer in the source would take care of the old file. When the sync is completed without error, the list of inodes and file names would be updated.
c) When receiving in link-dest mode, the file in the old dir would be consulted for a potential match, and the new list would be created in the target dir.
Of course this only makes sense if inode numbers are reliable, as on all standard local file systems or NFS. I do not know whether the new storage arenas preserve inodes. It is obvious that the same inode may appear more than once in a source file set.
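To illustrate points a) and b), here is a rough sketch of the inode-list idea in Python. Nothing like this exists in rsync as far as I know. The manifest file name `.inode-manifest.json` and all function names are hypothetical. The sketch assumes both trees are local paths, that inode numbers stay stable across renames on the source file system, and it keeps only one name per inode, so hard-linked source files are not handled.

```python
import json
import os

MANIFEST_NAME = ".inode-manifest.json"  # hypothetical file name


def scan_inodes(root):
    """Return {inode_number_as_string: relative_path} for files under root."""
    result = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if name == MANIFEST_NAME:
                continue
            full = os.path.join(dirpath, name)
            result[str(os.lstat(full).st_ino)] = os.path.relpath(full, root)
    return result


def save_manifest(src_root, dst_root):
    """Point a): after a successful sync, record source inodes and names."""
    with open(os.path.join(dst_root, MANIFEST_NAME), "w") as f:
        json.dump(scan_inodes(src_root), f)


def load_manifest(dst_root):
    """Return the previous manifest, or an empty dict if there is none."""
    try:
        with open(os.path.join(dst_root, MANIFEST_NAME)) as f:
            return json.load(f)
    except FileNotFoundError:
        return {}


def link_renamed(src_root, dst_root):
    """Point b): hard-link files whose inode was previously synced under
    another name. Returns a list of (old_path, new_path) pairs."""
    previous = load_manifest(dst_root)
    linked = []
    for inode, new_rel in scan_inodes(src_root).items():
        new_dst = os.path.join(dst_root, new_rel)
        old_rel = previous.get(inode)
        if old_rel is None or old_rel == new_rel or os.path.exists(new_dst):
            continue
        old_dst = os.path.join(dst_root, old_rel)
        if not os.path.isfile(old_dst):
            continue
        src_st = os.stat(os.path.join(src_root, new_rel))
        dst_st = os.stat(old_dst)
        # Only trust the inode match if size and modtime also agree.
        if (src_st.st_size == dst_st.st_size
                and int(src_st.st_mtime) == int(dst_st.st_mtime)):
            os.makedirs(os.path.dirname(new_dst), exist_ok=True)
            os.link(old_dst, new_dst)
            linked.append((old_rel, new_rel))
    return linked
```

The old name is left in place on purpose: as described in b), deleting files that are no longer in the source (for example with --delete) would take care of it, and `save_manifest` would be called again once the sync completes without error.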