It would be nice if rsync could detect identical files with differing names and
just copy/rename the files instead of sending the data all over again.
As I understand it, rsync creates a single long array listing the filenames
and associated hashes. If it's possible to index on the hashes, cross-checked
with file size, this should be fairly straightforward, requiring no major
redesign to implement.
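Just to illustrate the matching step being proposed, here is a rough sketch with ordinary shell tools (this is not rsync's internal file list; the /tmp/rd-* paths and the use of md5sum are my own choices for the demo):

```shell
#!/bin/sh
# Build a "size:checksum path" index for two trees and report content
# that exists on both sides under different names (rename candidates).
src=/tmp/rd-src; dst=/tmp/rd-dst
rm -rf "$src" "$dst"; mkdir -p "$src" "$dst"
echo "same content" > "$src/new-name.bin"   # renamed on the source side
echo "same content" > "$dst/old-name.bin"   # old name still on the destination

index() {
    # one line per file: "<size>:<checksum> <relative path>", sorted on the key
    ( cd "$1" && find . -type f -exec md5sum {} + ) |
    while read -r sum path; do
        printf '%s:%s %s\n' "$(wc -c < "$1/$path")" "$sum" "$path"
    done | sort
}

index "$src" > /tmp/rd-src.idx
index "$dst" > /tmp/rd-dst.idx

# join on the size:checksum key; differing paths suggest a rename
join /tmp/rd-src.idx /tmp/rd-dst.idx |
awk '$2 != $3 { print "rename candidate: " $3 " -> " $2 }'
```

On a GNU system this reports the destination's old name as a rename candidate for the source's new name, which is exactly the lookup the request asks rsync to do internally instead of re-sending the data.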
The enhancement is easily motivated if you think about what happens when rsync
is used to keep two large servers in sync, and a maintainer renames a top-level
directory on the source machine.
I totally agree with this one. With this enhancement there would no longer be
unnecessary traffic when a user has moved or copied a large directory (which
is really annoying).
This is the basic idea behind fuzzy.diff in the patches dir. It does not
currently try to find a basis-file match based on size and mtime (just
similarity of names), but I plan to extend it with that functionality when I fix
some of the patch's other minor problems (see the patch for a list of them).
Note that the --fuzzy patch has made it into the CVS version. It only looks for
renamed files in the same directory as the file being created, though, so it is
not a full solution to files being moved around in the hierarchy, or directory
names changing (that will require a pre-scan on the receiving side, which is not
currently done unless --delete was specified).
I'll leave this open for now as a suggestion for a more extensive rename detector.
There is now a patch named detect_renamed.diff in the patches dir that implements the basics of finding renamed files. This will probably go onto the trunk for the release after 2.6.7.
Thanks. This will be especially useful for log directories where logrotate is incrementing the filename number at each rotation period (httpd.10.gz -> httpd.11.gz).
I'm using rsync 2.6.9 to archive rotated log files to another machine, like Bill wrote. I tried
rsync -avzh --partial --fuzzy src dest
rsync -avzh --partial --delete --fuzzy --delete-after src dest
but both calls always copy all renamed/rotated log files. And of course the files are still in the same directory after being rotated! The logs are very large (several gigs), so it takes too long to be a viable solution.
Is the patch not included in 2.6.9 or did I miss something?
Add-on question: does rsync switch off -z for .gz files in the affected directory? I think that would be a good idea.
(In reply to comment #6)
> Is the patch not included in 2.6.9 or did I miss something?
Correct, --detect-renamed still exists as a patch; it is not in the main version of rsync.
> Add-on question: does rsync switch off -z for .gz files in the affected
Yes, by default, rsync exempts files with a number of suffixes (including .gz) from -z. Since rsync 3.0.0, you can customize the list of suffixes with --skip-compress=LIST.
(In reply to comment #5)
> Thanks. This will be especially useful for log directories where logrotate is
> incrementing the filename number at each rotation period (httpd.10.gz ->
Since I mentioned this specific use case, I should comment that I recently discovered the 'dateext' option to logrotate which provides a complete workaround in this scenario (which rsync handles perfectly) and might be the better solution for this case in general.
Back on topic, there's still great utility in detecting other rename cases, of course (I often see big .iso's get renamed). I have to admit I tried the patch, had trouble with short backups, and backed it out without making a good note of the specifics. What would be generally useful here for reporting problems against the patch?
I'm interested in this feature so this is a reminder to whoever is involved in this and particularly to Wayne.
Also, I've found the name of the program "Unison" in the context of this issue twice on the mailing list.
*** Bug 6996 has been marked as a duplicate of this bug. ***
Here are some related discussions about this:
Hi, I was about to enter a similar suggestion to this. My very frequent use case is moving files from one directory to another. In that situation the file name does not change--just the directory path leading to it. These are often quite large files (0.2 to several GB) so avoiding re-copying them would speed things up a lot.
As mentioned previously, two patches have been developed (detect-renamed.diff and detect-renamed-lax.diff).
If you want to use these patches on Mac OS X, you will find hints here: http://samba.2283325.n4.nabble.com/detect-renamed-for-mac-users-proposition-of-a-modification-td3209591.html
How do I apply those two detect-renamed* patches? I did
git clone git://git.samba.org/rsync.git
and tried to
patch -p1 <patches/detect-renamed.diff
but that doesn't succeed. Which version would I need to check out to get the patches applied? Sorry, I don't know git.
You don't need git to get the sources: http://samba.anu.edu.au/ftp/rsync/
Choose "rsync-3.0.7.tar.gz" and "rsync-patches-3.0.7.tar.gz".
Damn, that was too easy ;-) Thanks a lot. I'll test the new detect-renamed* patches now.
Has this issue been abandoned? It's been a "while"...
Hey, as far as I found out, there are two patches which still haven't made it into the latest official release?
Are they still buggy?
Why haven't they made it into an official release?
Nine years is quite a long time for a possible solution...
I've been playing with the --detect-renamed patch.
I can't seem to get it to work. Does it rely on other patches?
Anyway, in a simple test using -vv -a --detect-renamed I can see messages about "found renamed", etc., but in a real test, after renaming large directories, there is no speedup. I can only surmise it's not actually renaming.
I have several applications where this would be a very handy feature to have. I don't mind using the patch, if I could just get it to work...
Btw, I'm on Mac OS 10.9.2.
There is a bug (#8847) in the patchset that occurs when the partial-dir cannot be created. The fix is described there.
Wow 10 years.
Maybe one reason this has not been implemented is there are other options.
For example, I have been using a shell script as a wrapper to reduce the impact of this bug. Here is how it works:
1) Create two lists of files, destination and source, with file sizes and paths.
2) For each file that is in the destination but not the source:
3) Create a subset of the source list containing files of the same size.
4) If the subset is non-empty, hash the destination file and each file in the source subset until a match is found.
5) Ensure the directory exists on the destination and move/rename the file.
6) On some systems hashing can be as expensive as re-transferring the file, so I added an option to move the file if there was exactly one match (only sometimes hashing), and another to skip if there were more (never hashing).
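A minimal sketch of those steps (my own simplified reconstruction, not the author's actual script; the /tmp/wr-* demo trees and the md5sum choice are assumptions, and step 6's match-count heuristics are left out):

```shell
#!/bin/sh
# For each destination-only file, look for a same-size source file with
# the same checksum, and rename the destination copy to match (steps 1-5).
src=/tmp/wr-src; dst=/tmp/wr-dst
rm -rf "$src" "$dst"; mkdir -p "$src" "$dst"
printf 'big payload\n' > "$src/renamed.log"   # new name on the source
printf 'big payload\n' > "$dst/old.log"       # old name on the destination

( cd "$dst" && find . -type f ) | while read -r rel; do
    [ -e "$src/$rel" ] && continue            # same name exists; nothing to do
    dsize=$(wc -c < "$dst/$rel")
    ( cd "$src" && find . -type f ) | while read -r cand; do
        [ "$(wc -c < "$src/$cand")" -eq "$dsize" ] || continue   # size filter
        if [ "$(md5sum < "$dst/$rel")" = "$(md5sum < "$src/$cand")" ]; then
            mkdir -p "$dst/$(dirname "$cand")" # ensure the target dir exists
            mv "$dst/$rel" "$dst/$cand"        # rename instead of re-copying
            break
        fi
    done
done
```

A subsequent rsync run then finds the file already in place under its new name and has nothing to transfer.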
Though as I am re-evaluating my backup strategy I am looking into git-annex and other solutions.
Looking for this capability prior to entering it as an enhancement request myself, I found everything here and basically have the same use case. My version is that I am creating a regular backup of logs from many servers' services onto a single box, and doing so with rsync. Some of those services still do the .1, .2, .3 file rotation, which makes for a lot of needless work, especially when these are 100+ MiB files. It would be great if rsync could detect this to just transfer the new file and rename the old ninety-nine (or however-many).
On Sun, 06 Mar 2016 22:20:16 +0000
> --- Comment #23 from email@example.com ---
> Looking for this capability prior to entering it as an enhancement
> request myself, I found everything here and basically have the same
> use case. My version is that I am creating a regular backup of logs
> from many servers' services onto a single box, and doing so with
> rsync. Some of those services still do the .1, .2, .3 file rotation,
> which makes for a lot of needless work, especially when these are
> 100+ MiB files. It would be great if rsync could detect this to just
> transfer the new file and rename the old ninety-nine (or
It is not so hard to add the following to your logrotate.conf:
# Add a date extension instead of just a number for rsync hardlinked
dateext
On Sun, 06 Mar 2016 22:20:16 +0000
> --- Comment #23 from firstname.lastname@example.org ---
> Looking for this capability prior to entering it as an enhancement request
> myself, I found everything here and basically have the same use case. [...]
Maybe unison could handle such renames better?
### What's the diff between --fuzzy and --detect-renamed ?
If I understand correctly, --fuzzy looks only in the destination folder for either a file with an identical size and modified-time, or a similarly-named file, and uses it as a basis file.
Whereas --detect-renamed looks everywhere for files that either match in size & modified-time, or in size & checksum (when --checksum is used), and uses each match as a basis file.
So the main difference is destination-folder-only vs. everywhere, am I right?
### Some questions:
# --fuzzy can be used twice to look in --link-dest folders, which is useful when backing up to an empty directory. What about --detect-renamed?
# Don't these 2 options kill memory when backing up many, many files (even more so when also looking in --link-dest folders)? Don't they maintain an in-memory list of files?
# Will these options only do their job when needed (i.e. when a basis file must be found), or every time?
# Do these options impact destination performance, or do they benefit from already-done scans? For example, will -yy scan all --link-dest dirs (disk-IO intensive) even if it's perhaps not needed?
# About --detect-renamed: let's imagine foo/A has been found in bar/. Will it be smart enough to directly search for foo/B in bar/, instead of restarting a whole lookup?
# These 2 options use the found file as a basis file. Let's imagine the found file totally matches, and we are using --link-dest. Could we think about linking the file instead of copying it?
# Last, do you have plans to merge --detect-renamed onto the trunk?
Thank you very much for this deep analysis!
I recently ran into the problem that a large file set got renamed and then re-sent. I tried to fix after the fact, so I went the obvious way of comparing sizes and modtimes on the destination and calculate checksums for potential matches. I would have preferred to use a list of inode numbers and files for the old and new file sets instead...
So I wonder whether a different approach to the problem could make sense:
a) the filelist contains inode numbers, and after a successful rsync, a file is generated in the target dir listing inodes and names of all files transferred
b) when receiving to the same dir, if a target file does not exist, the inode in the filelist is used to look up the previous filename. If it exists and matches in size and modtime, it could be hardlinked. Deleting files from the target that are no longer in the source would take care of the old file. When the sync is completed without error, the list of inodes and file names would be updated
c) when receiving in link-dest mode, the file in the old dir would be consulted for a potential match, and the new list would be created in the target dir
Of course this only makes sense if inode numbers are reliable, as they are on all standard local file systems or NFS. I do not know whether newer storage systems preserve inodes. Note also that the same inode may appear more than once in a source file set (hard links).
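Steps (a) and (b) above can be illustrated like this (a hedged sketch only: the /tmp/ino-demo dir and the .rsync-inodes filename are invented, and GNU find's -printf is assumed):

```shell
#!/bin/sh
# Save an "inode name" map after a sync, then use it to recover a
# file's previous name after a rename.
target=/tmp/ino-demo
rm -rf "$target"; mkdir -p "$target"
echo data > "$target/report.log"

# a) after a successful run, record "inode name" for each file
( cd "$target" && find . -type f ! -name '.rsync-inodes' -printf '%i %p\n' ) \
    > "$target/.rsync-inodes"

# simulate the next run finding the same inode under a new name
mv "$target/report.log" "$target/renamed.log"

# b) a rename keeps the inode, so the map still yields the old name
ino=$( cd "$target" && find . -name renamed.log -printf '%i' )
awk -v i="$ino" '$1 == i { print "previously: " $2 }' "$target/.rsync-inodes"
```

The lookup prints the file's previous name, which the receiver could then verify by size and modtime before hardlinking, as the comment proposes.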
What is the current state of this? Am I correct that this is still not available in official released versions?
Is this still the right place for tracking it or should it be moved to the GitHub tracker?
As another motivation for this: I use rsync for backups and would like to be able to see whether files were just renamed, or were deleted and others newly created (which currently cannot be distinguished). That way, by going through the output of a dry run beforehand, I can make sure nothing was modified by mistake when backing up my data. Basically I want to replace FreeFileSync, which has this ability.
This feature request is so old it has lost relevance, because btrfs/zfs/etc. are more optimal backup solutions than rsync.
To me and others it still seems relevant.
I have to admit though that I haven't looked much into other solutions for backups, like the btrfs send/receive commands. I suppose those were the ones you meant? Note that simply using btrfs and creating snapshots can hardly be seen as a backup. I am rather talking about backing up data to external media (ideally off-site, but that is a different story).
(In reply to Claudius Ellsel from comment #31)
Yes, any COW FS with "send/receive" will have inherent rename handling, and will be faster than rsync because the diffs are inherent. With zfs one can even have encrypted backups without the backup server ever seeing the key or unencrypted data.
COW is not good for databases, VM images, etc., but that's not where rename detection is useful either.
Hm, those backups won't work on the file level though, AFAIK. Thus I cannot easily access files on a backup drive, for example. Also, I want to use this as some kind of confirm stage, a bit like committing with git, where I review all changes to files and confirm them when transferring them to the backup.
(In reply to Claudius Ellsel from comment #33)
Yes you can easily access files on a COW-FS backup; it's a file system, that's what it's for.
If you want to review changes before backup you can just diff or rsync --dry-run snapshot/a snapshot/b locally before sending it.
(In reply to elatllat from comment #34)
> Yes you can easily access files on a COW-FS backup; it's a file system,
> that's what it's for.
This is going off-topic, but my backup drive is NTFS currently, which would complicate things probably.
> If you want to review changes before backup you can just diff or rsync
> --dry-run snapshot/a snapshot/b locally before sending it.
Unfortunately due to rsync's current inability to detect moved files that would result in exactly the same problem I currently have: Not being able to know whether a file was deleted and another different one was created or whether it is just the same file that was moved (or renamed).
(In reply to Claudius Ellsel from comment #35)
> This is going off-topic
On such an old bug with modern workarounds, I think it's worth talking about.
> backup drive is NTFS currently, which would complicate things probably.
Using a COW FS would require you to use said FS (not NTFS ... you can put file systems inside each other with losetup, but it's not ideal). If you really don't want to use another filesystem, then the options are git, git-annex, that wrapper script I cooked up, or likely some backup software with the functionality baked in.
> > If you want to review changes before backup
> [and] detect moved files
you can with a COW FS, like this (R for renamed):
# zfs diff pool/volume@snap1 pool/volume@snap2
R /pool/volume/fileB -> /pool/volume/fileC
The btrfs equivalent is a bit more rough, but (link for rename):
#./btrfs-snapshots-diff.py -sb -p /media/btrfs/v_1/s_1 -c /media/btrfs/v_1/s_2 | grep -E path=. | grep -v utimes | tail -n +2
This basically is personal preference. I know that I can do this on btrfs (which is used on the system I want to back up from), and it's also pretty easy with tools like snapper. Maybe it would be feasible to do that comparison every time before creating a new manual snapshot, which then acts a bit like a "commit". I basically want one stage at which I know that I have checked the changes made. Ideally that is at the backup stage, so the backup only contains checked changes.
I'll think about that method, maybe even converting my backup drive to btrfs as well.
Back on topic, this feature still comes in handy in many situations, imho. There might be workarounds, but those might not be feasible or wanted. The reality of this not being merged soon might force users to turn to workarounds (which might also have other advantages) though.