Bug 2294 - Detect renamed files and handle by renaming instead of delete/re-send
Status: ASSIGNED
Product: rsync
Classification: Unclassified
Component: core
Version: 2.6.3
Hardware: All All
Importance: P4 enhancement
Target Milestone: ---
Assigned To: Wayne Davison
QA Contact: Rsync QA Contact
Duplicates: 6996
Reported: 2005-02-01 11:56 UTC by Michael Wilson (dead mail address)
Modified: 2017-01-22 15:19 UTC (History)
Description Michael Wilson (dead mail address) 2005-02-01 11:56:08 UTC
It would be nice if rsync could detect identical files with differing names and
just copy/rename the files instead of sending the data all over again.

As I understand it, rsync creates a single long array listing the filenames
and associated hashes.  If it's possible to index on the hashes, cross-checked
with file size, this should be fairly straightforward, requiring no major
redesign to implement.
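
The lookup described above can be sketched as follows. This is an illustration only, not rsync code: it assumes each file-list entry is a (name, size, digest) tuple, which is the reporter's understanding of the file list rather than a confirmed fact, and the function names are hypothetical.

```python
from collections import defaultdict


def index_by_size_and_hash(entries):
    """Group file names by (size, digest). Entry layout is an assumption."""
    index = defaultdict(list)
    for name, size, digest in entries:
        index[(size, digest)].append(name)
    return index


def find_renames(source_entries, dest_entries):
    """Return (existing_dest_name, wanted_name) pairs that need no data transfer."""
    dest_entries = list(dest_entries)
    dest_index = index_by_size_and_hash(dest_entries)
    dest_names = {name for name, _size, _digest in dest_entries}
    renames = []
    for name, size, digest in source_entries:
        if name in dest_names:
            continue  # already present under the same name
        candidates = dest_index.get((size, digest), [])
        if candidates:
            # Any identical file on the receiver can be copied or renamed locally.
            renames.append((candidates[0], name))
    return renames
```

Whether the receiver should rename or copy the matching file depends on whether the old name still exists on the source; the sketch leaves that decision to the caller.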

The enhancement is easily motivated if you think about what happens when rsync
is used to keep two large servers in sync, and a maintainer renames a top-level
directory on the source machine.
Comment 1 BlackB1rd 2005-02-11 01:58:50 UTC
I totally agree with this one. With this enhancement there would no longer be
unnecessary traffic when a user has moved or copied a large directory (which
is really annoying).
Comment 2 Wayne Davison 2005-02-12 14:49:37 UTC
This is the basic idea behind fuzzy.diff in the patches dir.  It does not
currently try to find a basis-file match based on size and mtime (just
similarity of names), but I plan to extend it with that functionality when I fix
some of the patch's other minor problems (see the patch for a list of them).
Comment 3 Wayne Davison 2005-02-13 22:24:02 UTC
Note that the --fuzzy patch has made it into the CVS version.  It only looks for
renamed files in the same directory as the file being created, though, so it is
not a full solution to files being moved around in the hierarchy, or directory
names changing (that will require a pre-scan on the receiving side, which is not
currently done unless --delete was specified).

I'll leave this open for now as a suggestion for a more extensive rename detector.
Comment 4 Wayne Davison 2006-02-07 07:25:49 UTC
There is now a patch named detect_renamed.diff in the patches dir that implements the basics of finding renamed files.  This will probably go onto the trunk for the release after 2.6.7.
Comment 5 Bill McGonigle 2006-03-21 14:06:00 UTC
Thanks.  This will be especially useful for log directories where logrotate is incrementing the filename number at each rotation period (httpd.10.gz -> httpd.11.gz).
Comment 6 Boris Folgmann 2007-07-11 09:50:03 UTC
I'm using rsync 2.6.9 to archive rotated log files to another machine, like Bill wrote. I tried 

rsync -avzh --partial --fuzzy src dest

and

rsync -avzh --partial --delete --fuzzy --delete-after src dest

but both calls always copy all renamed/rotated log files. And of course the files are still in the same directory after being rotated! The logs are very large (several gigs) so it takes too long to be a valuable solution.
Is the patch not included in 2.6.9 or did I miss something?
Add-on question: does rsync switch off -z for .gz files in the affected directory? I think that would be a good idea.
Comment 7 Matt McCutchen 2007-10-10 16:09:08 UTC
(In reply to comment #6)
> Is the patch not included in 2.6.9 or did I miss something?

Correct, --detect-renamed still exists as a patch; it is not in the main version of rsync.

> Add-on question: does rsync switch off -z for .gz files in the affected
> directory?

Yes, by default, rsync exempts files with a number of suffixes (including .gz) from -z.  Since rsync 3.0.0, you can customize the list of suffixes with --skip-compress=LIST .
Comment 8 Bill McGonigle 2008-11-30 17:24:40 UTC
(In reply to comment #5)
> Thanks.  This will be especially useful for log directories where logrotate is
> incrementing the filename number at each rotation period (httpd.10.gz ->
> httpd.11.gz).

Since I mentioned this specific use case, I should comment that I recently discovered the 'dateext' option to logrotate which provides a complete workaround in this scenario (which rsync handles perfectly) and might be the better solution for this case in general.

Back on topic, there's still great utility in detecting other rename cases, of course (I often see big .iso's get renamed).  I have to admit to having tried the patch, had trouble with short backups, and backed it out without making a good note of specifics.  What would be generally useful here for reporting problems against the patch?
Comment 9 Shahar Or 2009-03-22 03:00:42 UTC
Dear developers,

I'm interested in this feature so this is a reminder to whoever is involved in this and particularly to Wayne.

Also, I've found the name of the program "Unison" in the context of this issue twice on the mailing list.

Many blessings.
Comment 10 Wayne Davison 2009-12-21 12:35:38 UTC
*** Bug 6996 has been marked as a duplicate of this bug. ***
Comment 11 Philip Ganchev 2010-10-28 00:31:33 UTC
Here are some related discussions about this:

http://www.mail-archive.com/rsync@lists.samba.org/msg20283.html

http://markmail.org/message/kmazkprjvred2r5a

Comment 12 Paul 2011-01-28 20:38:47 UTC
Hi, I was about to enter a similar suggestion to this.  My very frequent use case is moving files from one directory to another.  In that situation the file name does not change--just the directory path leading to it.  These are often quite large files (0.2 to several GB) so avoiding re-copying them would speed things up a lot. 

Thanks

--Paul
Comment 13 Paul 2011-01-28 20:40:23 UTC
x
Comment 15 Michael Monnerie 2011-02-04 02:50:43 UTC
How do I apply those 2 detect-renamed* patches? I did
git clone git://git.samba.org/rsync.git
and tried to
patch -p1 <patches/detect-renamed.diff
but that doesn't succeed. Which version would I need to check out to get the patches applied? Sorry, I don't know git.
Comment 16 Benjamin ANDRE 2011-02-04 10:22:34 UTC
you don't need git to get the sources : http://samba.anu.edu.au/ftp/rsync/
and choose "rsync-3.0.7.tar.gz" and "rsync-patches-3.0.7.tar.gz"

Ben
Comment 17 Michael Monnerie 2011-02-04 13:43:23 UTC
Damn, that was too easy ;-) Thanks a lot. I'll test the new detect-renamed* patches now.
Comment 18 Bug Reporter 2012-12-08 10:05:46 UTC
Has this issue been abandoned? It's been a "while"...
Comment 19 Norman Freudenberg 2014-01-04 22:56:00 UTC
Hey, as far as I can tell there are two patches which still have not made it into the last official release?
Are they still buggy?
Why didn't they make it into an official release?
9 years is quite a long time for a possible solution...
Comment 20 dkl 2014-03-02 03:08:37 UTC
I've been playing with the --detect-renamed patch
https://git.samba.org/?p=rsync-patches.git;a=blob;f=detect-renamed.diff;h=c3e6e846eab437e56e25e2c334e292996ee84345;hb=master

I can't seem to get it to work.  Does it rely on other patches?

Anyway, in a simple test, using -vv -a --detect-renamed I get messages about "found renamed", etc, but in a real test, after renaming large directories, there is no speedup.  I can only surmise it's not actually renaming.

I have several applications where this would be a very handy feature to have.  I don't mind using the patch, if I could just get it to work...

Btw, I'm on Mac OS 10.9.2.
Comment 21 Petr Pisar 2014-06-02 16:47:06 UTC
There is a bug (#8847) in the patchset that shows up when partial-dir cannot be created. The fix is described there.
Comment 22 elatllat 2015-01-03 21:23:38 UTC
Wow 10 years.
Maybe one reason this has not been implemented is there are other options. 
For example, I have been using a shell script as a wrapper to reduce the irritation of this bug. Here is how it works:
1) Create two lists of files, destination and source, with file sizes and paths.
2) For each file that is in the destination but not in the source:
3) Create a subset of the source list containing files of the same size.
4) If the subset is non-empty, hash the destination file and each file in the source subset until a match is found.
5) Ensure the directory exists on the destination and move/rename the file.
6) On some systems hashing can be as expensive as re-transferring the file, so I added an option to move the file without hashing if there is exactly one match (only sometimes hashing), and another to skip the file if there is more than one (never hashing).
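
The steps above could look roughly like this in Python. It is a sketch under assumptions, not the actual wrapper script: both trees are assumed to be locally readable directories of regular files, only source files that are missing on the destination are considered as candidates, and the names plan_moves, apply_moves and the mode values are hypothetical.

```python
import hashlib
import os
from collections import defaultdict


def list_files(root):
    """Step 1: map each relative path under root to its size."""
    sizes = {}
    for dirpath, _dirs, files in os.walk(root):
        for fname in files:
            full = os.path.join(dirpath, fname)
            sizes[os.path.relpath(full, root)] = os.path.getsize(full)
    return sizes


def file_hash(path):
    """SHA-256 of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def plan_moves(src_root, dst_root, mode="always"):
    """Return (old_rel, new_rel) moves to perform inside dst_root.

    mode="always": hash every candidate (steps 1-5).
    mode="unless_unique": trust a single same-size candidate, else hash.
    mode="never": trust a single candidate, skip when there are several.
    """
    src = list_files(src_root)
    dst = list_files(dst_root)
    by_size = defaultdict(list)  # step 3: source files grouped by size
    for rel, size in sorted(src.items()):
        if rel not in dst:
            by_size[size].append(rel)
    moves = []
    for rel, size in sorted(dst.items()):
        if rel in src:
            continue  # step 2: only files in destination but not in source
        candidates = by_size.get(size, [])
        if not candidates:
            continue
        if mode != "always" and len(candidates) == 1:
            match = candidates[0]  # step 6: move without hashing
        elif mode == "never":
            match = None  # step 6: several candidates, skip
        else:
            wanted = file_hash(os.path.join(dst_root, rel))  # step 4
            match = next(
                (c for c in candidates
                 if file_hash(os.path.join(src_root, c)) == wanted),
                None,
            )
        if match is not None:
            candidates.remove(match)  # each source file is matched once
            moves.append((rel, match))
    return moves


def apply_moves(dst_root, moves):
    """Step 5: create target directories, then rename inside dst_root."""
    for old_rel, new_rel in moves:
        target = os.path.join(dst_root, new_rel)
        os.makedirs(os.path.dirname(target) or ".", exist_ok=True)
        os.rename(os.path.join(dst_root, old_rel), target)
```

Running plan_moves first and inspecting the result before calling apply_moves gives a dry run; the non-hashing modes trade safety for speed, since two different files can share a size.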

Though as I am re-evaluating my backup strategy I am looking into git-annex and other solutions.
https://en.wikipedia.org/wiki/List_of_backup_software#Free_software
Comment 23 dajoker 2016-03-06 22:20:16 UTC
Looking for this capability prior to entering it as an enhancement request myself, I found everything here and basically have the same use case.  My version is that I am creating a regular backup of logs from many servers' services onto a single box, and doing so with rsync.  Some of those services still do the .1, .2, .3 file rotation, which makes for a lot of needless work, especially when these are 100+ MiB files.  It would be great if rsync could detect this to just transfer the new file and rename the old ninety-nine (or however-many).
Comment 24 Karl O. Pinc 2016-03-07 01:37:44 UTC
On Sun, 06 Mar 2016 22:20:16 +0000
samba-bugs@samba.org wrote:

> https://bugzilla.samba.org/show_bug.cgi?id=2294
> 
> --- Comment #23 from dajoker@gmail.com ---
> Looking for this capability prior to entering it as an enhancement
> request myself, I found everything here and basically have the same
> use case.  My version is that I am creating a regular backup of logs
> from many servers' services onto a single box, and doing so with
> rsync.  Some of those services still do the .1, .2, .3 file rotation,
> which makes for a lot of needless work, especially when these are
> 100+ MiB files.  It would be great if rsync could detect this to just
> transfer the new file and rename the old ninety-nine (or
> however-many).

It is not so hard to add the following to your logrotate.conf.
Just saying.

# Add a date extension instead of just a number for rsync hardlinked
# backups. 
dateext
dateformat -%Y-%m-%d-%s

Karl <kop@meme.com>
Free Software:  "You don't pay back, you pay forward."
                 -- Robert A. Heinlein
Comment 25 Andrey Gursky 2016-03-07 02:31:03 UTC
On Sun, 06 Mar 2016 22:20:16 +0000
samba-bugs@samba.org wrote:

> https://bugzilla.samba.org/show_bug.cgi?id=2294
> 
> --- Comment #23 from dajoker@gmail.com ---
> Looking for this capability prior to entering it as an enhancement request
> myself, I found everything here and basically have the same use case.  My
> version is that I am creating a regular backup of logs from many servers'
> services onto a single box, and doing so with rsync.  Some of those services
> still do the .1, .2, .3 file rotation, which makes for a lot of needless work,
> especially when these are 100+ MiB files.  It would be great if rsync could
> detect this to just transfer the new file and rename the old ninety-nine (or
> however-many).

Maybe unison could handle such renames better?

Regards,
Andrey
Comment 26 Ben RUBSON 2016-12-30 17:47:59 UTC
### What's the diff between --fuzzy and --detect-renamed ?

If I understand correctly, --fuzzy looks only in destination folder, for either a file that has an identical size and modified-time, or a similarly-named file, and uses it as a basis file.
Whereas --detect-renamed looks everywhere for files that either match in size & modify-time, or match in size & checksum (when --checksum is used), and uses each match as a basis file.
So the main difference is destination_folder_only vs everywhere, am I right ?
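
To make the scope question concrete, here is a toy model of the two search scopes as I describe them above. It is not rsync's code and only models the size and modify-time match (not name similarity or checksums); find_basis and the tuple layout are hypothetical.

```python
import posixpath


def find_basis(wanted, dest_files, same_dir_only):
    """Pick a basis file for `wanted` among files already on the receiver.

    wanted and each dest_files entry are (path, size, mtime) tuples.
    same_dir_only=True models the destination-folder-only scope,
    same_dir_only=False models looking everywhere.
    """
    path, size, mtime = wanted
    for cand_path, cand_size, cand_mtime in dest_files:
        if same_dir_only and posixpath.dirname(cand_path) != posixpath.dirname(path):
            continue  # candidate lives in another directory
        if (cand_size, cand_mtime) == (size, mtime):
            return cand_path
    return None
```

With a file moved from logs/old/ to logs/new/, the same-directory search returns None while the everywhere search returns the old path.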

### Some questions :

# --fuzzy can be used twice to look in --link-dest folders, useful when backing-up to an empty directory. What about --detect-renamed ?

# Don't these 2 options kill memory when backing-up many many files (furthermore when also looking in --link-dest folders) ? Don't they maintain in-memory list of files ?

# Will these options only do their job when needed (need to find a basis file), or every time ?

# Do these options impact destination performance, or do they benefit from already-done scans ? For example, will -yy scan all --link-dest dirs (disk IO intensive) even if perhaps it's not needed ?

# About --detect-renamed, let's imagine foo/A has been found in bar/. Will it be smart enough to directly search for foo/B in bar/, instead of restarting a whole lookup ?

# These 2 options use found file as a basis file. Let's imagine the found file totally matches, and we are using --link-dest. Could we think about linking the file instead of copying it ?

# Lastly, do you have plans to move --detect-renamed onto the trunk ?

Thank you very much for this deep-analysis !

Ben
Comment 27 Wolfgang Hamann 2017-01-22 15:19:34 UTC
Hi,

I recently ran into the problem that a large file set got renamed and then re-sent. I tried to fix it after the fact, so I went the obvious way of comparing sizes and modtimes on the destination and calculating checksums for potential matches. I would have preferred to use a list of inode numbers and files for the old and new file sets instead...

So I wonder whether a different approach to the problem could make sense:
a) the filelist contains inode numbers, and after a successful rsync, a file is generated in the target dir listing inodes and names of all files transferred
b) when receiving to the same dir, if a target file does not exist, the inode in the filelist is used to look up the previous filename. If it exists and matches in size and modtime, it could be hardlinked. Deleting files from the target that are no longer in the source would take care of the old file. When the sync is completed without error, the list of inodes and file names would be updated
c) when receiving in link-dest mode, the file in the old dir would be consulted for a potential match, and the new list would be created in the target dir

Of course this only makes sense if inode numbers are reliable, as on all standard local file systems or nfs. I do not know whether the new storage arenas preserve inodes. It is obvious that the same inode may appear more than once in a source file set
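
A rough sketch of the inode lookup in (a) and (b), assuming a local source tree with stable inode numbers. Nothing here is existing rsync behaviour: snapshot stands in for the saved list of inodes and names, detect_renames for the lookup on the next run, and both names are hypothetical.

```python
import os
import stat


def snapshot(root):
    """Map inode number -> (relative path, size, mtime_ns) for regular files.

    If an inode has several hard-linked names, only the last one seen is kept.
    """
    snap = {}
    for dirpath, _dirs, files in os.walk(root):
        for fname in files:
            full = os.path.join(dirpath, fname)
            info = os.lstat(full)
            if not stat.S_ISREG(info.st_mode):
                continue  # ignore symlinks and special files
            rel = os.path.relpath(full, root)
            snap[info.st_ino] = (rel, info.st_size, info.st_mtime_ns)
    return snap


def detect_renames(previous, current):
    """Return sorted (old_path, new_path) pairs for unchanged files that moved."""
    renames = []
    for inode, (new_path, size, mtime) in current.items():
        old = previous.get(inode)
        if old is None:
            continue  # inode not seen in the previous run
        old_path, old_size, old_mtime = old
        # Same inode, same size and modtime, different name: treat as a rename.
        if old_path != new_path and (old_size, old_mtime) == (size, mtime):
            renames.append((old_path, new_path))
    return sorted(renames)
```

The size and modtime check guards against an inode number being reused for a different file between runs.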

Regards
Wolfgang