Bug 2294 - Detect renamed files and handle by renaming instead of delete/re-send
Summary: Detect renamed files and handle by renaming instead of delete/re-send
Status: ASSIGNED
Alias: None
Product: rsync
Classification: Unclassified
Component: core
Version: 2.6.3
Hardware: All
OS: All
Importance: P4 enhancement
Target Milestone: ---
Assignee: Wayne Davison
QA Contact: Rsync QA Contact
URL:
Keywords:
Duplicates: 6996
Depends on:
Blocks:
 
Reported: 2005-02-01 11:56 UTC by Michael Wilson (dead mail address)
Modified: 2024-04-03 01:23 UTC (History)
CC List: 17 users

See Also:


Attachments

Description Michael Wilson (dead mail address) 2005-02-01 11:56:08 UTC
It would be nice if rsync could detect identical files with differing names and
just copy/rename the files instead of sending the data all over again.

As I understand it, rsync creates a single long array listing of the filenames
and associated hashes.  If it's possible to index on the hashes, cross-checked
with file size, this should be fairly straight-forward, requiring no major
redesign to implement.

The enhancement is easily motivated if you think about what happens when rsync
is used to keep two large servers in sync, and a maintainer renames a top-level
directory on the source machine.
Comment 1 BlackB1rd 2005-02-11 01:58:50 UTC
I totally agree with this one. With this enhancement there would no longer be
unnecessary traffic when a user has moved or copied a large directory (which
is really annoying).
Comment 2 Wayne Davison 2005-02-12 14:49:37 UTC
This is the basic idea behind fuzzy.diff in the patches dir.  It does not
currently try to find a basis-file match based on size and mtime (just
similarity of names), but I plan to extend it with that functionality when I fix
some of the patch's other minor problems (see the patch for a list of them).
Comment 3 Wayne Davison 2005-02-13 22:24:02 UTC
Note that the --fuzzy patch has made it into the CVS version.  It only looks for
renamed files in the same directory as the file being created, though, so it is
not a full solution to files being moved around in the hierarchy, or directory
names changing (that will require a pre-scan on the receiving side, which is not
currently done unless --delete was specified).

I'll leave this open for now as a suggestion for a more extensive rename detector.
Comment 4 Wayne Davison 2006-02-07 07:25:49 UTC
There is now a patch named detect_renamed.diff in the patches dir that implements the basics of finding renamed files.  This will probably go onto the trunk for the release after 2.6.7.
Comment 5 Bill McGonigle (dead mail address) 2006-03-21 14:06:00 UTC
Thanks.  This will be especially useful for log directories where logrotate is incrementing the filename number at each rotation period (httpd.10.gz -> httpd.11.gz).
Comment 6 Boris Folgmann (dead mail address) 2007-07-11 09:50:03 UTC
I'm using rsync 2.6.9 to archive rotated log files to another machine, like Bill wrote. I tried 

rsync -avzh --partial --fuzzy src dest

and

rsync -avzh --partial --delete --fuzzy --delete-after src dest

but both calls always copy all renamed/rotated log files. And of course the files are still in the same directory after being rotated! The logs are very large (several gigs) so it takes too long to be a valuable solution.
Is the patch not included in 2.6.9 or did I miss something?
Add-on question: does rsync switch off -z for .gz files in the affected directory? I think that would be a good idea.
Comment 7 Matt McCutchen 2007-10-10 16:09:08 UTC
(In reply to comment #6)
> Is the patch not included in 2.6.9 or did I miss something?

Correct, --detect-renamed still exists as a patch; it is not in the main version of rsync.

> Add-on question: does rsync switch off -z for .gz files in the affected
> directory?

Yes, by default, rsync exempts files with a number of suffixes (including .gz) from -z.  Since rsync 3.0.0, you can customize the list of suffixes with --skip-compress=LIST.
Comment 8 Bill McGonigle (dead mail address) 2008-11-30 17:24:40 UTC
(In reply to comment #5)
> Thanks.  This will be especially useful for log directories where logrotate is
> incrementing the filename number at each rotation period (httpd.10.gz ->
> httpd.11.gz).

Since I mentioned this specific use case, I should comment that I recently discovered the 'dateext' option to logrotate which provides a complete workaround in this scenario (which rsync handles perfectly) and might be the better solution for this case in general.

Back on topic, there's still great utility in detecting other rename cases, of course (I often see big .iso's get renamed).  I have to admit to having tried the patch, had trouble with short backups, and backed it out without making a good note of specifics.  What would be generally useful here for reporting problems against the patch?
Comment 9 Shahar Or (dead mail address) 2009-03-22 03:00:42 UTC
Dear developers,

I'm interested in this feature so this is a reminder to whoever is involved in this and particularly to Wayne.

Also, I've found the name of the program "Unison" in the context of this issue twice on the mailing list.

Many blessings.
Comment 10 Wayne Davison 2009-12-21 12:35:38 UTC
*** Bug 6996 has been marked as a duplicate of this bug. ***
Comment 11 Philip Ganchev 2010-10-28 00:31:33 UTC
Here are some related discussions about this:

http://www.mail-archive.com/rsync@lists.samba.org/msg20283.html

http://markmail.org/message/kmazkprjvred2r5a

Comment 12 Paul 2011-01-28 20:38:47 UTC
Hi, I was about to enter a similar suggestion to this.  My very frequent use case is moving files from one directory to another.  In that situation the file name does not change--just the directory path leading to it.  These are often quite large files (0.2 to several GB) so avoiding re-copying them would speed things up a lot. 

Thanks

--Paul
Comment 13 Paul 2011-01-28 20:40:23 UTC
x
Comment 15 Michael Monnerie 2011-02-04 02:50:43 UTC
How to apply those 2 detect-renamed* patches? I did
git clone git://git.samba.org/rsync.git
and tried to
patch -p1 <patches/detect-renamed.diff
but that doesn't succeed. Which version would I need to check out to get the patches applied? Sorry, I don't know git.
Comment 16 Benjamin ANDRE 2011-02-04 10:22:34 UTC
You don't need git to get the sources: http://samba.anu.edu.au/ftp/rsync/
Choose "rsync-3.0.7.tar.gz" and "rsync-patches-3.0.7.tar.gz".

Ben
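For anyone unsure about the patch step itself: the patches tarball unpacks into the same rsync-3.0.7/ source tree, so `patch -p1` is run from the top-level source directory (unpack both tarballs, `cd rsync-3.0.7`, `patch -p1 < patches/detect-renamed.diff`, then `./configure && make`). A tiny local demo of the same `patch -p1` mechanics, with made-up file names:

```shell
# Build a miniature "source tree" with a patches/ subdirectory, then
# apply a diff that strips one leading path component (-p1), exactly
# as patches/detect-renamed.diff is applied from the rsync source root.
mkdir -p rsync-demo/patches
printf 'old line\n' > rsync-demo/file.txt
cat > rsync-demo/patches/demo.diff <<'EOF'
--- a/file.txt
+++ b/file.txt
@@ -1 +1 @@
-old line
+new line
EOF
(cd rsync-demo && patch -p1 < patches/demo.diff)
```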
Comment 17 Michael Monnerie 2011-02-04 13:43:23 UTC
Damn, that was too easy ;-) Thanks a lot. I'll test the new detect-renamed* patches now.
Comment 18 Bug Reporter 2012-12-08 10:05:46 UTC
Has this issue been abandoned? It's been a "while"...
Comment 19 Norman Freudenberg 2014-01-04 22:56:00 UTC
Hey, as far as I found out, there are two patches which still haven't made it into the latest official release?
Are they still buggy?
Why haven't they made it into an official release?
Nine years is quite a long time for a possible solution...
Comment 20 dkl 2014-03-02 03:08:37 UTC
I've been playing with the --detect-renamed patch
https://git.samba.org/?p=rsync-patches.git;a=blob;f=detect-renamed.diff;h=c3e6e846eab437e56e25e2c334e292996ee84345;hb=master

I can't seem to get it to work.  Does it rely on other patches?

Anyway, in a simple test using -vv -a --detect-renamed I can see messages about "found renamed", etc., but in a real test, after renaming large directories, there is no speed-up.  I can only surmise it's not actually renaming.

I have several applications where this would be a very handy feature to have.  I don't mind using the patch, if I could just get it to work...

Btw, I'm on Mac OS 10.9.2.
Comment 21 Petr Pisar 2014-06-02 16:47:06 UTC
There is a bug (#8847) in the patchset where the partial-dir cannot be created. The fix is described there.
Comment 22 elatllat 2015-01-03 21:23:38 UTC
Wow, 10 years.
Maybe one reason this has not been implemented is that there are other options.
For example, I have been using a shell script as a wrapper to work around this limitation; here is how it works:
1) Create two lists of files (destination and source) with each file's size and path
2) For each file that is in the destination but not the source:
3) Create a subset of the source list containing files of the same size
4) If the subset is non-empty, hash the destination file and each file in the source subset until a match is found
5) Ensure the directory exists on the destination and move/rename the file.
6) On some systems hashing can be as expensive as re-transferring the file, so I added an option to move the file if there was exactly one match (only sometimes hashing), and another to skip if there were more (never hashing).
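The steps above can be sketched roughly like this (a simplified, hypothetical version: it always hashes same-size candidates and omits the one-match/multi-match options; paths and file names are made up for the demo):

```shell
# Demo setup: a file was renamed on the source side; the destination
# still holds it under the old name and path.
mkdir -p src/data dest/data
printf 'big payload\n' > dest/data/old-name.iso
cp dest/data/old-name.iso src/data/new-name.iso

# For each file present only in DEST, find a same-size file in SRC,
# confirm by hash, then move/rename locally instead of re-sending.
SRC=src DEST=dest
find "$DEST" -type f | while read -r old; do
  rel=${old#"$DEST"/}
  [ -e "$SRC/$rel" ] && continue              # still present in source: skip
  size=$(wc -c < "$old")
  oldsum=$(sha256sum "$old" | cut -d' ' -f1)
  find "$SRC" -type f -size "${size}c" | while read -r cand; do
    if [ "$(sha256sum "$cand" | cut -d' ' -f1)" = "$oldsum" ]; then
      newrel=${cand#"$SRC"/}
      mkdir -p "$DEST/$(dirname "$newrel")"
      mv "$old" "$DEST/$newrel"               # rename instead of re-send
      break
    fi
  done
done
```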

Though as I am re-evaluating my backup strategy I am looking into git-annex and other solutions.
https://en.wikipedia.org/wiki/List_of_backup_software#Free_software
Comment 23 dajoker 2016-03-06 22:20:16 UTC
Looking for this capability prior to entering it as an enhancement request myself, I found everything here and basically have the same use case.  My version is that I am creating a regular backup of logs from many servers' services onto a single box, and doing so with rsync.  Some of those services still do the .1, .2, .3 file rotation, which makes for a lot of needless work, especially when these are 100+ MiB files.  It would be great if rsync could detect this to just transfer the new file and rename the old ninety-nine (or however-many).
Comment 24 Karl O. Pinc 2016-03-07 01:37:44 UTC
On Sun, 06 Mar 2016 22:20:16 +0000
samba-bugs@samba.org wrote:

> https://bugzilla.samba.org/show_bug.cgi?id=2294
> 
> --- Comment #23 from dajoker@gmail.com ---
> Looking for this capability prior to entering it as an enhancement
> request myself, I found everything here and basically have the same
> use case.  My version is that I am creating a regular backup of logs
> from many servers' services onto a single box, and doing so with
> rsync.  Some of those services still do the .1, .2, .3 file rotation,
> which makes for a lot of needless work, especially when these are
> 100+ MiB files.  It would be great if rsync could detect this to just
> transfer the new file and rename the old ninety-nine (or
> however-many).

It is not so hard to add the following to your logrotate.conf.
Just saying.

# Add a date extension instead of just a number for rsync hardlinked
# backups. 
dateext
dateformat -%Y-%m-%d-%s

Karl <kop@meme.com>
Free Software:  "You don't pay back, you pay forward."
                 -- Robert A. Heinlein
Comment 25 Andrey Gursky 2016-03-07 02:31:03 UTC
On Sun, 06 Mar 2016 22:20:16 +0000
samba-bugs@samba.org wrote:

> https://bugzilla.samba.org/show_bug.cgi?id=2294
> 
> --- Comment #23 from dajoker@gmail.com ---
> Looking for this capability prior to entering it as an enhancement request
> myself, I found everything here and basically have the same use case.  My
> version is that I am creating a regular backup of logs from many servers'
> services onto a single box, and doing so with rsync.  Some of those services
> still do the .1, .2, .3 file rotation, which makes for a lot of needless work,
> especially when these are 100+ MiB files.  It would be great if rsync could
> detect this to just transfer the new file and rename the old ninety-nine (or
> however-many).

Maybe unison could handle such renames better?

Regards,
Andrey
Comment 26 Ben RUBSON 2016-12-30 17:47:59 UTC
### What's the diff between --fuzzy and --detect-renamed ?

If I understand correctly, --fuzzy looks only in destination folder, for either a file that has an identical size and modified-time, or a similarly-named file, and uses it as a basis file.
Whereas --detect-renamed looks everywhere for files that either match in size & modify-time, or match in size & checksum (when --checksum is used), and uses each match as a basis file.
So the main difference is destination_folder_only vs everywhere, am I right ?

### Some questions :

# --fuzzy can be used twice to look in --link-dest folders, useful when backing-up to an empty directory. What about --detect-renamed ?

# Don't these 2 options kill memory when backing-up many many files (furthermore when also looking in --link-dest folders) ? Don't they maintain in-memory list of files ?

# Will these options only do their job when needed (need to find a basis file), or every time ?

# Do these options impact destination performance, or do they benefit from already-done scans ? For example, will -yy scan all --link-dest dirs (disk IO intensive) even if perhaps it's not needed ?

# About --detect-renamed, let's imagine foo/A has been found in bar/. Will it be smart enough to directly search for foo/B in bar/, instead of restarting a whole lookup ?

# These 2 options use found file as a basis file. Let's imagine the found file totally matches, and we are using --link-dest. Could we think about linking the file instead of copying it ?

# Last, do you have plans for --detect-renamed onto the trunk ?

Thank you very much for this deep analysis!

Ben
Comment 27 Wolfgang Hamann 2017-01-22 15:19:34 UTC
Hi,

I recently ran into the problem that a large file set got renamed and then re-sent. I tried to fix things after the fact, so I went the obvious way of comparing sizes and modtimes on the destination and calculating checksums for potential matches. I would have preferred to use a list of inode numbers and files for the old and new file sets instead...

So I wonder whether a different approach to the problem could make sense:
a) the filelist contains inode numbers, and after a successful rsync, a file is generated in the target dir listing inodes and names of all files transferred
b) when receiving to the same dir, if a target file does not exist, the inode in the filelist is used to look up the previous filename. If it exists and matches in size and modtime, it could be hardlinked. Deleting files from the target that are no longer in the source would take care of the old file. When the sync is completed without error, the list of inodes and file names would be updated
c) when receiving in link-dest mode, the file in the old dir would be consulted for a potential match, and the new list would be created in the target dir

Of course this only makes sense if inode numbers are reliable, as on all standard local file systems or nfs. I do not know whether the new storage arenas preserve inodes. It is obvious that the same inode may appear more than once in a source file set

Regards
Wolfgang
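A hedged sketch of the manifest half of ideas (a) and (b): record inode numbers and paths after a successful run, then use the manifest to recover the previous name of a file whose inode still exists. File names and the manifest name (.rsync-inodes) are illustrative, and this ignores the multi-filesystem caveats noted above:

```shell
# Idea (a): after a successful sync, write an "inode path" manifest
# into the target dir (excluding the manifest file itself).
mkdir -p target
echo 'contents' > target/report-2016.txt
find target -type f ! -name '.rsync-inodes' -printf '%i %p\n' \
  > target/.rsync-inodes

# Idea (b): later the file has been renamed (same inode, same fs);
# look up its previous path by inode in the manifest.
mv target/report-2016.txt target/report-final.txt
ino=$(stat -c %i target/report-final.txt)
awk -v i="$ino" '$1 == i { print $2 }' target/.rsync-inodes
```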
Comment 28 Claudius Ellsel 2021-01-15 13:39:10 UTC
What is the current state of this? Am I correct that this is still not available in official released versions?

Is this still the right place for tracking it or should it be moved to the GitHub tracker?
Comment 29 Claudius Ellsel 2021-01-15 13:45:48 UTC
As another motivation for this: I use rsync for backups and would like to be able to see whether files were just renamed, or were deleted and others newly created (which currently cannot be distinguished). That way I can make sure nothing was modified by mistake, by going through the dry-run output before backing up my data. Basically I want to replace FreeFileSync, which has this ability.
Comment 30 elatllat 2021-01-15 14:19:12 UTC
This feature request is so old it has lost relevance, because btrfs/zfs/etc. are more optimal backup solutions than rsync.
Comment 31 Claudius Ellsel 2021-01-15 14:41:40 UTC
To me and others it still seems relevant.

I have to admit though that I haven't looked much into other solutions for backups, like the btrfs send/receive commands. I suppose those were the ones you meant? Note that simply using btrfs and creating snapshots can barely be seen as a backup. I am rather talking about backing up data to external media (ideally off-site, but that is a different story).
Comment 32 elatllat 2021-01-15 15:04:39 UTC
(In reply to Claudius Ellsel from comment #31)
Yes, any COW FS with "send/receive" will have inherent rename handling, and will be faster than rsync because the diffs are inherent. With zfs one can even have encrypted backups without the backup server ever seeing the key or unencrypted data.
COW is not good for databases, VM images, etc., but that's not where rename detection is useful either.
Comment 33 Claudius Ellsel 2021-01-15 15:56:36 UTC
Hm, those backups won't work at the file level, though, AFAIK. Thus I cannot easily access files on a backup drive, for example. Also, I want to use this as a kind of confirm stage, a bit like committing with git, where I review all changes to files and confirm them when transferring them to the backup.
Comment 34 elatllat 2021-01-15 16:36:37 UTC
(In reply to Claudius Ellsel from comment #33)
Yes you can easily access files on a COW-FS backup; it's a file system, that's what it's for.

If you want to review changes before backup you can just diff or rsync --dry-run snapshot/a snapshot/b locally before sending it.
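For example, a local dry-run comparison between two snapshot directories (paths and file names here are illustrative; plain `diff -rq` works even where rsync is not installed):

```shell
# Two snapshots of the same tree, with one file renamed in between.
mkdir -p snapA snapB
echo 'same contents' > snapA/fileB
cp snapA/fileB snapB/fileC            # renamed between snapshots
diff -rq snapA snapB || true          # diff exits non-zero on differences
```

Note that this still reports the rename as "only in A" / "only in B", which is exactly the ambiguity being discussed.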
Comment 35 Claudius Ellsel 2021-01-15 17:08:00 UTC
(In reply to elatllat from comment #34)
> Yes you can easily access files on a COW-FS backup; it's a file system, that's what it's for.

This is going off-topic, but my backup drive is NTFS currently, which would complicate things probably.

> If you want to review changes before backup you can just diff or rsync --dry-run snapshot/a snapshot/b locally before sending it.

Unfortunately due to rsync's current inability to detect moved files that would result in exactly the same problem I currently have: Not being able to know whether a file was deleted and another different one was created or whether it is just the same file that was moved (or renamed).
Comment 36 elatllat 2021-01-15 18:40:10 UTC
(In reply to Claudius Ellsel from comment #35)
> This is going off-topic

On such an old bug with modern workarounds I think it's worth talking about.

> backup drive is NTFS currently, which would complicate things probably.

Using a COW FS would require you to use said FS (not NTFS ... you can put file systems inside each other with losetup, but it's not ideal). If you really don't want to use another filesystem, then the options are git, or git-annex, or that wrapper script I cooked up, or there is likely some backup software with the functionality baked in.

> > If you want to review changes before backup
> [and] detect moved files

You can with a COW FS, like this (R means renamed):

# zfs diff pool/volume@snap1 pool/volume@snap2
M       /pool/volume/
R       /pool/volume/fileB -> /pool/volume/fileC
Comment 37 elatllat 2021-01-16 01:18:31 UTC
The btrfs equivalent is a bit rougher ("link" indicates a rename):

#./btrfs-snapshots-diff.py -sb -p /media/btrfs/v_1/s_1 -c /media/btrfs/v_1/s_2 | grep -E path=. | grep -v utimes | tail -n +2
link;path=fileB;path_link=fileC
unlink;path=fileB
Comment 38 Claudius Ellsel 2021-01-16 13:31:35 UTC
This basically is a personal preference. I know that I can do this on btrfs (which is used on the system I want to back up from), also pretty easily with tools like snapper. Maybe it would be feasible to do that comparison every time before creating a new manual snapshot, which would then act a bit like a "commit". I basically want to have one stage at which I know that I have checked the changes made. Ideally that is the backup stage, so the backup only contains checked changes.

I'll think about that method, maybe even converting my backup drive to btrfs as well.

Back on topic: this feature still comes in handy in many situations, imho. There might be workarounds, but those might not be feasible or wanted. The reality of this not being merged soon might force users to turn to workarounds (which might also have other advantages), though.
Comment 39 andy 2024-02-09 10:04:28 UTC
> This feature request is so old it has lost relavence because btrfs/zfs/etc are more optimal backup solutions than rsync.

Funny, I am doing exactly this, but I came to rsync looking for a backup for when ZFS fails. Many consider zfs/btrfs snapshots "not a backup". There are many things that can go wrong where you will need real backups to save you:
- accidentally deleted pool/datasets/snapshots
- bug in replication tool
- user error
- ZFS bug

It should be considered a certainty that one of those will happen at some point, and the ZFS snapshots won't save you.

> With zfs one can even have encrypted backups without the backup server ever seeing the key or un-encrypted data.

I love this idea, but in practice I'm finding there is significant risk with the state of ZFS encryption. There are so many active bugs related to encryption. I'm in the middle of implementing a replication system based on raw encrypted snapshot replication between multiple systems, trusted and untrusted. But the new bugs I've run into along the way, along with the previously known ones, makes me really feel the need for a solid non-ZFS filesystem backup. And also a low complexity tool, not dependent on complicated replication tools/scripts.

In looking for rename/move solutions with rsync, one issue I can foresee with inode tracking is that I find it is very common to cross filesystem boundaries. Anything tracking inodes would need to track the device as well, though the device number from the stat struct doesn't seem to be enough in the case of ZFS to trace back to what filesystem it actually comes from. 

Reading the unison documentation, it seems that for linux they track the combo of inode & last modification time to detect moves/renames. I wonder if some kind of collision is possible under a rare multi-filesystem edge case. Inodes aren't unique across multiple filesystems.

Restic is another option that handles moves/renames/dedup automatically, just at the cost of CPU time for encrypting/hashing. Probably worth considering borg and friends at that point.

Well, maybe the cost of rsync's inefficiency here is worth its simplicity. But it would be a great feature to have.
Comment 40 roland 2024-02-09 10:47:29 UTC
> This feature request is so old it has lost relavence because btrfs/zfs/etc are more optimal backup solutions than rsync.

sorry, but this comment is total nonsense, as btrfs and zfs are filesystems (i.e. places to store data), whereas rsync is a tool to sync data from one place to another.

we are using zfs + snapshots + rsync as an enterprise wide backup solution, as we mostly run unix based systems.
Comment 41 Mihnea-Costin Grigore 2024-04-03 01:23:35 UTC
The discussion about file systems like ZFS/BTRFS/etc. and their various snapshot mechanisms is off-topic relative to this feature request, since they are very different technologies used for different purposes.

rsync is used commonly to synchronise at the *file level* between very different operating systems -- from Linux to Windows, from macOS to Linux, etc. It also has multiple features to allow filtering and selecting files within the source and destination directories, thus only synchronising a subset of files. We need a solution that works with the flexibility of rsync itself, and snapshots (useful as they are) do not fit that at all.

This would be a very useful feature to have as part of rsync, made evident by the many requests both here and on other discussion forums over the past almost 20 (!) years. Can we rather discuss what the blockers are for the existing patches? What stops them from inclusion -- is it quality, functionality, compatibility, test coverage, something else? Working to fix that and making the already existing patches acceptable for inclusion would appear to be the most constructive course of action, rather than deflecting the request.