Bug 3693 - rsync -H should break outdated hard-links to identical files
Summary: rsync -H should break outdated hard-links to identical files
Status: ASSIGNED
Alias: None
Product: rsync
Classification: Unclassified
Component: core
Version: 2.6.9
Hardware: x86 Linux
Importance: P3 enhancement with 20 votes
Target Milestone: ---
Assignee: Wayne Davison
QA Contact: Rsync QA Contact
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2006-04-17 11:26 UTC by Matt McCutchen
Modified: 2024-06-14 16:06 UTC
CC List: 2 users

See Also:


Attachments
Suggested clarification (1007 bytes, patch)
2008-01-30 18:46 UTC, Matt McCutchen
Updated man page patch (4.56 KB, patch)
2009-11-18 23:31 UTC, Matt McCutchen

Description Matt McCutchen 2006-04-17 11:26:46 UTC
Run the following in an empty directory:
    mkdir src dest linkdest
    touch src/f1
    rsync -a src/f1 src/f2
    touch linkdest/f1
    ln linkdest/f1 linkdest/f2
    rsync -Ha src/ dest/ --link-dest=../linkdest/

The source files src/f1 and src/f2 are both identical to the single link-dest file that has the names linkdest/f1 and linkdest/f2.  Rsync links linkdest/f1 instead of copying src/f1 and links linkdest/f2 instead of copying src/f2.  Now dest/f1 and dest/f2 refer to the same file while src/f1 and src/f2 refer to different files.

I believe that, when -H is specified, two destination dentries should refer to the same file if and only if the corresponding source dentries do, even though there may be hard links outside the transfer to both source files and (because of --link-dest) destination files.  Thus, rsync should guard against linking to the same --link-dest file several times.  When rsync links to a --link-dest file, it should check whether a file with the same device and inode numbers has already been used; if so, rsync should copy the file into the destination instead of linking it.  When -H is not specified, I reason that the user doesn't care about hard links and this check is unnecessary.
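
The check proposed above can be sketched in a few lines. This is only an illustration of the idea, not rsync code: the function name `link_or_copy` and the `used_inodes` set are hypothetical, and the sketch assumes it is called once per source file that is not itself hard-linked to another source file.

```python
import os
import shutil

def link_or_copy(linkdest_path, dest_path, used_inodes):
    """Hard-link dest_path to linkdest_path unless that inode was already used.

    used_inodes is a set of (st_dev, st_ino) pairs for link-dest files that
    have already been linked into the destination during this run.
    Returns "link" or "copy" to report which action was taken.
    """
    st = os.stat(linkdest_path)
    key = (st.st_dev, st.st_ino)
    if key in used_inodes:
        # Another destination name already claimed this inode, so linking
        # again would merge two files that are distinct in the source.
        shutil.copy2(linkdest_path, dest_path)
        return "copy"
    os.link(linkdest_path, dest_path)
    used_inodes.add(key)
    return "link"
```

With the layout from the script above, the first call (for f1) would link and the second call (for f2) would copy, because linkdest/f1 and linkdest/f2 share one device and inode pair.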

The remark at the end of comment 1 of bug 3692 led me to discover this bug.  (Let's see if Bugzilla correctly hyperlinks that reference.)
Comment 1 hoffa 2006-04-17 14:15:48 UTC
> ...
> The source files src/f1 and src/f2 are both identical to the single
> link-dest file that has the names linkdest/f1 and linkdest/f2.

the contents are the same but some of the meta information is not.

> Rsync links linkdest/f1 instead of copying src/f1 and links linkdest/f2
> instead of copying src/f2.  Now dest/f1 and dest/f2 refer to the same
> file while src/f1 and src/f2 refer to different files.

can you post the output inline with commands as i'm not seeing this here...

[] ls -aliT .
[]  <nothing>
[] sleep 2 ; mkdir src
[] sleep 2 ; mkdir dest
[] sleep 2 ; mkdir linkdest
[] sleep 2 ; touch src/f1
[] sleep 2 ; rsync -a src/f1 src/f2
[] sleep 2 ; touch linkdest/f1
[] sleep 2 ; ln linkdest/f1 linkdest/f2
[] ls -aliT *
dest:
 <nothing>
linkdest:
238252 -rw-r--r--  2 moo  moo    0 Apr 17 14:12:30 2006 f1
238252 -rw-r--r--  2 moo  moo    0 Apr 17 14:12:30 2006 f2
src:
238250 -rw-r--r--  1 moo  moo    0 Apr 17 14:12:26 2006 f1
238251 -rw-r--r--  1 moo  moo    0 Apr 17 14:12:26 2006 f2

# i modified the erroneous[?] link-dest usage of both a relative and a
#  nonexistent path. in the past i had problems with relative. should
#  rsync complain about nonexistence with -v?
[] sleep 2 ; rsync -Hav src/ dest/ --link-dest=`pwd`/linkdest/
./
f1
f2
[] ls -aliT *
dest:
238287 -rw-r--r--  1 moo  moo    0 Apr 17 14:12:26 2006 f1
238288 -rw-r--r--  1 moo  moo    0 Apr 17 14:12:26 2006 f2
linkdest:
238252 -rw-r--r--  2 moo  moo    0 Apr 17 14:12:30 2006 f1
238252 -rw-r--r--  2 moo  moo    0 Apr 17 14:12:30 2006 f2
src:
238250 -rw-r--r--  1 moo  moo    0 Apr 17 14:12:26 2006 f1
238251 -rw-r--r--  1 moo  moo    0 Apr 17 14:12:26 2006 f2


> I believe that, when -H is specified, two destination dentries should refer
> to the same file if and only if the corresponding source dentries do

true, but not the same inode number across src and dest. only within src. and
only within dest. 'same file' means 'same inode number'. there are cross host
and cross filesystem aspects as well. besides, src and dest must be different
inode sets, otherwise munging a source file would also munge your backup ;-]

in your example, at least as reproduced above, src/{f1,f2} are not hardlinks.
so rsync has no obligation to, and indeed should not, make dest/{f1,f2}
hardlinks. with or without -H. if src/{f1,f2} were hardlinks, with -H would
preserve that relationship in dest. without -H would just make copies.

remember, a --link-dest directory is only used as a reference to save
space/time in the destination. the dest must always mirror the src regardless
of what's in any chosen --link-dest dir. in your example, linkdest/{f1,f2}
could be thought of as a previous mirror of the src that _did_ have
src/{f1,f2} hardlinked, before something came along and broke them up in the
src. thus now the new dest has them correctly separated.
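
The "same file means same inode number, within src and within dest" rule described above can be checked mechanically. The following is a minimal sketch, assuming both trees are local directories; the helper names `link_groups` and `same_link_structure` are made up for illustration. It compares how names are grouped by inode in each tree, never the inode numbers across trees.

```python
import os
from collections import defaultdict

def link_groups(root):
    """Return the set of hard-link groups under root.

    Each group is a frozenset of paths (relative to root) that share one
    (st_dev, st_ino) pair. A name with no other link in the tree forms a
    group of one.
    """
    by_inode = defaultdict(set)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            st = os.lstat(full)
            by_inode[(st.st_dev, st.st_ino)].add(os.path.relpath(full, root))
    return {frozenset(paths) for paths in by_inode.values()}

def same_link_structure(src, dest):
    """True if src and dest group the same relative paths together."""
    return link_groups(src) == link_groups(dest)
```

For a dest in which f1 and f2 share an inode while src has them separate, `same_link_structure("src", "dest")` would return False.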

> <remainder of report>
not sure how to read that. though i think the current behaviour is correct.

as an aside, any change to:
 name, perm, uid, gid, size, contents, mtime, existence,
 symlink '-> source' or hardlink relationship
will/should cause the -Ha --link-dest --delete version to be ignored and a
new copy to be made in dest. there is both mtime and hardlink change in the
example.
Comment 2 Matt McCutchen 2006-04-17 15:23:20 UTC
(In reply to comment #1)
> the contents are the same but some of the meta information is not.

Oops: the mtimes were different.  Now I'm not sure how I tickled the bug the first time.  Corrected script:

mkdir src dest linkdest
touch src/f1
rsync -a src/f1 src/f2
rsync -a src/f1 linkdest/f1
ln linkdest/f1 linkdest/f2
rsync -Ha src/ dest/ --link-dest=../linkdest/

> > Rsync links linkdest/f1 instead of copying src/f1 and links linkdest/f2
> > instead of copying src/f2.  Now dest/f1 and dest/f2 refer to the same
> > file while src/f1 and src/f2 refer to different files.
> 
> can you post the output inline with commands as i'm not seeing this here...

Please try again with the corrected script.  On my computer, "find . -ls" after the corrected script has finished produces the following output (some spaces removed to make it narrower):

  1908  0 drwx------  5 matt  matt  120 Apr 17 16:11 .
482092  0 drwx------  2 matt  matt   96 Apr 17 16:11 ./src
482105  0 -rw-------  1 matt  matt    0 Apr 17 16:11 ./src/f1
482106  0 -rw-------  1 matt  matt    0 Apr 17 16:11 ./src/f2
482098  0 drwx------  2 matt  matt   96 Apr 17 16:11 ./dest
482107  0 -rw-------  4 matt  matt    0 Apr 17 16:11 ./dest/f1
482107  0 -rw-------  4 matt  matt    0 Apr 17 16:11 ./dest/f2
482101  0 drwx------  2 matt  matt   96 Apr 17 16:11 ./linkdest
482107  0 -rw-------  4 matt  matt    0 Apr 17 16:11 ./linkdest/f1
482107  0 -rw-------  4 matt  matt    0 Apr 17 16:11 ./linkdest/f2

> # i modified the erroneous[?] link-dest usage of both a relative and a
> #  nonexistent path. in the past i had problems with relative. should
> #  rsync complain about nonexistence with -v?

No, my --link-dest usage is correct.  The description of --link-dest=DIR says, "If DIR is a relative path, it is relative to the destination directory."

Please try my corrected script and see if the rest of your remarks still apply.
Comment 3 hoffa 2006-04-17 21:20:19 UTC
> "If DIR is a relative path, it is relative to the destination directory."

apologies indeed, missed that in the man page.

> Please try my corrected script and see if the rest of your remarks still
> apply.

yep, that looks like a bug now. seems that dest/{f1,f2} should each get
their own unique inums as they are unique in the src and linkdest is just
a stale image of src. i used the cvs HEAD to test.

maybe rsync is seeing that src/{f1,f2} still have everything _but_ the inode
relationship in linkdest the same, assumes that's enough and links the
dest versions back to linkdest. -H would imply to check that too.

> ...

still not sure of the description of the proposed solution. seems that some
rather crazy hardlink counts would be out there in the wild but that as long
as the structures that hold the pictures for src, linkdest and dest are the
same, except for the inums between them and old non-matching gunk in linkdest,
it'd be cool. but hey, you're probably right, some of us people just have
brain drain from filing taxes ;-]


part 2...
and running a plain one does not fix them once broken either, yikes.
figured until fixed i could just run this over top of it and be done.
# /tmp/rsync -Haxv --delete ./src/ ./dest/
# find src linkdest dest -ls
550518    4 drwxr-xr-x    2 root     wheel   512 Apr 17 21:02 src
550488    0 -rw-r--r--    1 root     wheel     0 Apr 17 21:02 src/f1
550512    0 -rw-r--r--    1 root     wheel     0 Apr 17 21:02 src/f2
550482    4 drwxr-xr-x    2 root     wheel   512 Apr 17 21:02 linkdest
550264    0 -rw-r--r--    4 root     wheel     0 Apr 17 21:02 linkdest/f1
550264    0 -rw-r--r--    4 root     wheel     0 Apr 17 21:02 linkdest/f2
569089    4 drwxr-xr-x    2 root     wheel   512 Apr 17 21:02 dest
550264    0 -rw-r--r--    4 root     wheel     0 Apr 17 21:02 dest/f1
550264    0 -rw-r--r--    4 root     wheel     0 Apr 17 21:02 dest/f2

if you blow away dest and do a plain -Ha copy it works as expected.
i'd rather have an accurate copy over free cpu/ram if that matters.

doesn't look to be new as it's present in:
 rsync267
 rsync20050802
 rsync264pre2
 rsync263

cheers all.
Comment 4 Wayne Davison 2006-04-22 11:56:14 UTC
Rsync has other problems with outdated hard-links not being broken.  For instance:

echo data >foo
ln foo bar
rsync -aH foo bar dest/
rm bar
cp -p foo bar
rsync -aH foo bar dest/

That sequence will not break the hard-link that exists in the destination files.  However, if either of the files had been touched, the second rsync would have broken the link when updating the file (assuming that --inplace wasn't used).

The bug you cited with --link-dest springs from the same roots as this.  It would require the in-memory hashing of the inode of every hard-linked file on the receiving side for rsync to be able to break links that were no longer present, and that would be quite a lot of extra memory when using --link-dest to a large hierarchy of mostly unchanged files.

I don't see this being fixed soon, but I should take a look at it after I work on reducing rsync's memory requirements.
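
To make the cost described in this comment concrete, here is a rough sketch of what hashing destination inodes and breaking stale links could look like. It is not how rsync is implemented: `break_stale_links` is a hypothetical helper, it assumes source and destination are both local directories, and it takes the list of relative names as an argument.

```python
import os
import shutil
import tempfile
from collections import defaultdict

def break_stale_links(src_root, dest_root, names):
    """Break destination hard links that the source no longer has.

    names is an iterable of paths relative to both roots. Returns the list
    of names that were re-created on a fresh inode.
    """
    # Hash every destination name by inode; this table is the memory cost.
    dest_by_inode = defaultdict(list)
    for name in names:
        st = os.lstat(os.path.join(dest_root, name))
        dest_by_inode[(st.st_dev, st.st_ino)].append(name)

    broken = []
    for group in dest_by_inode.values():
        if len(group) < 2:
            continue
        # Within one destination inode, split the names by source inode.
        by_src = defaultdict(list)
        for name in group:
            st = os.lstat(os.path.join(src_root, name))
            by_src[(st.st_dev, st.st_ino)].append(name)
        # The first source group keeps the existing inode; every other
        # group gets a fresh copy that its own members are relinked to.
        for members in list(by_src.values())[1:]:
            first = os.path.join(dest_root, members[0])
            fd, tmp = tempfile.mkstemp(dir=os.path.dirname(first))
            os.close(fd)
            shutil.copy2(first, tmp)   # copy data and metadata
            os.replace(tmp, first)     # swap in the new inode atomically
            for other in members[1:]:
                other_path = os.path.join(dest_root, other)
                tmp_link = other_path + ".tmp-relink"
                os.link(first, tmp_link)
                os.replace(tmp_link, other_path)
            broken.extend(members)
    return broken
```

For the foo/bar sequence above, where dest/foo and dest/bar share an inode but the source files no longer do, the call would leave foo on the old inode and give bar a new one.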
Comment 5 Matt McCutchen 2007-12-29 15:36:24 UTC
Wayne, if you consider the breaking of outdated hard links not to be part of the expected behavior of -H, please add a clarification to this effect to the man page.
Comment 6 Matt McCutchen 2008-01-30 18:46:04 UTC
Created attachment 3129
Suggested clarification
Comment 7 Matt McCutchen 2008-02-06 21:55:53 UTC
There's another aspect to this problem: a file's attributes can be tweaked unexpectedly through an outdated hard link.  Example:

$ echo data >foo
$ chmod 600 foo
$ ln foo bar
$ rsync -aHi foo bar dest/
>f+++++++++ foo
hf+++++++++ bar => foo
$ rm bar
$ cp -a foo bar
$ chmod 644 bar
$ rsync -aHi foo bar dest/
.f...p..... bar
.f...p..... foo
$ rsync -aHi foo bar dest/
.f...p..... bar
.f...p..... foo
$ rsync -aHi foo bar dest/
.f...p..... bar
.f...p..... foo

--no-tweak-hlinked would fix this.
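
The underlying filesystem behavior is easy to demonstrate without rsync: permissions belong to the inode, so changing them through one name changes them for every name. The sketch below (the function name is made up for illustration) mirrors the foo/bar setup above, which would explain why each run keeps reporting a permission change for both names while the stale link remains in dest.

```python
import os
import stat
import tempfile

def mode_seen_through_other_link():
    """Create foo (mode 600), link bar to it, chmod bar, return foo's mode."""
    with tempfile.TemporaryDirectory() as d:
        foo = os.path.join(d, "foo")
        bar = os.path.join(d, "bar")
        with open(foo, "w") as f:
            f.write("data\n")
        os.chmod(foo, 0o600)
        os.link(foo, bar)        # bar is a second name for foo's inode
        os.chmod(bar, 0o644)     # changes the shared inode, not just "bar"
        return stat.S_IMODE(os.stat(foo).st_mode)
```

Calling `mode_seen_through_other_link()` returns `0o644`, even though foo itself was only ever set to 600.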
Comment 8 Matt McCutchen 2009-11-18 22:17:32 UTC
Wayne, I'd like to see the wording in the man page amplified to what I originally proposed in comment #6 to ensure that there is no confusion about the role of -H in a backup process.  This just came up on the rsnapshot list:

http://sourceforge.net/mailarchive/forum.php?thread_name=1258603850.25245.6.camel%40mattlaptop2.local&forum_name=rsnapshot-discuss
Comment 9 Matt McCutchen 2009-11-18 23:31:17 UTC
Created attachment 4964
Updated man page patch

This patch revises the --hard-links description to describe both cases (nonempty destination and --link-dest), intentionally leaving open the possibility of more (I had --detect-renamed-lax without --delete in mind but didn't want to mention it on the trunk).  I also took the opportunity to revise the --inplace description.  For posterity, here is the tug-of-war case I mentioned:

$ mkdir src dest
$ touch dest/1
$ ln dest/1 dest/2
$ echo foo >src/1
$ echo blort >src/2
$ rsync -rt --inplace -i src/ dest/
.d..t...... ./
>f.st...... 1
>f.st...... 2
$ rsync -rt --inplace -i src/ dest/
>f.st...... 1
$ rsync -rt --inplace -i src/ dest/
>f.st...... 2
$ rsync -rt --inplace -i src/ dest/
>f.st...... 1
$ rsync -rt --inplace -i src/ dest/
>f.st...... 2
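
The tug-of-war can be reproduced in miniature without rsync. The following sketch is a simplification under a stated assumption: each round first decides which names differ from the wanted content and only then rewrites them in place. That ordering was chosen because it matches the itemized output above; it is not a claim about rsync's internals. The function name `tug_of_war` is hypothetical.

```python
import os
import tempfile

def tug_of_war(rounds=4):
    """Simulate in-place updates of two destination names sharing one inode.

    Returns, for each round, the list of names that were rewritten.
    """
    wanted = {"1": "foo\n", "2": "blort\n"}
    history = []
    with tempfile.TemporaryDirectory() as d:
        paths = {name: os.path.join(d, name) for name in wanted}
        open(paths["1"], "w").close()
        os.link(paths["1"], paths["2"])   # dest/1 and dest/2 share an inode
        for _ in range(rounds):
            # Decide first which names are out of date ...
            stale = []
            for name, content in wanted.items():
                with open(paths[name]) as f:
                    if f.read() != content:
                        stale.append(name)
            # ... then rewrite them in place. Mode "w" truncates the
            # existing inode, so the other name changes as well.
            for name in stale:
                with open(paths[name], "w") as f:
                    f.write(wanted[name])
            history.append(stale)
    return history
```

With the default of four rounds this returns `[["1", "2"], ["1"], ["2"], ["1"]]`, the same alternation as in the transcript: the two names never settle because every write to one undoes the other.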
Comment 10 jcea 2012-08-10 01:51:18 UTC
This bug is still present in 3.0.9. See https://lists.samba.org/archive/rsync/2012-August/027799.html
Comment 11 Linda Walsh 2012-08-12 23:48:22 UTC
Note -- This shouldn't be classified as an enhancement: if the -H option is used, it is supposed to duplicate the hard link structure on the source.  Not doing so is a bug.

cp in coreutils had (and may still have) a bug when copying that is
related to this exact same thing...

copying from a case-preserving but case-ignoring OS (windows) to
an OS where case is significant (and the files were hard linked to each other).

cp wanted to copy Afile into a dir that had both 'afile' & 'Afile'.  It thought
it needed to remove [aA]file and copy over only Afile, as it was an updated
version (while 'afile' still existed on source as a separate older file)....

hmmm....this sounds like a similar case.

Unfortunately, they passed it off as a cygwin only bug -- but they couldn't
tell the difference between lower and upper case versions of the same file
when they were linked -- even though they can show up in a dir listing as separate.

Dunno if it got fixed or not -- I stopped using cp -u where there was danger
of hard links...AFAIK, they never fixed it because it was tossed off to the cygwin group who promptly forgot about it.
Comment 12 Ben Millwood 2024-06-14 16:05:27 UTC
hi folks, I've run into this problem in a couple of cases that I think haven't been mentioned so far:

- Every month I rsync my boot disk to an external disk and then take a ZFS copy-on-write snapshot of the external disk. This means I want to use --inplace --no-whole-file to avoid writing (therefore copying) more than necessary, but in combination with --hard-links this can result in incorrect file content (as others have described) if I initially hard link some files but then later unlink and then modify them.
- Credit where it's due, the man page does warn me of the above problem, so currently I just don't use --inplace --no-whole-file and take the disk usage hit. However, although this eliminates problems with file content being incorrect, AIUI it still only breaks hard links when one of the formerly-hardlinked files changes, which means I still can end up with destination hard links that aren't on the source if the source files weren't changed when the link was broken. In practice I can't think why I would break a hard link but keep the contents the same, but it's frustrating to have this barrier in the way of me being able to say without reservation that my backup replicates my filesystem exactly, and that I will be able to restore from my backup without causing any problems.

Overall, then, I'm a big fan of the idea that --hard-links (or if not, then some additional flag) should make the target directory look exactly like the source directory in link structure.