Bug 4128 - Make --ignore-times work better with --link-dest by adding an after-transfer check
Status: ASSIGNED
Alias: None
Product: rsync
Classification: Unclassified
Component: core
Version: 2.6.9
Hardware: x86 Linux
Importance: P3 enhancement
Target Milestone: ---
Assignee: Wayne Davison
QA Contact: Rsync QA Contact
URL:
Keywords:
Depends on: 5583
Blocks:
Reported: 2006-09-27 19:49 UTC by Joachim Wagner
Modified: 2008-08-18 09:00 UTC
Description Joachim Wagner 2006-09-27 19:49:02 UTC
Hi,

I checked the following with rsync 2.6.8 from Fedora Core 5 (updated this week) and the current 2.6.9cvs. The behaviour differs between the two, but neither does what I expected. The man page says that --ignore-times switches off any quick checks. Therefore, I concluded that this option makes sure that target data is correct in any case. Elsewhere, I even read that --ignore-times is an alternative to --checksum and that one or the other can be preferred depending on how many files are expected to be the same. However, when used together with --link-dest, the following happens.

In 2.6.8, --ignore-times with --link-dest doesn't detect at all that files on the receiver can be hard-linked. This problem is known to the developers, as revision r 1.273 of generator.c attempts to fix it. At least rsync 2.6.8 uses the files in the link-dest directory to reduce network traffic, basically copying the file within the receiver.

With 2.6.9cvs (2006-09-27, generator.c revision r 1.285), the same options cause the files to be hard-linked without being verified to have identical content. (Note: I installed rsync locally on both machines (configure --prefix=$HOME) and used option --rsync-path=$HOME/bin/rsync, see below.)

Test script: Two machines A and B, Same user + numeric IDs.
(I used Pentium 4 PCs with Fedora C5, updated with default repositories). 

export B=192.168.0.20   # <-- set this to the 2nd machine to be able to copy and paste from here (you might also want to configure ssh to avoid typing login passwords again and again)
# step 1 - prepare data
echo "one" >test1.txt
echo "two" >test2.txt
mkdir -p ref/data
cp test1.txt ref/data/text101.txt
cp test1.txt ref/data/text102.txt
touch -d 060927 ref/data/*
mkdir data
cp test1.txt data/text101.txt
cp test2.txt data/text102.txt
touch -d 060927 data/*
mkdir dst
rsync -av ref dst `whoami`@$B:./
# step 2 - test rsync
rsync -av --ignore-times --link-dest=../ref/ data `whoami`@$B:dst/
# note: dest=ref/ would be relative to dst/
# note2: if you had to type in a password for the first rsync,
# copying'n'pasting the 2nd rsync might not have worked in one go
# step 3 - analyze results
ssh `whoami`@$B 'ls -li dst/data/ ; cat dst/data/*'
# note: 3rd column gives the hard-link count

# cleaning up for next run
ssh `whoami`@$B 'rm -f dst/data/*'
# test newest version
$HOME/bin/rsync -av --rsync-path=$HOME/bin/rsync --ignore-times \
--link-dest=../ref/ data `whoami`@$B:dst/
ssh `whoami`@$B 'ls -li dst/data/ ; cat dst/data/*'
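
When reading the step 3 output, the hard-link state can also be checked directly instead of by eye. The following is a minimal sketch and not part of the original test script: check_link is a hypothetical helper name, and the demonstration builds a local scratch tree rather than using machine B. It relies on the -ef file test (same device and inode), which bash, dash and ksh support.

```shell
#!/bin/sh
# Hypothetical helper: report whether a destination file is hard-linked to
# its --link-dest counterpart, or at least has the same content.
# $1 = file in dst/data, $2 = counterpart in ref/data
check_link() {
    if [ "$1" -ef "$2" ]; then
        echo "linked"            # same inode: hard link
    elif cmp -s "$1" "$2"; then
        echo "copy-identical"    # separate inode, same bytes
    else
        echo "different"         # separate inode, different bytes
    fi
}

# Local demonstration in a scratch directory (no second machine needed).
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
mkdir -p "$tmp/ref/data" "$tmp/dst/data"
echo one > "$tmp/ref/data/text101.txt"
echo one > "$tmp/ref/data/text102.txt"
ln "$tmp/ref/data/text101.txt" "$tmp/dst/data/text101.txt"   # simulate a hard link
echo two > "$tmp/dst/data/text102.txt"                        # simulate a real transfer

for f in text101.txt text102.txt; do
    echo "$f: $(check_link "$tmp/dst/data/$f" "$tmp/ref/data/$f")"
done
# prints "text101.txt: linked" and "text102.txt: different"
```

On B the helper would be called as check_link dst/data/text102.txt ref/data/text102.txt. A result of "linked" for text102.txt would indicate the behaviour reported for 2.6.9cvs, since the source file differs from the reference copy.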

Of course, it can be argued that my conclusion is wrong and the long description in the man page misleading. --ignore-times simply does what it says: it ignores time stamps. However, the consequences when used with other options should be reasonable, or at least be documented.

Motivation for combining these options: Machine B is a mirror of machine A. Unfortunately, machine A turned out to have had a hardware defect that causes sporadic read errors. Files on B are likely to be damaged. Files on A might also be permanently damaged. For further analysis, I'd like to have all files on B. Without --link-dest, I don't have enough space. Without --ignore-times, files with same stat but with a bit error somewhere in the middle will not be detected.

I'll now reconsider using --checksum, although it seems to waste a lot of time: it calculates checksums sequentially, first on machine A while B is idle, then presumably on machine B while A is idle (I didn't get this far, as I got impatient after 6 hours of high CPU and disk I/O load on A), and only then applies the normal rsync algorithm to those files that are not identical. But this is a different story.

Regards,
Joachim
Comment 1 Joachim Wagner 2006-09-28 04:44:18 UTC
Workaround
==========

For those who find this report while searching for a solution:

# clean up previous experiment
export B=192.168.0.20
ssh `whoami`@$B 'rm -f dst/data/*'
# run rsync with (default) quick check
rsync -av --link-dest=../ref/ data `whoami`@$B:dst/
# find out which files are wrong
md5sum data/*
ssh $B
md5sum dst/data/*
# delete these files on B
rm dst/data/text102.txt
exit
# rsync again without --link-dest to fill gaps
rsync -av data `whoami`@$B:dst/
# everything is now fine:
ssh `whoami`@$B 'ls -li dst/data/ ; cat dst/data/*'

Notes:
 * log in on B in a 2nd terminal to calculate checksums in
   parallel
 * if there is a risk that file differences are hand-crafted
   to be invisible to MD5 (in recent years feasible ways of
   doing this have been published), you should use slower but
   more secure sha1sum or sha512sum
 * for more than just a small directory, use something like
   find -type f -print0 | xargs --null md5sum | sort -k2 >B.md5
   (see also "! -links 1" in 2nd bullet point below)
 * diff --speed-large-files A.md5 B.md5 should do fine in most
   cases to identify the differences
 * the only advantage over rsync --checksum is parallel checksum
   calculation; we are still wasting time on files that failed
   the quick check (time + size); improvement: generate file list
   with find -type f ! -links 1 -print0 on machine B and copy
   it to A (! means "not" for find)
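
The manifest comparison from the notes above can be sketched as follows. This is only an illustration under assumptions: manifest is a hypothetical helper name, the two trees live in a local scratch directory instead of on machines A and B, and file names are assumed to contain no newlines or backslashes (md5sum escapes those). It uses the same GNU-style find, xargs and md5sum tools as the notes.

```shell
#!/bin/sh
# Hypothetical helper: print "<md5>  <relative path>" for every regular file
# below directory $1, sorted by path so two manifests line up for diff.
manifest() {
    ( cd "$1" && find . -type f -print0 | xargs -0 md5sum | sort -k2 )
}

# Local stand-ins for the trees on machine A and machine B.
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
mkdir -p "$tmp/A/data" "$tmp/B/data"
echo one > "$tmp/A/data/text101.txt"
echo two > "$tmp/A/data/text102.txt"
echo one > "$tmp/B/data/text101.txt"
echo one > "$tmp/B/data/text102.txt"    # wrong content on B

manifest "$tmp/A/data" > "$tmp/A.md5"
manifest "$tmp/B/data" > "$tmp/B.md5"

# Lines marked "<" by diff exist only in A.md5: those files are missing or
# different on B. Strip the marker and the checksum to leave the path.
changed=$(diff "$tmp/A.md5" "$tmp/B.md5" | sed -n 's/^< [0-9a-f]*  //p')
echo "$changed"    # prints ./text102.txt
```

The resulting list is what would be deleted on B before the second rsync run without --link-dest.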
   
BTW: I can confirm that rsync --checksum would work as expected:
ssh `whoami`@$B 'rm -f dst/data/*'
rsync -av --checksum --link-dest=../ref/ data `whoami`@$B:dst/
ssh `whoami`@$B 'ls -li dst/data/ ; cat dst/data/*'

Have fun,
Joachim
Comment 2 Wayne Davison 2006-09-30 10:34:14 UTC
The --link-dest option only links files together that are found to be identical during the pre-transfer identicality checking, never as an extra check after a file has been updated (though it would be nice to add that as an improvement at some point, it doesn't work that way at present).

So, I think the new behavior in CVS is misguided in its attempt to make the --link-dest option play nice with the --ignore-times option.  I've removed that code and also added a mention to the --link-dest section that --ignore-times will prevent any hard-linking from occurring (though it will use the hierarchy to make the transfers more efficient).

I'll turn this bug report into a feature request:

It would be nice if the receiver would notice that it copied 100% of the data from a --link-dest basis file to the destination temp file while also having all preserved attributes the same.  Such a file would get its temp file dropped and the destination file hard-linked to the --link-dest basis file.
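
The requested behaviour can be pictured with a small shell sketch. This is not rsync's code and not a proposed patch: finish_transfer is a hypothetical name, the check is done by comparing bytes with cmp rather than by tracking how much data was copied from the basis file, and the modification time stands in for "all preserved attributes".

```shell
#!/bin/sh
# Hypothetical sketch of an after-transfer check.
# $1 = freshly written temp file, $2 = --link-dest basis file, $3 = destination
finish_transfer() {
    if cmp -s "$1" "$2" && [ ! "$1" -nt "$2" ] && [ ! "$2" -nt "$1" ]; then
        rm -f "$1"          # same data and same mtime: drop the temp file ...
        ln -f "$2" "$3"     # ... and hard-link the destination to the basis file
        echo "linked"
    else
        mv -f "$1" "$3"     # data or mtime differ: keep the transferred copy
        echo "copied"
    fi
}

# Local demonstration in a scratch directory.
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
echo one > "$tmp/basis1"
echo one > "$tmp/basis2"
echo one > "$tmp/.t1"; touch -r "$tmp/basis1" "$tmp/.t1"   # identical to its basis
echo two > "$tmp/.t2"; touch -r "$tmp/basis2" "$tmp/.t2"   # same mtime, new data

finish_transfer "$tmp/.t1" "$tmp/basis1" "$tmp/dest1"   # prints "linked"
finish_transfer "$tmp/.t2" "$tmp/basis2" "$tmp/dest2"   # prints "copied"
```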
Comment 3 Wayne Davison 2006-09-30 10:47:50 UTC
Let me comment about the checksum slowness:

There is a diff in the patches dir named early-checksum.diff that makes the receiver do its checksumming at the same time that the sender is doing its checksumming.  I'm considering including this patch, but I need to do some more performance testing first.
Comment 4 Matt McCutchen 2006-09-30 11:30:07 UTC
(In reply to comment #2)
> It would be nice if the receiver would notice that it copied 100% of the data
> from a --link-dest basis file to the destination temp file while also having
> all preserved attributes the same.

Along the same lines, if rsync notices that the temp file has the same data as the existing destination file, it could discard the temp file and instead tweak attributes of the existing destination file as it would have if the quick check had passed.  This way, if I wished to copy into a destination in which make is being used, I could disable --times, and the mtime of a destination file would only be hit if its data actually changed.
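
A minimal sketch of that idea, again outside rsync and with a hypothetical function name (install_if_changed): the temp file replaces the destination only when the bytes differ, so an unchanged destination keeps its old mtime and make sees nothing to rebuild. Attribute tweaking other than leaving the mtime alone is omitted here.

```shell
#!/bin/sh
# Hypothetical sketch: replace the destination only if the data changed.
# $1 = freshly written temp file, $2 = existing destination file
install_if_changed() {
    if [ -f "$2" ] && cmp -s "$1" "$2"; then
        rm -f "$1"          # same data: discard temp file, mtime of $2 untouched
        echo "unchanged"
    else
        mv -f "$1" "$2"     # new data: destination gets the temp file's mtime
        echo "updated"
    fi
}

# Local demonstration in a scratch directory.
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
echo one > "$tmp/dest"
touch -t 200609270000 "$tmp/dest"      # give the destination an old mtime
touch -t 200609270000 "$tmp/marker"    # reference file with the same mtime

echo one > "$tmp/.new"
install_if_changed "$tmp/.new" "$tmp/dest"   # prints "unchanged"

echo two > "$tmp/.new"
install_if_changed "$tmp/.new" "$tmp/dest"   # prints "updated"
```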
Comment 5 Matt McCutchen 2008-08-18 09:00:42 UTC
The post-transfer tweaks or --link-dest checks discussed here would be based on the identical-data check of bug 5583.