Bug 5583 - Don't write out an unchanged file if all the checksums matched
Summary: Don't write out an unchanged file if all the checksums matched
Status: ASSIGNED
Alias: None
Product: rsync
Classification: Unclassified
Component: core
Version: 2.6.9
Hardware: x86 Linux
Importance: P3 enhancement
Target Milestone: ---
Assignee: Wayne Davison
QA Contact: Rsync QA Contact
URL:
Keywords:
Depends on:
Blocks: 3229 4128
Reported: 2008-07-03 21:08 UTC by Luke
Modified: 2021-01-15 14:22 UTC
CC List: 2 users

See Also:


Description Luke 2008-07-03 21:08:41 UTC
I am using rsync to update compact flash cards and would like to minimise the write cycles on them.  The cards contain root FSs for a number of identical robots that differ only in UUIDs, MAC addresses, hostnames etc.  A large number of files are generated specially for the update (and thus always have different timestamps from the existing files on the card) but almost always correspond exactly to the files already on the CF card.  My rsync -i output is full of:

>f..T...... etc/hostname

And similar entries for other files, where the only thing being changed is the timestamp (set to the transfer time), which I don't care about.

I accept that the files have to be transferred as the timestamp is different, but I don't see the point of writing the file if none of the data has changed.

I spent a significant amount of time trying to find an option that would prevent this with no luck, apologies if I have overlooked something.

Finally, this is somewhat similar to bug 3229 but a little different in that it's not to do with the backup function.
Comment 1 Matt McCutchen 2008-07-03 21:31:42 UTC
Try --checksum.
Comment 2 Luke 2008-07-04 00:18:43 UTC
Thanks for your comment, Matt, but --checksum takes more than a hundred times longer.  I cancelled it as I couldn't be bothered waiting.  It calculates the checksum of every file on the source system, regardless of size or timestamp, before continuing.  It might be a solution if it were possible to calculate checksums only when the timestamps differ; however, as I said, no such option seems to exist.
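
The behaviour asked for here (checksum only when the quick check fails on the timestamp alone) can be sketched as follows. This is a local-only illustration, not an existing rsync option: the function names `file_digest` and `needs_rewrite` are hypothetical, and SHA-256 is an arbitrary choice of strong checksum.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 16):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def needs_rewrite(src, dst):
    """Decide whether dst has to be rewritten from src.

    Hypothetical policy: hash the pair only when the sizes match
    but the modification times differ.
    """
    if not os.path.exists(dst):
        return True                      # nothing there yet
    s, d = os.stat(src), os.stat(dst)
    if s.st_size != d.st_size:
        return True                      # data must differ
    if int(s.st_mtime) == int(d.st_mtime):
        return False                     # quick check passes, no hashing
    # Same size, different timestamp: only now pay for the checksums.
    return file_digest(src) != file_digest(dst)
```

With this policy a file is hashed only in the `>f..T......` case from the description, so the cost is proportional to the number of files whose timestamps changed rather than to the whole tree.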
Comment 3 Wayne Davison 2008-07-04 03:13:54 UTC
I have thought about trying to optimize out such a rewrite, and it is possible, but only by delaying the point at which the receiver begins its update.  This could slow things down if the file is actually different, but would speed things up if the files were really the same.  I can see two different places to put this logic:

One would be to have the receiver delay starting a temp file until it notices that the sender has told it about a changed part of the basis file.  At that point, it would need to create a temporary file, open the basis file, do a basis copy from 0 to the current position, and then proceed normally with the rest of the copy.  However, if no difference was found, the update would not be needed, and would be discarded.  (One potential issue: the receiver would need to have a way to get the full-file checksum from the generator so that it could do a double-check against the sender's full-file checksum, since it will not have computed one.)
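
A minimal sketch of this receiver-side idea, under stated assumptions: the token format (`("match", offset, length)` and `("data", bytes)`) and the function name `apply_delta` are invented for illustration and are not rsync's wire protocol, and the full-file checksum double-check mentioned above is left out.

```python
import os
import tempfile


def apply_delta(basis_path, tokens):
    """Apply delta tokens to a basis file, creating a temp file only if needed.

    tokens: iterable of ("match", offset, length) or ("data", bytes).
    Returns the temp file path if a rewrite was needed, otherwise None.
    """
    basis_size = os.path.getsize(basis_path)
    pos = 0          # number of output bytes accounted for so far
    out = None       # temp file object, opened lazily
    tmp_path = None
    with open(basis_path, "rb") as basis:

        def start_temp():
            # First real difference: create the temp file and catch up
            # by copying basis bytes 0..pos (read in one go for brevity).
            nonlocal out, tmp_path
            fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(basis_path) or ".")
            out = os.fdopen(fd, "wb")
            basis.seek(0)
            out.write(basis.read(pos))

        for tok in tokens:
            if tok[0] == "match":
                _, offset, length = tok
                if out is None and offset == pos:
                    pos += length        # still identical to the basis so far
                    continue
                if out is None:
                    start_temp()         # match from elsewhere in the basis
                basis.seek(offset)
                out.write(basis.read(length))
                pos += length
            else:
                if out is None:
                    start_temp()         # literal data from the sender
                out.write(tok[1])
                pos += len(tok[1])
        if out is None and pos != basis_size:
            start_temp()                 # new file is a prefix of the basis
        if out is not None:
            out.close()
    return tmp_path
```

A return value of `None` means every token matched the basis in place, so nothing was written and the existing file can stay where it is.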

Another option would be to put the short-circuit into the sender's logic so that it doesn't tell the receiver to do anything until it first finds a file difference.  The protocol would be extended to have a way to convey to the receiver that the file doesn't need any updates (since the receiver probably needs to do its post-transfer attribute updating, and may need to notify the generator that the file is done).  We'd still need a solution to the full-file checksum verification.
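
The sender-side variant could look roughly like the sketch below. It is heavily simplified and makes several assumptions: blocks are compared at aligned offsets only (no rolling checksum), SHA-256 stands in for the block checksum, and the `("unchanged",)` message is a hypothetical protocol extension of the kind described above, not something rsync currently sends.

```python
import hashlib


def block_sums(data, block_size):
    """Generator side: strong checksum of each block of the basis file."""
    return [hashlib.sha256(data[i:i + block_size]).digest()
            for i in range(0, len(data), block_size)]


def sender_messages(src, basis_sums, basis_len, block_size):
    """Yield messages for the receiver, staying silent while blocks match.

    Simplification: aligned block comparison only, no rolling checksum.
    """
    held = []            # in-sequence matches that have not been sent yet
    diverged = False
    nblocks = (len(src) + block_size - 1) // block_size
    for i in range(nblocks):
        chunk = src[i * block_size:(i + 1) * block_size]
        same = (i < len(basis_sums)
                and hashlib.sha256(chunk).digest() == basis_sums[i])
        msg = ("match", i) if same else ("data", chunk)
        if same and not diverged:
            held.append(msg)             # keep quiet for now
            continue
        if not diverged:
            diverged = True
            yield from held              # first difference: flush the backlog
        yield msg
    if not diverged:
        if len(src) == basis_len:
            yield ("unchanged",)         # hypothetical "no update needed" message
        else:
            yield from held              # basis is longer: receiver must truncate
```

For an identical file the receiver sees a single `("unchanged",)` message, which is the point at which it could do its attribute updates and notify the generator without ever opening a temp file.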

One other option that is available now is to use one of the checksum caching patches from the patches directory (such as the one that caches file-info in a DB and associates the last-known attributes with a checksum, allowing rsync to more quickly notice when files are the same).
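
The general idea behind such a checksum cache can be sketched as below. This is not the schema or code of any of the actual patches: the table layout and the names `open_cache` and `cached_digest` are assumptions made for illustration.

```python
import hashlib
import os
import sqlite3


def open_cache(db_path=":memory:"):
    """Open (or create) a cache mapping path + attributes to a checksum."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sums ("
        "path TEXT PRIMARY KEY, size INTEGER, mtime_ns INTEGER, digest TEXT)"
    )
    return conn


def cached_digest(conn, path):
    """Return (digest, was_cached), rehashing only if size or mtime changed."""
    st = os.stat(path)
    row = conn.execute(
        "SELECT size, mtime_ns, digest FROM sums WHERE path = ?", (path,)
    ).fetchone()
    if row and row[0] == st.st_size and row[1] == st.st_mtime_ns:
        return row[2], True              # attributes unchanged: trust the cache
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    conn.execute(
        "INSERT OR REPLACE INTO sums VALUES (?, ?, ?, ?)",
        (path, st.st_size, st.st_mtime_ns, digest),
    )
    conn.commit()
    return digest, False
```

The cache only helps for files whose size and timestamp are stable between runs; a file that is regenerated with a new timestamp is rehashed every time.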
Comment 4 Henrik Langos 2009-11-13 04:49:40 UTC
Here's my "me too" comment on the issue (feel free to move it to a separate bug depending on this one):

I have stumbled upon the same issue in connection with rsnapshot and rsync with the "--detect-renamed" patch. 

Basically rsnapshot works like this: On the first run it creates a full copy of a directory tree /src to /dst/0. Then the next time it rotates /dst/(x) to /dst/(x+1), creates a copy consisting only of hard links from /dst/1 to /dst/0, and then calls rsync to transfer the changes between /src and /dst/0, effectively creating a differential backup at the granularity of files.
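
The rotate-then-hard-link step described above can be sketched like this. It is an illustration of the scheme only, not rsnapshot's actual code: `rotate` and `link_tree` are hypothetical helper names, and symlinks, permissions and special files are ignored.

```python
import os
import shutil


def rotate(root, keep):
    """Shift root/0 .. root/(keep-2) up by one, dropping the oldest snapshot."""
    oldest = os.path.join(root, str(keep - 1))
    if os.path.isdir(oldest):
        shutil.rmtree(oldest)
    for n in range(keep - 2, -1, -1):
        cur = os.path.join(root, str(n))
        if os.path.isdir(cur):
            os.rename(cur, os.path.join(root, str(n + 1)))


def link_tree(src, dst):
    """Recreate src's directory tree at dst, hard-linking every regular file."""
    for dirpath, _dirnames, filenames in os.walk(src):
        rel = os.path.relpath(dirpath, src)
        target = os.path.normpath(os.path.join(dst, rel))
        os.makedirs(target, exist_ok=True)
        for name in filenames:
            os.link(os.path.join(dirpath, name), os.path.join(target, name))
```

After `rotate(root, keep)` and `link_tree(root + "/1", root + "/0")`, every file in /dst/0 shares its inode with /dst/1, so only files that rsync subsequently replaces consume new space.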

I applied the detect-renamed patch to avoid multiple copies of big files when they are moved around in the directory tree. 

The patch works insofar as it finds the correct base files in /dst.
Then it uses the delta algorithm to make sure that no coincidental match of filename, size and mtime results in a false positive.

Unfortunately, using the delta algorithm creates a new copy of the file at /dst even if the content is the same as the base file (instead of using a hard link to the base file).
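
One way to picture the desired outcome is the sketch below: once the delta transfer has produced a temp file, link to the base file instead whenever the data turns out to be identical. This is a hypothetical illustration, not how the --detect-renamed patch works; the function name `finish_file` is invented, and the approach saves disk space but not the write of the temp file itself.

```python
import filecmp
import os


def finish_file(base, tmp, dest):
    """Install tmp at dest, or hard-link dest to base if the data is identical.

    Returns "linked" or "copied".
    """
    if filecmp.cmp(base, tmp, shallow=False):    # byte-for-byte comparison
        os.unlink(tmp)                           # the new copy is redundant
        if os.path.exists(dest) and os.path.samefile(base, dest):
            return "linked"                      # already the same inode
        if os.path.lexists(dest):
            os.unlink(dest)
        os.link(base, dest)
        return "linked"
    os.replace(tmp, dest)                        # genuinely different content
    return "copied"
```

Avoiding the temp file as well would need one of the receiver-side or sender-side short-circuits discussed in comment 3.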