Bug 12570 - Problems with --checksum --existing
Status: RESOLVED LATER
Product: rsync
Classification: Unclassified
Component: core
Version: 3.1.1
Hardware: All
OS: All
Importance: P5 normal
Assignee: Wayne Davison
QA Contact: Rsync QA Contact
 
Reported: 2017-02-07 12:19 UTC by atom
Modified: 2020-07-27 21:23 UTC
Description atom 2017-02-07 12:19:08 UTC
Problem:

I've got an SD card with some movies, a few of which are corrupted files.

I want to re-copy only the files on the card that don't match the known-good source files.

command:
 rsync --checksum --existing -vhriP /movies/ /media/128-SD/Movies/

The problem here is that *all* files in "/movies/" are hashed before anything else happens. This can be verified with lsof: "lsof +D /movies".

I've got <100GB in "/media/128-SD/Movies/".

I've got >1.5TB in "/movies/", and hashing all of those files is just a huge waste of time and system resources.

When "--existing" and "--checksum" are both used, the algorithm should first make a list of candidate files, then start hashing. It should *not* start hashing everything on the send-side and then figure out which files might be needed.

Workaround for me:
 diff -r /movies/ /media/128-SD/Movies/ | grep differ | awk '{print "pv " $3" > "$5}' | sh

NB: that workaround requires "pv" and only works with file names that do not contain spaces, but for me it's a quick and easy way to see progress while files are being copied. "cp" would work fine in place of "pv".

On my system, that workaround saved me about 1-2 days of hashing and completed in less than an hour.
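For what it's worth, a space-safe variant of the same idea is possible with bash and cmp; a sketch under the same assumption that the receiver's tree mirrors the sender's:

 cd /media/128-SD/Movies
 find . -type f -print0 | while IFS= read -r -d '' f; do
     # re-copy from the sender only when the byte comparison fails
     cmp -s "/movies/$f" "$f" || pv "/movies/$f" > "$f"
 done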
Comment 1 Kevin Korb 2017-02-07 14:32:09 UTC
Unfortunately rsync's --checksum is just that dumb.  It checksums *EVERYTHING* on the source and the target before it does anything else.  Since --checksum is almost always the wrong thing to do, nobody seems to be willing to add basic intelligence to it.  Unfortunately, what you are trying to do is one of those few instances when --checksum is the right thing to use.  So, that is just the way it works.
Comment 2 Haravikk 2019-11-03 12:37:48 UTC
I was about to post basically the same issue, but found this.  I use rsync to do a lot of incremental backups where ZFS or similar isn't an option (not that common, but it still comes up now and then).  To guarantee correctness I like to run a periodic consistency check with --checksum, to be certain that none of the files have changed at rest on the receiver, just like scrubbing a ZFS pool from time to time.

The problem is that rsync's --checksum mode is insanely slow for a large number of files, much slower than it should be, even allowing for a slow sender or receiver.


I had always assumed that each end of rsync just set about gathering metadata in the background while communicating ("I have X with checksum Y" -> "I don't, send it", or such), but this doesn't appear to be the case with --checksum: it can take hours before anything even *begins* sending, let alone the actual time to finish.

It seems a lot like the incremental file list behaviour of modern rsync is being disabled when --checksum mode is enabled, but is there any good reason why that should be the case?

I can't think of any reason why it should be different: a checksum is ultimately just a value to be compared, like a file size and/or timestamp; it just takes a bit longer to generate each one.
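As an aside, that "just a value to be compared" model can be approximated outside rsync with a stored checksum manifest; a minimal sketch, assuming hypothetical local trees "/src" and "/dest":

 # record checksums on the sender once
 (cd /src && find . -type f -print0 | xargs -0 md5sum) > /tmp/manifest.md5
 # later, scrub the receiver at rest against that manifest
 (cd /dest && md5sum --quiet -c /tmp/manifest.md5)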
Comment 3 Wayne Davison 2020-07-27 21:23:29 UTC
A future version of rsync will hopefully have a revised protocol that supports a 2-step checksum process.  In the meantime, limiting the file list is exactly the right solution.