Bug 12570 - Problems with --checksum --existing
Status: NEW
Product: rsync
Classification: Unclassified
Component: core
Version: 3.1.1
Hardware: All All
Importance: P5 normal
Target Milestone: ---
Assigned To: Wayne Davison
QA Contact: Rsync QA Contact
Reported: 2017-02-07 12:19 UTC by atom
Modified: 2017-02-07 14:32 UTC

Description atom 2017-02-07 12:19:08 UTC
Problem:

I've got an SD card with some movies, a few of which are corrupted.

I want to re-copy only the files on the card that don't match the good copies in "/movies/".

command:
 rsync --checksum --existing -vhriP /movies/ /media/128-SD/Movies/

The problem here is that *all* files in "/movies/" are hashed before anything else happens. This can be verified with lsof: "lsof +D /movies".

I've got <100GB in "/media/128-SD/Movies/".

I've got >1.5TB in "/movies/", and hashing all of those files is just a huge waste of time and system resources.

When "--existing" and "--checksum" are both used, the algorithm should first make a list of candidate files, then start hashing. It should *not* start hashing everything on the send-side and then figure out which files might be needed.
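
The "candidates first, then hash" order can be approximated from outside rsync. This is a minimal sketch, not rsync's own behaviour: the function names list_candidates and sync_existing are made up for illustration. It assumes that a file list passed with --files-from limits the checksumming to the listed files, and that find, sh and rsync are available.

```shell
#!/bin/sh
# list_candidates SRC DST (hypothetical helper)
# Print, NUL-separated, the relative path of every regular file under DST
# that also exists under SRC. Nothing is hashed at this stage.
list_candidates() {
    src=$(cd "$1" && pwd) || return 1   # absolute path, since we cd below
    ( cd "$2" && find . -type f -exec sh -c '
        src=$1; shift
        for f in "$@"; do
            rel=${f#./}
            if [ -f "$src/$rel" ]; then printf "%s\0" "$rel"; fi
        done
    ' sh "$src" {} + )
}

# sync_existing SRC DST (hypothetical helper)
# Feed only the candidate list to rsync, so --checksum is applied to the
# listed files instead of the whole source tree.
sync_existing() {
    list_candidates "$1" "$2" |
        rsync --checksum --from0 --files-from=- -vhiP "$1" "$2"
}
```

For the case described here, the call would be: sync_existing /movies/ /media/128-SD/Movies/. Because the list only names files already on the destination, --existing is no longer needed.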

Workaround for me:
 diff -r /movies/ /media/128-SD/Movies/ | grep differ | awk '{print "pv " $3" > "$5}' | sh

NB: that workaround requires "pv" and only works with file names that do not contain spaces, but for me it's a quick and easy way to see progress while files are being copied. "cp" would work fine in place of "pv".
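
A variant of the workaround that tolerates spaces in file names could look like the sketch below. It is not the command actually run above: copy_differing is a made-up name, "cmp -s" stands in for "diff -r", and "cp" stands in for "pv", so there is no progress display.

```shell
#!/bin/sh
# copy_differing SRC DST (hypothetical helper)
# For every regular file under DST that also exists under SRC, compare the
# two byte by byte and overwrite the DST copy only when they differ.
copy_differing() {
    src=${1%/}
    dst=${2%/}
    find "$dst" -type f -exec sh -c '
        src=$1; dst=$2; shift 2
        for target in "$@"; do
            rel=${target#"$dst"/}               # path relative to DST
            [ -f "$src/$rel" ] || continue      # not in SRC: leave alone
            cmp -s "$src/$rel" "$target" && continue   # identical: skip
            printf "copying %s\n" "$rel"
            cp -- "$src/$rel" "$target"
        done
    ' sh "$src" "$dst" {} +
}
```

Usage would be: copy_differing /movies/ /media/128-SD/Movies/. Only files already present on the card are read, so the rest of "/movies/" is never touched.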

On my system, that workaround saved me about 1-2 days of hashing, and completed in less than an hour.
Comment 1 Kevin Korb 2017-02-07 14:32:09 UTC
Unfortunately, rsync's --checksum is just that dumb. It checksums *EVERYTHING* on the source and the target before it does anything else. Since --checksum is almost always the wrong thing to do, nobody seems to be willing to add basic intelligence to it. Unfortunately, what you are trying to do is one of those few instances when --checksum is the right thing to use. So, that is just the way it works.