Bug 10353 - link-by-hash collision detection
Status: RESOLVED WONTFIX
Product: rsync
Classification: Unclassified
Component: core
Version: 3.1.1
Hardware: All All
Importance: P5 enhancement
Target Milestone: ---
Assigned To: Wayne Davison
QA Contact: Rsync QA Contact
Depends on:
Blocks:
Reported: 2013-12-30 17:25 UTC by Jim Klimov
Modified: 2014-01-19 22:43 UTC (History)

Description Jim Klimov 2013-12-30 17:25:31 UTC
The link-by-hash feature should include a mode that verifies that the content of the original file is indeed identical to the content of the file it would be hardlinked to based on the hash value.

If the hash algorithm happens to be weak (allowing two files of the same size to have the same hash but different content, i.e. a hash collision), the hash filenames should include a unique suffix (e.g. 123abcd.1024.0;1 and 123abcd.1024.0;2 to differentiate two files with different contents). If such filename patterns exist, all of the copies should be considered as candidates for link-by-hash deduplication.
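
To make the request concrete, here is a minimal Python sketch of the proposed behaviour. It is not rsync's code. The function name `link_by_hash`, the `hash_fn` parameter, the temporary-file suffix and the simplified `<hash>.<size>;<n>` naming are all assumptions made for illustration, and the default MD5 digest is only a stand-in for whatever hash the feature actually uses.

```python
import filecmp
import hashlib
import os


def link_by_hash(src, store_dir, hash_fn=lambda data: hashlib.md5(data).hexdigest()):
    """Hardlink src to a content-identical entry in store_dir, or register it.

    Entries are named "<hash>.<size>;<n>" (hypothetical naming); n separates
    files whose hash and size match but whose contents differ.
    Returns the path of the store entry that src shares an inode with.
    """
    with open(src, "rb") as f:
        digest = hash_fn(f.read())
    base = f"{digest}.{os.path.getsize(src)}"

    n = 1
    while True:
        candidate = os.path.join(store_dir, f"{base};{n}")
        if not os.path.exists(candidate):
            # No stored file has this hash, size and content: register src.
            os.link(src, candidate)
            return candidate
        if os.path.samefile(src, candidate):
            # Already hardlinked to this entry; nothing to do.
            return candidate
        if filecmp.cmp(src, candidate, shallow=False):
            # Verified byte-for-byte match: swap src for a hardlink.
            tmp = src + ".lbh-tmp"
            os.link(candidate, tmp)
            os.replace(tmp, src)
            return candidate
        # Same hash and size but different content: try the next suffix.
        n += 1
```

Every candidate with a matching hash and size is compared byte for byte before linking, which is the verification cost that this request would add to each hash hit.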
Comment 1 Wayne Davison 2014-01-19 22:43:46 UTC
I added the size to the filename to avoid having to worry about this -- a hash conflict with the same file size is very (very) unlikely, and forcing the code to compare file contents before figuring out which hash conflict is the right one is super slow.
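
As a rough illustration of how unlikely an accidental conflict is, the birthday bound can be computed directly. This is a sketch under stated assumptions: the helper name `collision_probability` is hypothetical, the digest is assumed to be uniformly distributed, and the digest width is a parameter because the comment does not name one. The estimate covers accidental collisions only, not deliberately crafted ones, and it ignores the file size in the filename, which can only lower the odds further.

```python
import math


def collision_probability(n_files, hash_bits):
    """Birthday-bound estimate that any two of n_files share a digest."""
    pairs = n_files * (n_files - 1) / 2
    # 1 - exp(-pairs / 2**bits), computed stably for very small values.
    return -math.expm1(-pairs / 2.0 ** hash_bits)


# Example: one billion files and an assumed 128-bit digest.
print(collision_probability(10**9, 128))
```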