Bug 10353 - link-by-hash collision detection
Alias: None
Product: rsync
Classification: Unclassified
Component: core
Version: 3.1.1
Hardware: All All
Importance: P5 enhancement
Target Milestone: ---
Assignee: Wayne Davison
QA Contact: Rsync QA Contact
Depends on:
Reported: 2013-12-30 17:25 UTC by Jim Klimov
Modified: 2014-01-19 22:43 UTC
Description Jim Klimov 2013-12-30 17:25:31 UTC
The link-by-hash feature should include a mode that verifies that the content of the original file is indeed identical to the content of the file it would be hardlinked to based on the hash value.

If the hash algorithm happens to be weak (allowing two files of the same size to have the same hash but different content, i.e. a hash collision), the hash filenames should include a unique suffix (e.g. 123abcd.1024.0;1 and 123abcd.1024.0;2 to differentiate two files with different contents). If such filename patterns exist, all copies should be considered for link-by-hash deduplication.
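
A minimal sketch of the requested behaviour, in Python rather than rsync's C, and not taken from the rsync source. The function name `link_by_hash`, the simplified pool naming `<digest>.<size>;<n>`, and the temporary-file suffix are assumptions made for illustration. Every existing candidate with the same digest and size is compared byte for byte; the file is hardlinked only to a verified match, and otherwise it is registered under the next free suffix.

```python
import filecmp
import os


def link_by_hash(src, hash_dir, digest, size):
    """Hardlink src to a verified-identical pooled file, or add src to the pool.

    Pool entries are named "<digest>.<size>;<n>" (hypothetical naming scheme).
    Returns the pool path that src shares an inode with afterwards.
    """
    base = os.path.join(hash_dir, f"{digest}.{size}")
    n = 1
    while True:
        candidate = f"{base};{n}"
        if not os.path.exists(candidate):
            # No pooled file matched: register src under the next free suffix.
            os.link(src, candidate)
            return candidate
        if os.path.samefile(src, candidate):
            # Already linked to this pool entry; nothing to do.
            return candidate
        # shallow=False forces a comparison of the actual file contents.
        if filecmp.cmp(src, candidate, shallow=False):
            tmp = src + ".lbh-tmp"  # hypothetical temporary name
            os.link(candidate, tmp)
            os.replace(tmp, src)  # atomically swap src for the hardlink
            return candidate
        n += 1  # same digest and size, different content: try the next suffix
```

The cost is one full read of both files per candidate, which is the slowdown that a content-verifying mode would have to accept.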
Comment 1 Wayne Davison 2014-01-19 22:43:46 UTC
I added the size to the filename to avoid having to worry about this -- a hash conflict with the same file size is very (very) unlikely, and forcing the code to compare file contents before figuring out which hash conflict is the right one is super slow.
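
To put a rough number on "very unlikely", the birthday bound can be evaluated directly. This is only a sketch: the thread does not name the hash, so the 128-bit width used in the usage note is an assumption, as is the helper name `collision_probability`. It also assumes uniformly distributed hash values and the worst case in which every file has the same size, so it says nothing about deliberately constructed collisions against a weak hash.

```python
import math


def collision_probability(n_files, hash_bits):
    """Birthday-bound approximation: 1 - exp(-n(n-1) / 2^(bits+1))."""
    exponent = -n_files * (n_files - 1) / 2.0 ** (hash_bits + 1)
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(exponent)
```

Under those assumptions, `collision_probability(10**9, 128)` is on the order of 1e-21 for a billion same-sized files.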