The Samba-Bugzilla – Bug 10353
link-by-hash collision detection
Last modified: 2014-01-19 22:43:46 UTC
The link-by-hash feature should include a mode that verifies that a file's content is actually identical to the content of the file it would be hardlinked to based on the hash value.
If the hash algorithm happens to be weak (allowing two files of the same size with the same hash but different content, i.e. a hash collision), the hash filenames should include a unique suffix (e.g. 123abcd.1024.0;1 and 123abcd.1024.0;2 to differentiate two files with different contents). If such filename patterns exist, all copies should be considered for link-by-hash deduplication.
I added the size to the filename to avoid having to worry about this: a hash conflict between files of the same size is very (very) unlikely, and forcing the code to compare file contents to figure out which of the conflicting entries is the right one would be super slow.
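For illustration only, the verification mode requested above could look roughly like the following Python sketch. It is not rsync's code: the function names, the simplified `<hash>.<size>;<n>` entry naming, the use of MD5, and the `.lbh-tmp` temporary suffix are all assumptions made for the example. Each existing entry with the same hash and size is compared byte for byte, and a new numbered suffix is created only when none of them matches.

```python
import filecmp
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks (MD5 is an assumption)."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def link_by_hash(src, hash_dir, digest=None):
    """Hardlink src to a verified identical entry in hash_dir, or register it.

    Returns the hash-dir entry that src shares an inode with afterwards.
    The optional digest argument lets a caller supply a precomputed hash.
    """
    if digest is None:
        digest = file_digest(src)
    size = os.path.getsize(src)
    # Hypothetical, simplified naming: <hash>.<size>;<n>
    base = os.path.join(hash_dir, f"{digest}.{size}")
    n = 1
    while True:
        candidate = f"{base};{n}"
        if not os.path.exists(candidate):
            # No existing entry has identical content: register src
            # under the next free suffix.
            os.link(src, candidate)
            return candidate
        if os.path.samefile(src, candidate):
            # Already linked to this entry; nothing to do.
            return candidate
        if filecmp.cmp(src, candidate, shallow=False):
            # Verified byte-for-byte identical: swap src for a hardlink.
            tmp = src + ".lbh-tmp"
            os.link(candidate, tmp)
            os.replace(tmp, src)
            return candidate
        # Same hash and size but different content: a real collision,
        # so try the next suffix.
        n += 1
```

The full comparison runs only when an entry with the same hash and size already exists, which is exactly the case where a hardlink is about to be made, so the extra read is the price of the verification mode rather than a cost on every file.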