10353 – link-by-hash collision detection

Bug 10353 - link-by-hash collision detection

Summary: link-by-hash collision detection

Status:	RESOLVED WONTFIX

Alias:	None

Product:	rsync
Classification:	Unclassified
Component:	core (show other bugs)
Version:	3.1.1
Hardware:	All All

Importance:	P5 enhancement (vote)
Target Milestone:	---
Assignee:	Wayne Davison
QA Contact:	Rsync QA Contact

URL:
Keywords:

Depends on:
Blocks:

Reported:	2013-12-30 17:25 UTC by Jim Klimov
Modified:	2014-01-19 22:43 UTC (History)
CC List:	0 users

See Also:

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Jim Klimov 2013-12-30 17:25:31 UTC

The link-by-hash should include a mode to verify that the original file content is indeed identical to the content of the file into which it might be hardlinked per the hash value.

If the hash algorithm happens to be weak (allowing two files of the same size with same hash and different content - i.e. a hash collision), the hash-filenames should include a unique suffix (i.e. 123abcd.1024.0;1 and 123abcd.1024.0;2 to differentiate two files with different contents), and if such filename patterns exist - all copies should be considered for link-by-hash deduplication.

Comment 1 Wayne Davison 2014-01-19 22:43:46 UTC

I added the size to the filename to avoid having to worry about this -- a hash conflict with the same file size is very (very) unlikely, and forcing the code to compare file contents before figuring out which hash conflict is the right one is super slow.