10244 – link-by-hash patch: speed enhancement by hash calculation on source side

Status:	NEW

Alias:	None

Product:	rsync
Classification:	Unclassified
Component:	core
Version:	3.1.0
Hardware:	All All

Importance:	P5 enhancement
Target Milestone:	---
Assignee:	Wayne Davison
QA Contact:	Rsync QA Contact

URL:
Keywords:

Depends on:
Blocks:

Reported:	2013-11-03 19:21 UTC by Matthias Leipold
Modified:	2014-07-26 01:08 UTC
CC List:	0 users

See Also:

Description Matthias Leipold 2013-11-03 19:21:50 UTC

The link-by-hash patch works perfectly at reducing the storage needed on the destination. To do so, however, changed, nonexistent, renamed, or moved files are first transferred from source to destination. Only then is the hash of the file (the one used for link-by-hash) generated and the hash dir checked; if the file already exists there, the transferred file is replaced by a hard link into the hash dir.

When synchronizing two PCs or servers over a network (especially the Internet), a lot of network capacity and time could be saved if the hash (for link-by-hash) were generated by the source-side instance of rsync. This hash could then be sent to the destination rsync to check whether the file already exists in the hash dir. If it does, only the hard link needs to be created and no file transfer is necessary.
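
The proposed flow can be illustrated with a minimal sketch. This is not rsync code and not the link-by-hash patch itself: the function names `file_hash` and `link_if_known`, the use of SHA-256, and the flat hash dir layout (one entry per file, named by its hex digest) are all assumptions made for illustration only.

```python
import hashlib
import os


def file_hash(path, chunk_size=1 << 20):
    """Source side: hash the file contents in chunks (SHA-256 is an assumption)."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def link_if_known(hash_dir, hex_digest, dest_path):
    """Destination side: hard-link dest_path to a matching hash dir entry.

    Assumes a hypothetical flat layout: hash_dir/<hex digest>.
    Returns True if the link was created (no transfer needed),
    False if the content is unknown and the file must be transferred.
    """
    candidate = os.path.join(hash_dir, hex_digest)
    if not os.path.isfile(candidate):
        return False
    os.makedirs(os.path.dirname(dest_path) or ".", exist_ok=True)
    if os.path.lexists(dest_path):
        os.remove(dest_path)  # replace a stale copy with the link
    os.link(candidate, dest_path)
    return True
```

In this sketch only the digest would have to cross the network before the destination decides whether a transfer is needed, which is also why a renamed or moved file would cost nothing more than a hard link.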

If possible, this would not only speed up the "file transfer" but also solve the problem of renamed and moved files (at least in a setup with link-by-hash).

Could you please check whether the described setup is possible?

Thanks in advance.

Comment 1 Dave Yost 2014-07-26 01:08:58 UTC

rsync --link-dest could try a bit harder to find candidates for a hard link.

I suggest an option to rsync that works like this when you give it a file size argument:

Before copying, on the destination end, rsync makes a list of large files, like this:
  find /path-to-link-dest/dir -size +100M

While copying, when rsync encounters a file that can't be linked normally, and the file is larger than the threshold, rsync tries to link it with a candidate from the list before giving up and making a new copy.

The threshold idea is to make rsync faster by not spending time on small files.
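
A rough sketch of the idea, under stated assumptions: the comment does not say how a candidate would be matched, so matching by equal size followed by a byte-for-byte comparison is an assumption here, as are the function names `index_large_files` and `try_link`. For simplicity both files are read locally; in a real transfer the new file lives on the source, so the comparison would have to use checksums instead.

```python
import filecmp
import os


def index_large_files(link_dest, threshold):
    """Build {size: [paths]} for regular files under link_dest larger than threshold bytes."""
    by_size = {}
    for root, _dirs, names in os.walk(link_dest):
        for name in names:
            path = os.path.join(root, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            size = os.path.getsize(path)
            if size > threshold:
                by_size.setdefault(size, []).append(path)
    return by_size


def try_link(new_file, dest_path, by_size, threshold):
    """Hard-link dest_path to a candidate with identical content, if any.

    Files at or below the threshold are skipped so no time is spent on them.
    Returns True if a link was made, False if a normal copy is still needed.
    """
    size = os.path.getsize(new_file)
    if size <= threshold:
        return False
    for candidate in by_size.get(size, []):
        # Assumption: equal size plus identical bytes counts as a match.
        if filecmp.cmp(new_file, candidate, shallow=False):
            os.link(candidate, dest_path)
            return True
    return False
```

Indexing by size keeps the lookup cheap: a full comparison is only attempted for the few candidates whose size matches exactly.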

On the destination, rsync could use threads to overlap some of the computation.