Bug 10244 - link-by-hash patch: speed enhancement by hash calculation on source side
Summary: link-by-hash patch: speed enhancement by hash calculation on source side
Status: NEW
Alias: None
Product: rsync
Classification: Unclassified
Component: core
Version: 3.1.0
Hardware: All
OS: All
Importance: P5 enhancement
Target Milestone: ---
Assignee: Wayne Davison
QA Contact: Rsync QA Contact
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2013-11-03 19:21 UTC by Matthias Leipold
Modified: 2014-07-26 01:08 UTC

See Also:


Description Matthias Leipold 2013-11-03 19:21:50 UTC
The link-by-hash patch works very well at reducing the storage needed on the destination. But to get there, changed/missing/renamed/moved files are first transferred from source to destination; only then is the link-by-hash hash of the file computed, the hash dir checked, and, if the content already exists there, the freshly transferred file replaced by a hard link into the hash dir.

When synchronizing two PCs/servers over a network (especially the Internet), a lot of bandwidth and time could be saved if the link-by-hash hash were computed by the source-side rsync instance up front. That hash could then be sent to the destination rsync to check whether the file already exists in the hash dir. If it does, only the hard link needs to be created and no file transfer is necessary.
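
A minimal sketch of the proposed exchange, in Python. Everything here is an illustrative assumption, not the patch's actual code: HASH_DIR, the function split, and the md5 choice merely stand in for whatever digest and layout the link-by-hash patch really uses.

  import hashlib
  import os

  HASH_DIR = "/backup/.hash-dir"   # hypothetical hash-dir location

  def source_hash(path):
      # Source side: hash the file before any data is sent.
      h = hashlib.md5()
      with open(path, "rb") as f:
          for chunk in iter(lambda: f.read(1 << 16), b""):
              h.update(chunk)
      return h.hexdigest()

  def destination_check(digest, dest_path):
      # Destination side: if the hash dir already holds this content,
      # hard-link it and tell the sender to skip the transfer.
      hashed = os.path.join(HASH_DIR, digest)
      if os.path.exists(hashed):
          os.link(hashed, dest_path)   # no file data crosses the network
          return "linked"
      return "send-file"               # fall back to a normal transfer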

If possible, this would not only speed up the file transfer but also solve the problem of renamed and moved files (at least in a setup using link-by-hash).

Could you please check whether the described approach would be feasible?

Thanks in advance.
Comment 1 Dave Yost 2014-07-26 01:08:58 UTC
rsync --link-dest could try a bit harder to find candidates for a hard link.

I suggest an rsync option that works like this when given a file-size argument:

Before copying, on the destination end, rsync makes a list of large files, like this:
  find /path-to-link-dest/dir -size +100M

While copying, when rsync encounters a file that can't be linked normally and that is larger than the threshold, rsync tries to link it to a candidate from the list before giving in and making a new copy.

The threshold idea is to make rsync faster by not spending time on small files.
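
A rough sketch of that matching step in Python. The helpers, the digest choice, and the size threshold are illustrative assumptions (no such rsync option exists); real rsync would do the content check through its existing checksum machinery rather than a separate whole-file digest.

  import hashlib
  import os
  import subprocess

  def file_digest(path):
      # Whole-file digest; stands in for rsync's own checksum exchange.
      h = hashlib.md5()
      with open(path, "rb") as f:
          for chunk in iter(lambda: f.read(1 << 16), b""):
              h.update(chunk)
      return h.hexdigest()

  def large_candidates(link_dest, size="+100M"):
      # Pre-scan the link-dest tree for large files, as with `find -size`.
      out = subprocess.run(
          ["find", link_dest, "-type", "f", "-size", size],
          capture_output=True, text=True, check=True)
      return out.stdout.splitlines()

  def try_link(sender_size, sender_digest, dest_path, candidates):
      # Before giving in and making a new copy, look for a candidate
      # with the same size and content and hard-link it instead.
      for cand in candidates:
          if os.path.getsize(cand) != sender_size:
              continue                     # cheap size filter first
          if file_digest(cand) == sender_digest:
              os.link(cand, dest_path)     # identical content: link it
              return True
      return False                         # no match: copy normally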

On the destination, rsync could use threads to overlap some of the computation.
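
A toy illustration of that overlap, reusing the helpers from the sketch above (the thread count and the up-front digesting of all candidates are arbitrary assumptions, not how rsync is structured):

  from concurrent.futures import ThreadPoolExecutor

  # Digest the large candidates in background threads; hashlib releases
  # the GIL on large buffers, so the hashing genuinely runs in parallel.
  candidates = large_candidates("/path-to-link-dest/dir")
  with ThreadPoolExecutor(max_workers=4) as pool:
      digests = dict(zip(candidates, pool.map(file_digest, candidates)))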