Bug 12530 - [REQ] Improve fuzzy using files being uploaded
Summary: [REQ] Improve fuzzy using files being uploaded
Status: NEW
Alias: None
Product: rsync
Classification: Unclassified
Component: core
Version: 3.1.2
Hardware: All All
Importance: P5 normal
Target Milestone: ---
Assignee: Wayne Davison
QA Contact: Rsync QA Contact
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-01-19 08:05 UTC by Ben RUBSON
Modified: 2023-10-17 16:21 UTC (History)

Description Ben RUBSON 2017-01-19 08:05:33 UTC
Hello,

Let's imagine the sender is uploading a bunch of files which are quite similar.
For example, the following dir :
/directory
|-backup1.iso
|-backup2.iso
|-backup3.iso
|-backup4.iso
|-backup5.iso

At the moment, if no fuzzy basis file is found on the receiver at the very beginning of the transfer, every file is uploaded in full.
The goal would be to improve rsync so that, once the first file has been uploaded, the fuzzy algorithm can consider this new file as a basis file for the other new files that follow. The same applies once the second file has been uploaded, and so on.
Perhaps this could be done once and for all at the very beginning of the transfer, by also feeding the fuzzy algorithm with the list of files to be uploaded (sent by the sender) and their properties.
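
To make the idea concrete, here is a minimal sketch in Python of how the set of basis candidates could grow during a transfer. It is not rsync code: the function names `pick_fuzzy_basis` and `plan_transfer`, the `min_score` threshold, and the use of `difflib` name similarity are all assumptions made for illustration and do not reflect rsync's actual scoring.

```python
from difflib import SequenceMatcher


def pick_fuzzy_basis(new_name, candidates, min_score=0.5):
    """Return the candidate whose name is most similar to new_name, or None.

    Name similarity is a stand-in heuristic (an assumption of this sketch).
    """
    best, best_score = None, min_score
    for name in candidates:
        score = SequenceMatcher(None, new_name, name).ratio()
        if score > best_score:
            best, best_score = name, score
    return best


def plan_transfer(incoming, existing):
    """Map each incoming file to a basis file (or None for a full upload).

    Candidates are the files already on the receiver plus every file
    completed earlier in the same run.
    """
    available = list(existing)
    plan = {}
    for name in incoming:
        plan[name] = pick_fuzzy_basis(name, available)
        # Once uploaded, the file becomes a candidate for later files.
        available.append(name)
    return plan
```

With an empty destination and the directory above, only `backup1.iso` would have no basis; each later file would find an earlier upload to use as its basis.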

This would speed up transfers whenever a batch of new, similar files is sent to a destination that does not yet hold a suitable basis file.

Thank you very much !

Ben
Comment 1 Ulrich Sibilller 2023-10-17 16:21:58 UTC
I would go one step further: rsync should not only look for a file to reference in fuzzy mode but also take into account what it has transferred previously. Instead of throwing away the information it gathered for the first file once it is done, it could keep the transfer information and reuse it. It would then
a) automatically fulfill your request, since the information for the first iso would already be available
b) not rely on similarity by size and/or name only, but on the data itself!

Of course this would increase memory usage, but the user can decide whether that is worth it.
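
A rough sketch of what "keeping the transfer information" could look like, written in Python for illustration only. The `BlockIndex` class, its method names, and the fixed, aligned block layout are assumptions of this sketch; rsync itself matches blocks at arbitrary offsets with a rolling checksum, which is not reproduced here.

```python
import hashlib


class BlockIndex:
    """Remembers block digests of files already transferred in this run."""

    def __init__(self, block_size=4):
        # Tiny default block size, chosen only to keep the example readable.
        self.block_size = block_size
        self.seen = {}  # digest -> (file name, offset) of first occurrence

    def _blocks(self, data):
        for off in range(0, len(data), self.block_size):
            yield off, data[off:off + self.block_size]

    def add(self, name, data):
        """Record every block of a file that has been transferred."""
        for off, block in self._blocks(data):
            digest = hashlib.sha256(block).digest()
            self.seen.setdefault(digest, (name, off))

    def split(self, data):
        """Return (known, literal) for a new file.

        known:   list of (offset, (file name, offset)) for blocks seen before
        literal: list of offsets whose data would still have to be sent
        """
        known, literal = [], []
        for off, block in self._blocks(data):
            ref = self.seen.get(hashlib.sha256(block).digest())
            if ref is not None:
                known.append((off, ref))
            else:
                literal.append(off)
        return known, literal
```

The index grows with every block transferred, which is the memory cost mentioned above; a real implementation would presumably need a user-controlled limit.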