Bug 14109 - Support Custom Fuzzy Basis Selection Algorithm
Summary: Support Custom Fuzzy Basis Selection Algorithm
Status: NEW
Alias: None
Product: rsync
Classification: Unclassified
Component: core
Version: 3.1.3
Hardware: All All
Importance: P5 normal
Target Milestone: ---
Assignee: Wayne Davison
QA Contact: Rsync QA Contact
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-09-01 22:55 UTC by Lonnie Best
Modified: 2019-09-01 23:51 UTC
Description Lonnie Best 2019-09-01 22:55:39 UTC
The --fuzzy argument does an incredible job at syncing large files when it chooses the correct fuzzy basis.

However, the default "fuzzy-basis-destination-file-selection algorithm" is not correct for every situation, so I propose the ability to pass an argument to the fuzzy parameter that specifies which "fuzzy-basis-destination-file-selection algorithm" to use.

I've posted a question detailing my needs here:
https://unix.stackexchange.com/questions/538548/

In short, some of the files in my source-folder are 200GB in size. When rsync chooses the correct existing-destination-file for its "fuzzy basis", my synchronization (of these files) seems magical in terms of how little data gets transferred over the wire.

However, when it chooses the wrong existing-destination-file as the source file's fuzzy basis, the data transfer can take days.

Look at the filenames in both my source-folder and destination-folder (below):

	# Source Folder's new files (from today's on-site backup):
	file100-2019_09-01_12am.log
	file100-2019_09-01_12am.lzo
	file101-2019_09-01_12am.log
	file101-2019_09-01_12am.lzo
	file102-2019_09-01_12am.log
	file102-2019_09-01_12am.lzo

	# Destination-Folder's old files (from yesterday's off-site backup):
	file100-2019_08-31_12am.log
	file100-2019_08-31_12am.lzo
	file101-2019_08-31_12am.log
	file101-2019_08-31_12am.lzo
	file102-2019_08-31_12am.log
	file102-2019_08-31_12am.lzo

In my case, the fuzzy-basis-selection-algorithm needs to select the existing destination-file that:

1) Has the same file extension as the source file
2) Begins with the most consecutively identical characters as the source file
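
The two rules above can be sketched in a few lines of Python. This is only an illustration of the requested selection logic, not rsync's implementation and not an existing rsync option; the names `common_prefix_len` and `pick_basis` are made up for this example, and ties are resolved in favor of the first candidate seen.

```python
import os
from typing import Iterable, Optional


def common_prefix_len(a: str, b: str) -> int:
    """Count how many leading characters a and b share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def pick_basis(source_name: str, candidates: Iterable[str]) -> Optional[str]:
    """Pick the candidate with the same extension as source_name and the
    longest shared leading run of characters (hypothetical helper).

    Returns None when no candidate has the same extension or when no
    candidate shares even one leading character.
    """
    src_ext = os.path.splitext(source_name)[1]
    best: Optional[str] = None
    best_len = 0
    for cand in candidates:
        # Rule 1: the extension must match exactly.
        if os.path.splitext(cand)[1] != src_ext:
            continue
        # Rule 2: prefer the longest identical prefix.
        n = common_prefix_len(source_name, cand)
        if n > best_len:
            best, best_len = cand, n
    return best


destination = [
    "file100-2019_08-31_12am.log",
    "file100-2019_08-31_12am.lzo",
    "file101-2019_08-31_12am.log",
    "file101-2019_08-31_12am.lzo",
    "file102-2019_08-31_12am.log",
    "file102-2019_08-31_12am.lzo",
]
print(pick_basis("file101-2019_09-01_12am.lzo", destination))
```

With the listing above, the `.lzo` file for `file101` is matched to yesterday's `file101` `.lzo` file, because the other `.lzo` candidates stop matching at the sixth character.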

The default algorithm does not meet these requirements.

Therefore, I propose the ability to pass an argument that allows the user to specify non-default fuzzy basis selection algorithms.

There should probably be a few common, baked-in ones (as time goes on) that you can choose from by name. It would be even more flexible if rsync also permitted the user to pass a file into the command that specifies a custom "fuzzy-basis-destination-file-selection algorithm".

Naturally, if these features are granted, the documentation would also need to be updated to give guidance on specifying these things.

If these things are already implemented, and I have somehow overlooked them, would you kindly post an answer to my question here:
https://unix.stackexchange.com/questions/538548/
Comment 1 Kevin Korb 2019-09-01 23:15:33 UTC
Just a quick thought on a workaround...

It would be trivial to figure out the new name and best old file in a script.  So, you could hard link the best old file to the new file name.  Then rsync wouldn't even need --fuzzy to find it.
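
A minimal sketch of that workaround in Python, assuming it runs on the destination side before rsync is started. The function name `link_best_basis` is hypothetical, and the matching rule (same extension, longest shared prefix) is taken from the description above rather than from rsync itself.

```python
import os
from typing import Optional


def link_best_basis(dest_dir: str, new_name: str) -> Optional[str]:
    """Hard-link the best-matching old file in dest_dir to new_name.

    Returns the old file name that was linked, or None if nothing was done.
    Hypothetical helper, not part of rsync.
    """
    target = os.path.join(dest_dir, new_name)
    if os.path.lexists(target):
        return None  # never replace something that already exists

    ext = os.path.splitext(new_name)[1]
    best: Optional[str] = None
    best_len = 0
    for old in sorted(os.listdir(dest_dir)):
        if os.path.splitext(old)[1] != ext:
            continue  # extension must match
        if not os.path.isfile(os.path.join(dest_dir, old)):
            continue  # skip directories and other non-regular entries
        # os.path.commonprefix compares character by character.
        n = len(os.path.commonprefix([new_name, old]))
        if n > best_len:
            best, best_len = old, n

    if best is None:
        return None
    os.link(os.path.join(dest_dir, best), target)
    return best
```

After linking, a plain rsync run finds a file with the new name already in place and uses it as the basis. By default rsync writes the updated data to a temporary file and renames it into place, so yesterday's file should stay intact; `--inplace` would instead overwrite the data shared by both names, so it is best avoided with this approach.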
Comment 2 Lonnie Best 2019-09-01 23:51:37 UTC
Thanks. Yeah, that's probably what I'll do. I may even write the script so that it does some tasks in parallel (running multiple rsync commands at the same time).

The current default "fuzzy-basis-destination-file-selection algorithm" selects the correct file most of the time. Maybe the reason it didn't today is that it is the first day of a new month, which made the file names too different. I'm not sure.

The --fuzzy argument is really awesome and it is just a hair away from being exactly what I need for handling things with one command at the folder-level. If I could only modify the file-selection algorithm, it would be perfect.

Until then, I just have to write a script instead of being able to handle this within the command.