Bug 13071 - [PATCH] Transfer a resumed --partial-dir file in-place
Summary: [PATCH] Transfer a resumed --partial-dir file in-place
Status: RESOLVED FIXED
Alias: None
Product: rsync
Classification: Unclassified
Component: core (show other bugs)
Version: 3.2.0
Hardware: All All
: P5 normal (vote)
Target Milestone: ---
Assignee: Wayne Davison
QA Contact: Rsync QA Contact
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-10-04 18:17 UTC by Ben RUBSON
Modified: 2020-05-01 15:49 UTC (History)
1 user (show)

See Also:


Attachments
Allow partial-dir with inplace (5.46 KB, patch)
2017-10-04 18:17 UTC, Ben RUBSON
no flags Details
A simpler patch that just uses in-place updates of partial-dir files when possible (8.92 KB, patch)
2020-04-26 19:47 UTC, Wayne Davison
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Ben RUBSON 2017-10-04 18:17:36 UTC
Created attachment 13649 [details]
Allow partial-dir with inplace

Hello,

Here is a patch to allow partial-dir with inplace.

Why ?

Quoting modified man-page :
If --inplace is used with --partial-dir, or with any option implying it (--delay-update), the inplace behavior will take place in the partial directory, each file being moved to its final destination once completed.

It really helps when one want to keep partial files but don't want rsync server to duplicate them remotely when transfer is restarted. Partial files will then be used "inplace", in the partial directory.
Really useful in conjunction with for example --link-dest.
More generally this helps saving space / dealing with short free-space targets.

Could we then think about reviewing and merging this please ?

Thank you very much !

Ben

(this patch has been written so that current client versions can easily benefit from this using --temp-dir to indicate their partial directory, server code being really easy to modify to take tmpdir as partial_dir)
Comment 1 Ben RUBSON 2018-02-13 20:15:31 UTC
Tested and still correctly works with rsync 3.1.3.
Thx !
Comment 2 Ferran 2019-04-10 19:01:41 UTC
I've come across the very same idea exposed in this request and was about to code a patch for it. A quick search in the bug tracker showed up this patch, which looks very sensible to me, and I would like to show my support for it.

Aside from the clear space-saving benefits, being able to transfer files directly into the partial directory will speed-up resuming an interrupted transfer by saving an unnecessary copy (from the partial dir to the .hidden.file.hgsxjk). Saving this copy can make the difference when the files are large.

However, I don't get the benefits of using this option in conjunction with --link-dest, could you give us some more detail?

I've just tested the patch myself in the development version (3.1.4dev) and the feature works as expected. The patch looks quite simple and self-contained, and doesn't seem to interfere much with other features. Is there any reason (aside from time limitations) why this patch hasn't been landed yet?

Thanks!,
Ferran
Comment 3 Ben RUBSON 2019-04-14 14:35:14 UTC
Thank you for your feedback Ferran !

I'm not sure about why this patch hasn't been landed yet :)

Regarding benefits of using this option in conjunction with --link-dest.
Well, certainly same benefit as without :)
But imagine you are in a daily backup scheme, using --link-dest.
Everyday you have a new timestamped backup directory.
Day n, backup window is over, but you did not finish to transfer a huge file.
Thanks to this patch, day n+1 will resume the huge file flawlessly, inplace.

Hope to see this merged :)
Comment 4 Ben RUBSON 2019-10-03 14:34:11 UTC
Hi Wayne,
Let's review & merge this new feature ?
Thank you very much !
Comment 5 Wayne Davison 2020-04-26 19:46:49 UTC
I dislike confusing what --inplace means. While an expert will know that it changes meaning in different circumstances, I think it best that it always means that the final destination file should be updated in-place, and thus rsync should tell the user when it can't be used in combination with other options.

As for the changes in this patch, I see 2 aspects of its interaction with the partial-dir.  First, what happens when transferring a file that has a partial file that already exists? And second, is the partial-dir used to store all the temp files that are created for a transfer rather than the normal tmp file being created and the partial-dir only getting created if the transfer is interrupted?

As for the first part, I'm thinking that it would be fine if we change how a partial-dir file is handled to always update it in-place, especially since there is a good chance that the transferred file's early data is OK. The hard part of that is that the sender needs to know which files are in-place transfers so that it can restrict the checksum matching algorithm, so we must know that the sender is new enough to respond to a FNAMECMP_PARTIAL_DIR value as an indicator that the file is going to be handled in-place.  This could be done via a compat flag character sent via -e or by forcing it to be an option that an older sender would reject as unknown.

As for the second part, I'm not sure how much of a need there is for changing the default behavior of the tmp file creation, and would probably not include that change.  Also, the patch has an error in it where files in a subdir of the transfer do not get created if the partial-dir is a relative path instead of an absolute path (this is due to its use of "dummy" in the call that creates the dir).

One interesting aspect of the patch is that it allows the --append option to be used on partial files.  This could be useful if you're 100% certain that the huge partial file is still a valid prefix of the source file.  However, specifying --append with a --partial-dir would also affect how non-partial-dir files get transferred, and is thus not a good way to affect just the continuing of a partial-dir file's contents.  If we want appending of partial file data, it would be better to add a --partial-append option and have both sides only apply append behavior to the FNAMECMP_PARTIAL_DIR transfers.

Finally, since this patch sets the inplace flag for all files, it also has the side effect that even non-partial files get the limited checksum matching rules, which can make some transfers of shifted data much less efficient.

I'm looking at a simpler change which I will attach as a patch. I would appreciate some feedback if the simpler behavior is enough for what you wanted from this change.
Comment 6 Wayne Davison 2020-04-26 19:47:51 UTC
Created attachment 15941 [details]
A simpler patch that just uses in-place updates of partial-dir files when possible
Comment 7 Ben RUBSON 2020-04-27 17:13:13 UTC
Wayne, thank you very much for your detailed answer and for your patch proposal.
Let me go through this and come back to you shortly.
Comment 8 Ben RUBSON 2020-04-27 21:08:58 UTC
Wayne, I then just tested your patch, and it works as expected.

One thing perhaps, just to be sure.
I used an absolute partial-dir path during my tests.
It is not deleted after the partial file has been completed and moved to its destination place.
Is it intended ?
Perhaps you assume that this absolute partial-dir could contain other partial files, or even non rsync-related files ?
I'm OK with this behavior, IMO more secure :)

Your patch then re-creates the behavior I was looking for / proposed through my patch.
Mine was intended to make this feature "compatible" with older clients, which made it a little bit more complex...
Yours is more simple, which is better, perfect then, many thanks for having looked into this !

To go further, some reflexions.
Let's imagine we use --link-dest and --partial-dir, with this current patch.
File2, size 100MB, is sending, while comparing to a nice basis File1 (let's imagine almost perfect matching) found in one of the link-dest directories.
Transfer aborts.
Partial File2, 10MB, is moved into partial-dir.
Transfer restarts, and File2 is sent again, this time comparing to the partial File2 (yes this is what we want here, perfect).
But then, perhaps we could have been faster comparing again to File1 instead ? :)
Hard choice, and there's no way to know...
I tried to address this in the following patch :
https://bugzilla.samba.org/show_bug.cgi?id=13083
In this situation, the patch takes both files as basis files, which then can speed-up the transfer.
Of course it depends on files' size, checksum speed, remaining data to send etc...
The main avantage is in the situation above, where there's a lot of remaining data to send.

Thank you again !

Ben
Comment 9 Wayne Davison 2020-04-29 23:56:31 UTC
Yes, an absolute-path partial dir is not removed by rsync.  I'll add a mention of that to the manpage.

I'm committing the simple version of the patch.  Thanks for checking it out!

As to the question about which basis file is better to use when a partial file exists, that is indeed a hard one to answer since it depends a lot on how much of a transfer completed and how much matching data exists in each file.  So far I've left it up to the user to remove a partial file that would make their transfer worse off, but it might be possible to come up with a simple heuristic when creating the partial file to guess if we think we should save it or not -- something based on the partial file size vs final file size and also how much of its data was copied literally vs copied from the current basis file.  That seems like it would be pretty useful to me, especially if it erred on the side of only removing the partial file if it was very clear it was not going to be useful.
Comment 10 Ben RUBSON 2020-05-01 15:49:58 UTC
Many thanks Wayne for the merge !

Your idea about trying to guess, when current transfer aborts, whether or not the partial file will help for the next transfer more than the basis file, sounds rather good.

But the heuristic might not be so easy.

There's almost no chance for a transfer to abort after having copied matching data only (in this case we would simply discard the partial file).
As you said we would then have to consider amount of literal data.
If we discard a partial file containing literal data, we must be sure that, at the end of the next transfer, we will not be in the same situation / we will have made some progress. Otherwise there's the risk to loop again and again with this same file.
And if we don't discard the partial file, because we think it contains enough literal data, we will literally copy the rest of the file, whereas perhaps almost of this data was in the basis file. Leading to longer transfer (up to some nights instead of some minutes).
I think for example about huge files, where the total % of literal data is rather small, but which takes time to copy. And of course this literal data could be anywhere in the file, at the begining or at the end... This is the extreme case I agree...

This is to address this sort of use case that I used both partial+basis in the linked patch above.
It works fine, but implementation might not be the best one (especially with its "dummy 0 blocks", might be avoided if sender properly knows that 2 basis are used).
Of course this solution has drawbacks also, especially with huge files when transfer is almost achieved. We will then checksum 2 huge files, which takes time and resources.