Bug 3392 - fuzzy misbehaving if source is a file
Summary: fuzzy misbehaving if source is a file
Status: CLOSED FIXED
Alias: None
Product: rsync
Classification: Unclassified
Component: core
Version: 2.6.6
Hardware: All Linux
Importance: P3 normal
Target Milestone: ---
Assignee: Wayne Davison
QA Contact: Rsync QA Contact
Reported: 2006-01-10 10:02 UTC by Egmont Koblinger
Modified: 2009-11-12 22:16 UTC

Attachments
main.c from my custom rsync, showing rewritten get_local_name (32.31 KB, text/x-csrc)
2006-01-10 17:22 UTC, Matt McCutchen

Description Egmont Koblinger 2006-01-10 10:02:54 UTC
I run rsync 2.6.6 on both the server and the client and perform a download:
rsync --fuzzy --other-options... rsync://some/url /local/directory
Both the remote and the local URLs are absolute paths.

If the remote URL is a directory and I perform recursive synchronization
(e.g. "-a") then the --fuzzy option works perfectly just as I expect it,
and it co-operates nicely with the --compare-dest or --copy-dest option.

However, if the remote URL is a single plain file, then --fuzzy misbehaves.
No matter what the remote or local path is, and whether or not I specify a
--compare-dest or --copy-dest option (whatever its value), these are all
completely ignored, and rsync searches for the local file with the most
similar name in the current directory of the rsync process. This can be seen
from strace's output (only "." is opened as a directory), from the "skipping
directory xyz" messages that mention the subdirectories of the current dir,
and from the fact that if I place the similar file there then rsync is
much faster (i.e. it finds it there).

Instead, rsync should look for similar filenames under the target directory
(the directory component of the local path given in the last argument), or
under the directories given by --compare-dest or --copy-dest.

(Also, I think that generally if only absolute paths are given to rsync, it
should do nothing with the current directory, it should be irrelevant what it
is.)
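
The lookup requested above can be illustrated with a minimal sketch. It assumes that a fuzzy basis file is chosen purely by name similarity within one directory. The helper names closest_name and fuzzy_basis are hypothetical, and difflib's ratio is only a stand-in for rsync's own scoring, which this report does not describe.

```python
import difflib
import os


def closest_name(target_name, directory):
    """Return the regular file in `directory` whose name best matches target_name.

    difflib's similarity ratio is a stand-in chosen for illustration only; it
    is not rsync's own fuzzy scoring.
    """
    best_name, best_score = None, 0.0
    for entry in sorted(os.listdir(directory)):
        if not os.path.isfile(os.path.join(directory, entry)):
            continue  # subdirectories are skipped, only plain files qualify
        score = difflib.SequenceMatcher(None, target_name, entry).ratio()
        if score > best_score:
            best_name, best_score = entry, score
    return best_name


def fuzzy_basis(dest_path):
    """Scan the directory component of dest_path, as the report requests."""
    dest_dir, name = os.path.split(os.path.abspath(dest_path))
    match = closest_name(name, dest_dir)
    return os.path.join(dest_dir, match) if match else None
```

The point of the sketch is that the result depends entirely on which directory is handed to closest_name: scanning the process's current directory instead of the destination's directory component gives an unrelated answer.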
Comment 1 Matt McCutchen 2006-01-10 17:22:24 UTC
This behavior is a consequence of the strange logic in get_local_name in main.c.  If the destination path is given as a file, then rsync uses a "local name" and accesses the destination file by its full path rather than first changing to the containing directory of the destination file.

When I was writing my custom rsync, I found that the behavior of get_local_name fouled up default ACL observance, so I rewrote and heavily commented get_local_name.  The upshot is that my rsync changes to the containing directory when receiving no matter what.  Please consider making the same change in the official rsync.  This change may help the situation with --fuzzy to some degree, but issues remain, such as whether the basename of the source path or the destination path is used to search for a fuzzy basis file.

My custom rsync is available here:
    http://mysite.verizon.net/hashproduct/myrsync/
Comment 2 Matt McCutchen 2006-01-10 17:22:36 UTC
Created attachment 1661
main.c from my custom rsync, showing rewritten get_local_name
Comment 3 Wayne Davison 2006-01-15 00:27:25 UTC
As Matt noted, the fuzzy option was expecting the current directory to be the parent directory of the destination file, and this wasn't true for a single file being copied to a new name.

I have checked in a fix that makes rsync always push_dir() into the destination file's parent directory.  (Thanks for the attachment, Matt -- I used some of your comments and the general logic from get_local_name(), though I rewrote it.)

One other nice side-effect is that rsync gives a better error now if you copy a single file to a /totally/bogus/path/name.
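
The idea of always entering the destination file's parent directory can be sketched as follows. This is a simplified Python illustration, not the C code that was checked in: the helper name enter_dest_parent is hypothetical, and the real get_local_name() also has to handle destinations that are directories, which is ignored here.

```python
import os
import tempfile


def enter_dest_parent(dest_path):
    """chdir into the parent directory of dest_path and return the basename.

    Simplified stand-in for "push_dir() into the parent directory"; it only
    covers the single-file case discussed in this bug.
    """
    parent, name = os.path.split(os.path.abspath(dest_path))
    os.chdir(parent)  # fails early with an OSError if the parent is bogus
    return name


if __name__ == "__main__":
    start = os.getcwd()
    with tempfile.TemporaryDirectory() as tmp:
        try:
            name = enter_dest_parent(os.path.join(tmp, "newname"))
            # From here on the file is addressed by its basename only, so any
            # same-directory scan (such as a fuzzy search) sees its siblings.
            print(name, os.getcwd())
        finally:
            os.chdir(start)
```

Because the chdir happens before anything else, a nonexistent parent directory is reported immediately, which matches the better error message mentioned above.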
Comment 4 Matt McCutchen 2006-01-15 08:19:45 UTC
Nice.  It occurs to me that maybe the first call to do_stat should be changed to link_stat(dest_path, &st, keep_dirlinks) in order to obey --keep-dirlinks when finding the top-level target directory.
Comment 5 Egmont Koblinger 2006-03-13 07:12:28 UTC
Reopening, since it's not okay in 2.6.7 (though definitely different than it
was in 2.6.6).

In 2.6.7, the --fuzzy option causes a search for a similar filename in the
target directory of the new file. This still means the value of the
--compare-dest or --copy-dest option is ignored. Rsync should search for
similar files in the directory specified by --co{mpare,py}-dest.
Comment 6 Wayne Davison 2006-03-13 09:57:13 UTC
Fuzzy is already a very expensive operation, and making it even more expensive is not a good idea, IMO.  I want to leave it as it has always been defined: performing a fuzzy search in the destination directory.
Comment 7 Egmont Koblinger 2006-03-13 10:32:14 UTC
The fuzzy option is much less expensive than downloading a file from scratch
instead of using a similar local file to compare to.

My goal is, by the way, not more than maintaining rsync support for apt (the
front-end for dpkg), using up-to-date tools and possibly mainstream solutions
(that is, as few patches as possible). Please read the original report here:
http://lists.debian.org/debian-devel/2003/07/msg00462.html

The whole design of "apt" forces me into an environment where I have to
download the new package into a different directory than where old packages
reside, but I still want to make use of the fuzzy option so that people don't
have to download 120 MB once a typo is fixed in the openoffice package.

With rsync 2.6.6, despite the bug I reported, I still had an extremely easy
workaround: I just had to put a chdir() call between the fork and exec in apt.
Because this "bug" of rsync 2.6.6 is "fixed" now, it still doesn't work out
of the box, and now I don't even have such a simple workaround. So after all,
the situation became worse.
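
For illustration, the 2.6.6-era workaround amounts to starting the child process in the directory that holds the old files. The sketch below shows that pattern in Python rather than in apt's own code; run_in_dir is a hypothetical helper, and the cwd= argument of subprocess.run plays the role of the chdir() between fork and exec.

```python
import subprocess
import sys
import tempfile


def run_in_dir(argv, workdir):
    """Run argv with workdir as the child's current directory.

    subprocess changes the working directory to `cwd` before executing the
    child, so the parent's own current directory is left untouched.
    """
    return subprocess.run(argv, cwd=workdir, check=True,
                          capture_output=True, text=True)


if __name__ == "__main__":
    # Stand-in child that only prints its working directory. A real caller
    # would pass an rsync command line and the old package directory here.
    with tempfile.TemporaryDirectory() as old_dir:
        result = run_in_dir(
            [sys.executable, "-c", "import os; print(os.getcwd())"], old_dir)
        print(result.stdout.strip())
```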

Please read the linked mail and try to understand my needs and its reasons
(which are not artificial, they were brought up by the real world).
I hope you will understand why it would be so important for all the users of
our distribution that these two options worked together perfectly. But even
if I failed to convince you, I'm reopening this bug since in this case rsync
should explicitly refuse the --compare-dest and similar options with an error
message if --fuzzy is also specified.
Comment 8 Wayne Davison 2006-03-13 10:57:28 UTC
The fuzzy algorithm is very expensive the more files it rates, so in larger transfers it would balloon into way too many fuzzy computations.
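
The growth described here can be made concrete with a rough sketch. It assumes that every missing file is rated against every candidate name in the scanned directory; rate_missing_files is a hypothetical helper, and difflib's ratio again stands in for rsync's real scoring, so only the count of ratings is meaningful.

```python
import difflib


def rate_missing_files(missing_names, candidate_names):
    """Pick the closest candidate for each missing name and count the ratings."""
    ratings = 0
    picks = {}
    for missing in missing_names:
        best, best_score = None, 0.0
        for candidate in candidate_names:
            score = difflib.SequenceMatcher(None, missing, candidate).ratio()
            ratings += 1  # one rating per (missing file, candidate) pair
            if score > best_score:
                best, best_score = candidate, score
        picks[missing] = best
    return picks, ratings


if __name__ == "__main__":
    picks, ratings = rate_missing_files(
        ["a1", "b1", "c1"], ["a0", "b0", "c0", "d0"])
    print(picks, ratings)
```

Under this assumption the number of ratings is the number of missing files multiplied by the number of candidates, so enlarging the candidate set multiplies the work for every missing file.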
Comment 9 Wayne Davison 2006-03-13 11:20:59 UTC
One way to manually optimize a transfer for a new directory that is related to an old directory is to first copy the old directory using --link-dest, and then copy the new directory using --fuzzy:

rsync -av --link-dest=../olddir olddir/ dest:newdir
rsync -av --fuzzy --delete-after newdir dest:

The first rsync run will just hard-link everything into the newdir destination as long as you have a local copy of the identical olddir files.
Comment 10 Egmont Koblinger 2006-03-14 05:46:36 UTC
Dear Wayne,

To comment #8:

If you'd say "it's too hard to fix it", "we have no resources to implement
this" or something similar, then I'd most likely accept it. But your argument
that fuzzy is expensive is plainly bullshit, for two reasons:

First, I don't know whether you have used this feature; I have, and
it saved me many hours by being able to synchronize 2 GB of data behind a
1 Mbit/sec ADSL line in half an hour rather than in 4.5 hours. And if I save 4
hours in my life then I really don't care whether rsync needs 1 second or
1 minute of CPU time. Actually I never had a noticeable load caused by rsync.
I hope you agree that 4 hours of wall clock time is much more expensive than
several seconds (or maybe a few minutes) of CPU time.

Second, the number of fuzzy computations doesn't depend on whether you compare
to the file listing of the same directory, or to the file listing of another
directory. So if we'd accept your argument that fuzzy needs too much CPU
resources, then the whole fuzzy option should completely be dropped from rsync,
even if someone wants to use it without the --copy-dest option. I hope this is
not the way to go.

So we're talking about two features which should be completely orthogonal to
each other, but still they don't work together, for apparently no sane reason.
This is just as stupid as if, let's say, you would be unable to preserve
permissions (-p) when --delete is in effect.

To comment #9:

I'll try it but I'm not sure it will be as simple as you imagine it. Apt
performs other operations on its cache, for example moving a file from
"newdir" to "olddir", and I don't know if it will fail if a hardlink is
already present there.
Comment 11 Wayne Davison 2006-03-15 10:27:39 UTC
I appreciate that in your specific circumstance an enhanced behavior of --fuzzy would be useful. My testing of the --fuzzy option has also included large transfers with many missing files where it would be a huge detriment. For instance, I have tried to use --fuzzy to transfer a large Maildir hierarchy, and the copy into a very large and active folder was so CPU intensive that it bogged the transfer down instead of speeding it up. This is because every missing file requires its own separate fuzzy computation, and that computation gets slower the more files there are in the destination directory. So, the --fuzzy option is already too CPU intensive for its own good in some circumstances.

Given that, I don't want the file-set that fuzzy compares against to be made larger by default.  It could possibly be made an optional behavior, e.g. requested by doubling the --fuzzy option, but that suggestion would best be made in a separate enhancement request (since it's best to target a bug-report at a specific issue, and this specific issue of the wrong directory being scanned for fuzzy matches has been fixed).

Finally, some discussion of bug-tracking netiquette: please note that reopening a bug report twice in a row is considered to be a very rude action. While that might not make much sense logically, rudeness is more emotional than logical. I was once chastised for reopening a bug just once to ask a question that I thought might not have been considered in the closing of the bug. At the time I thought that the fellow was being overly dramatic, but after some experience of being on the other side, I realized that the reopening action actually conveys meaning that may not be intended by the reopener. Thus, a good rule of thumb when dealing with a bug that is not in obvious need of being reopened is this:

Add a comment to the closed bug raising the issue. Check for a consensus for reopening or starting a new bug report.  If there is no response, reopening might be needed to ensure that the issue is not forgotten.

OK?  Thanks for your input, and feel free to open an enhancement request for the --fuzzy option if you'd care to do so.
Comment 12 Egmont Koblinger 2006-03-16 04:52:04 UTC
I do agree that there are circumstances when fuzzy doesn't speed up things,
or actually causes a huge performance regression. On the other hand, as I
described, there are also cases when it really helps a lot. It should be up to
the users to choose whether to use it or not.

You said:
> I don't want the file-set that fuzzy compares against to be made
> larger by default.

I perfectly agree. I never talked about the _default_. I talked about the case
where --compare-dest=... is also specified in addition to --fuzzy. This is
clearly not the default: in this case the user explicitly asks for a larger
file set to be fuzzy-compared to, hopefully knowing its pros and cons.

About netiquette:

I have seen several netiquettes, mostly about e-mail, but I can't remember
seeing a bug-reporting netiquette anywhere. Please point me to a URL; I'll be
happy to read it and follow its guidelines.

First I reopened the bug since it was closed falsely, the bug mentioned in the
original report is _not_ fixed (comment #5).

For the second time (comment #7) I reopened it to (1) note that rsync doesn't
mention its current behavior in its docs and doesn't refuse options that don't
work together, and (2) to state that your arguments containing "IMO" and
"I want to leave it..." are not quite strong arguments. For (1) I could have
opened another report, I think it's quite a matter of taste. Some prefer to
split each and every single step of a bigger problem set to a different issue,
some rather want to see "co-operation of --compare-dest and --fuzzy" as one
large bug that says: all details of this problem set should be solved. Seems
that you belong to the first group, while I belong to the second one. Note
that opening a new issue has at least one drawback: the cc list is lost.
I think anyone who was interested in the first report is still interested in
how fuzzy and compare-dest will work together.

By the way: there are other ways to close a bug, there is INVALID, there is
WONTFIX etc. These are not in netiquette, these are in the manual and UI of
bugzilla. Still you chose FIXED. Why? Please read the _original_ report once
again. It is _not_ fixed.

The third time I commented (comment #10), _before_ you told me it's rude to
reopen bugs, I did _not_ reopen it, nor do I do so now.

Do you think that closing a bug twice in the middle of a conversation, while
that bug is not yet fixed, is not a rude action at all? Especially in
comment #6 where you had no real argument at all, just a personal taste?

> Thanks for your input, and feel free to open an enhancement request for the
> --fuzzy option if you'd care to do so.

So rsync not working the way it should is not a bug, and fixing it is only an
enhancement request? And if I open a new bug then it is likely to be fixed
(ohh, sorry, implemented as a new feature) but if I tell about it here then it
will be forgotten? Despite that I still do not ask for anything more than to
fix what I reported in my _original_ submission in this topic? Shall I really
open another report of the same problem that I originally reported here?

Sorry, I got tired of it. I'm happy to help the free software community
anywhere, sometimes with bug reports or enhancement requests, sometimes with
patches, sometimes with new source code, with translations, or a lot of other
stuff, but only as long as I don't keep on hitting walls during my work.
I have no time and no power to try to fight against people who try to offend
my requests or my work. It's clear that you do not want to fix it. I cannot
force you to do so. So just don't fix it. Leave it as it is now. Leave it as
CLOSED _FIXED_ (khmmm). And let's forget this whole issue...
Comment 13 Matt McCutchen 2009-11-12 22:16:03 UTC
I rediscovered this bug just now and thought I would comment on what happened so it doesn't stand as a blemish on the community.

Wayne and Egmont disagreed on the merits of the request for --fuzzy to search --*-dest dirs.  That's perfectly normal.  What isn't normal is the miscommunication about the status of the request.  Wayne wanted it to be filed as a separate ticket, a decision that I would generally grant developers the prerogative to make, but he never came out and said so explicitly until comment #11.  Egmont considered it to be permanently within the scope of this ticket because it was stated in comment #0 and perceived that Wayne was simply trying to bury the issue, hence the reopen battle and the dissatisfaction with the FIXED resolution as expressed in comment #12.

Wayne, if you had stated your desire to separate the issues in comment #6, the onus would have been on Egmont to file the enhancement request separately before proceeding, and the discussion could have proceeded there, perhaps with fervent disagreement and a WONTFIX decision but without acrimony.  I would encourage both reporters and developers to avoid a repeat of this situation by considering, before a repeated reopen or resolution, whether there might be a simple miscommunication at fault.

Note that I went ahead and entered the enhancement as bug 4056 at the time, and it remains open.