2240 – Add last-match/short-circuit processing of include/exclude

Bug 2240 - Add last-match/short-circuit processing of include/exclude

Summary: Add last-match/short-circuit processing of include/exclude

Status:	CLOSED WONTFIX

Alias:	None

Product:	rsync
Classification:	Unclassified
Component:	core (show other bugs)
Version:	2.6.3
Hardware:	All FreeBSD

Importance:	P3 enhancement (vote)
Target Milestone:	---
Assignee:	Wayne Davison
QA Contact:	Rsync QA Contact

URL:	http://cvs.openpkg.org/getfile/openpk...
Keywords:

Depends on:
Blocks:

Reported:	2005-01-13 07:47 UTC by Ralf S. Engelschall
Modified:	2006-03-12 02:56 UTC (History)
CC List:	0 users

See Also:

Attachments
rsync.patch.lastmatch (5.45 KB, patch) 2005-01-13 07:48 UTC, Ralf S. Engelschall	no flags	Details
Add optional last-match parsing (1.28 KB, patch) 2005-02-25 10:00 UTC, Wayne Davison	no flags	Details
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Ralf S. Engelschall 2005-01-13 07:47:23 UTC

I've added a --last-match command line option to RSYNC 2.6.3 which
switches the include/exclude pattern processing from "first-match" (the
default) to "last-match" semantics plus the possibility to specify
short-circuit patterns (which stop processing immediately if matching)
with the operators "-!" and "+!" in addition to the regular operators
"-" and "+". The "last-match" is a super-set of the "first-match"
semantics, providing more flexible include/exclude specifications by
allowing arbitrary nesting of matches.

For patch see the OpenPKG repository:
http://cvs.openpkg.org/getfile/openpkg-src/rsync/rsync.patch.lastmatch

Comment 1 Ralf S. Engelschall 2005-01-13 07:48:21 UTC

Created attachment 887 [details]
rsync.patch.lastmatch

Comment 2 Wayne Davison 2005-01-13 14:26:12 UTC

I'm not totally sure what you mean by "arbitrary nesting of matches", but I'll
assume that you're talking about being able to override the rules of a
.cvsignore file (which is not currently possible) and to string together
multiple generic --include-from/--exclude-from files on the command-line and
have some of the files specify a rule that will override the rules of a later
file (or earlier, depending on the order of the scan). This is something that
was talked about before, and I think that it would be a nice idea to add
something like this to rsync. Your syntax choice looks like a good one.  (I had
previously suggested using "++ pattern" instead of "+! pattern", but I like "+!"
better).

This does mean that an incompatibility in the syntax of exclude files is
introduced (e.g. a file named "+! foo" was not previously special). I'm thinking
that it would be a good idea to make the code escape all names that start with
"+" or "-" (e.g. prefix "+ " or "- ", as appropriate). This would allow the
adding of any non-space suffixes to the initial "+" or "-" to change their
meaning without adding any new incompatibilities in the syntax of the
include/exclude files. I think I'll go ahead and add this to the code that sends
the name over the socket, as it does not interfere with backward compatibility
(and improves forward compatibility).

As for the order of the includes/excludes, the same logic can be implemented in
either order, so we need some other reason besides adding a new short-circuit
syntax to change it. For instance, your patch can be thought of as implementing
first-match on the short-circuit rules, and falling back to last match on the
rest of the rules (and .cvsignore files must get added at the bottom of the list
of rules). In the current rsync order this would be implemented as a priority
last-match of the short-circuit rules, followed by first-match of the normal
rules (and .cvsignore files continue to be added at the top of the list of
rules). So, a --last-match option should only be added if we wish to give the
user the option of writing their rules in the opposite order, and I'm not sure
we need that.

As for the implementation, I'd prefer to see one that doesn't always match every
name against every item in the list (if we can help it). We can do this by
adding a "previous" pointer to the exclude_struct so that it can be scanned in
either order. The code would then scan in one direction for just the
short-circuit rules (if any exist), and then fall back to scanning in the
opposite direction for normal rules. If the --last-match option was still
desired, I would make its only affect be to change the order of how the user's
items get put into the list (so that the same scanning code could be used for
both modes).

I'm currently considering some changes to the include/exclude code: namely a
modified version of the current merge-exclude-file.diff in the patches dir, but
with the syntax of the include-rule lines changed. Thus, the addition of a new
"overriding" include/exclude idiom would go well with this.

Comment 3 Gerald (Jerry) Carter (dead mail address) 2005-01-15 08:35:08 UTC

On Thu, Jan 13, 2005, samba-bugs@samba.org wrote:

>> I'm not totally sure what you mean by "arbitrary nesting of matches", but I'll
>> assume that you're talking about being able to override the rules of a
>> .cvsignore file (which is not currently possible) and to string together
>> multiple generic --include-from/--exclude-from files on the command-line and
>> have some of the files specify a rule that will override the rules of a later
>> file (or earlier, depending on the order of the scan).

Yes. Just a typical example. You have the following tree:

  dir1/           (+)
  dir2/           (+)
    dir21/        (-)
    dir22/        (+)
      file221     (+)
      file222     (-)
      file223     (+)
    dir23/        (-)
  dir3/           (+)

Now consider that everything should be synchronized with dir2/* excluded
except for dir2/dir22 and that dir2/dir22/file222 also should be
not synchronized. With the current "first match" approach shuch a
synchronization cannot be done at all in a generic way. The only
possibility would be to perform multiple individual synchronizations,
but this becomes impossible if at some intermediate level an abitrary
number files or dirs exist. With the "last match" approach such a
synchronization is trivial:

 + *
 - dir2/*
 + dir2/dir22/
 - dir2/dir22/file222

That's especially also the reason why mostly all modern packet filters
or similar access control mechanisms use the "last match" approach
nowadays: it's a super-set of "first match" and with the "short
circuiting" add-on feature it is also as convenient as "first match".

>> [...]
>> As for the order of the includes/excludes, the same logic can be implemented in
>> either order, so we need some other reason besides adding a new short-circuit
>> syntax to change it. For instance, your patch can be thought of as implementing
>> first-match on the short-circuit rules, and falling back to last match on the
>> rest of the rules (and .cvsignore files must get added at the bottom of the list
>> of rules). In the current rsync order this would be implemented as a priority
>> last-match of the short-circuit rules, followed by first-match of the normal
>> rules (and .cvsignore files continue to be added at the top of the list of
>> rules). So, a --last-match option should only be added if we wish to give the
>> user the option of writing their rules in the opposite order, and I'm not sure
>> we need that.
>>
>> As for the implementation, I'd prefer to see one that doesn't always match every
>> name against every item in the list (if we can help it). We can do this by
>> adding a "previous" pointer to the exclude_struct so that it can be scanned in
>> either order. The code would then scan in one direction for just the
>> short-circuit rules (if any exist), and then fall back to scanning in the
>> opposite direction for normal rules. If the --last-match option was still
>> desired, I would make its only affect be to change the order of how the user's
>> items get put into the list (so that the same scanning code could be used for
>> both modes).

Hmmmm... I'm not sure whether I understand what you have in mind here.
In general the "short-circuit" rules are not directly equal to the rules
in the "first match" approach. It is correct that a ruleset consisting
of "short-circuit" rules _only_ effectively degrades the "last match"
approach to a "first match" approach. But in mostly all practical
situations "short-circuit" and regular rules are _intermixed_ and the
position of rules in the "last match" approach is _not_ arbitrary
(neither with nor without "short-circuit" rules). Hence one cannot
perform an arbitrary way to match the rules!

I prefer to think about the rules in a "last match" approach as
sub-terms of a left-associative mathematical expression (evaluated from
left to right) on sets where the used operators are "union" (+) and
"difference" (-) [not intersection!]. Here the expression cannot be just
re-grouped to right-association without changing the resulting set. The
same for the rules in the include/exclude lists.

So, while in the "first match" approach all rules are more or less equal
and can be most of the time matched against in an arbitrary order,
in the "last match" approach the rules can be matched against in the
specified order _only_. This ordering semantics is what makes the "last
match" approach a lot more flexible and powerful. The "short-circuit"
add-on feature is just the usual convenience solution added to "last
match" approaches to make the rule list easier to read (because one do
not have to assemble all "short-circuit" rules at the end of the rule
list and instead can keep them together with other rules applying to
similar elements).

But perhaps I still not understand exactly how you think the two
approaches can be mixed together. IMHO they can't and that was the
reason why I used an explicit option --last-match to switch between the
approaches.

>> I'm currently considering some changes to the include/exclude code: namely a
>> modified version of the current merge-exclude-file.diff in the patches dir, but
>> with the syntax of the include-rule lines changed. Thus, the addition of a new
>> "overriding" include/exclude idiom would go well with this.

I've still not looked at this patch, but whatever you do to improve
rsync, please just keep in mind that at the end what should be possible
is to express the "left-associative mathematical expression on file
sets where the used operators are union and difference". How the
syntax looks, whether an explicit option has to enable it, etc is not
important. But it's important to be able to perform such more powerful
and flexible synchronizations.

                                       Ralf S. Engelschall
                                       rse@engelschall.com
                                       www.engelschall.com

Comment 4 Wayne Davison 2005-01-15 11:04:01 UTC

Please explain why the following reversed list and the current first-match
algorithm don't accomplish the same thing as the last-match example you cited:

 - dir2/dir22/file222
 + dir2/dir22/
 - dir2/*
 + *

I don't see any difference.

Comment 5 Ralf S. Engelschall 2005-01-15 14:47:29 UTC

Ops, yes, you're right. Sorry, my fault in assuming the whole time
that we want to use the usual "out to in" way of describing the
include/exclude list. Certainly a wrong assumption from my side. Yes,
you're fully right: if we open our mind and also accept "in to out"
descriptions (although I find them less intuitive) the
"first match" approach is fully sufficient and as powerful, too.

Remains just the questions how many people find "out to in" and how
many "in to out" descriptions more "natural". I'm a clear fan of "out
to in" thinking when it comes to rsync, because the whole rsync command
is a "synchronize this tree" command plus a few include/exclude options
covering the inside of the tree. And it's the way one is also used to
from mostly all modern ACL implementations I know of.

But that's just my point of view. I don't know what people prefer more.
Perhaps that's the reason why we shouldn't try to merge "first match"
and "last match" approaches in rsync, because one is more suitable for
"in to out" and the other for "out to in" descriptions. Hence both have
their right to exist IMHO. And by keeping "first match" as the default,
rsync also doesn't have any backward compatibility problems.

Comment 6 Wayne Davison 2005-02-25 10:00:32 UTC

Created attachment 985 [details]
Add optional last-match parsing

Here's a simple patch that allows any filter/include/exclude file to be parsed
in a last-match-wins format.  The file must contain this line (on its own):

[last-match]

This causes all the following rules to be reversed as they are read in, making
them behave in a last-match manner.  This shouldn't cause a problem in an
include/exclude file because the above string would have been a bogus character
range that nobody should have specified.

Comment 7 Wayne Davison 2005-02-25 10:05:02 UTC

I considered the short-circuit operators (-! and +!) for filter commands, but I
am not convinced of their utility.  Do you have an example of a real-life
situation that they improve?

Note also that the suggested syntax for the new operators is now being used to
specify a rule that triggers when the pattern fails to match (such as "-! */")
so if this idiom is needed it would need a new syntax (perhaps "-*" and "+*").

Comment 8 Wayne Davison 2006-01-29 01:39:16 UTC

I think we don't need this.