I've added a --last-match command line option to RSYNC 2.6.3 which switches the include/exclude pattern processing from "first-match" (the default) to "last-match" semantics plus the possibility to specify short-circuit patterns (which stop processing immediately if matching) with the operators "-!" and "+!" in addition to the regular operators "-" and "+". The "last-match" is a super-set of the "first-match" semantics, providing more flexible include/exclude specifications by allowing arbitrary nesting of matches. For patch see the OpenPKG repository: http://cvs.openpkg.org/getfile/openpkg-src/rsync/rsync.patch.lastmatch
Created attachment 887 [details] rsync.patch.lastmatch
I'm not totally sure what you mean by "arbitrary nesting of matches", but I'll assume that you're talking about being able to override the rules of a .cvsignore file (which is not currently possible) and to string together multiple generic --include-from/--exclude-from files on the command-line and have some of the files specify a rule that will override the rules of a later file (or earlier, depending on the order of the scan). This is something that was talked about before, and I think that it would be a nice idea to add something like this to rsync. Your syntax choice looks like a good one. (I had previously suggested using "++ pattern" instead of "+! pattern", but I like "+!" better). This does mean that an incompatibility in the syntax of exclude files is introduced (e.g. a file named "+! foo" was not previously special). I'm thinking that it would be a good idea to make the code escape all names that start with "+" or "-" (e.g. prefix "+ " or "- ", as appropriate). This would allow the adding of any non-space suffixes to the initial "+" or "-" to change their meaning without adding any new incompatibilities in the syntax of the include/exclude files. I think I'll go ahead and add this to the code that sends the name over the socket, as it does not interfere with backward compatibility (and improves forward compatibility). As for the order of the includes/excludes, the same logic can be implemented in either order, so we need some other reason besides adding a new short-circuit syntax to change it. For instance, your patch can be thought of as implementing first-match on the short-circuit rules, and falling back to last match on the rest of the rules (and .cvsignore files must get added at the bottom of the list of rules). In the current rsync order this would be implemented as a priority last-match of the short-circuit rules, followed by first-match of the normal rules (and .cvsignore files continue to be added at the top of the list of rules). So, a --last-match option should only be added if we wish to give the user the option of writing their rules in the opposite order, and I'm not sure we need that. As for the implementation, I'd prefer to see one that doesn't always match every name against every item in the list (if we can help it). We can do this by adding a "previous" pointer to the exclude_struct so that it can be scanned in either order. The code would then scan in one direction for just the short-circuit rules (if any exist), and then fall back to scanning in the opposite direction for normal rules. If the --last-match option was still desired, I would make its only affect be to change the order of how the user's items get put into the list (so that the same scanning code could be used for both modes). I'm currently considering some changes to the include/exclude code: namely a modified version of the current merge-exclude-file.diff in the patches dir, but with the syntax of the include-rule lines changed. Thus, the addition of a new "overriding" include/exclude idiom would go well with this.
On Thu, Jan 13, 2005, samba-bugs@samba.org wrote: >> I'm not totally sure what you mean by "arbitrary nesting of matches", but I'll >> assume that you're talking about being able to override the rules of a >> .cvsignore file (which is not currently possible) and to string together >> multiple generic --include-from/--exclude-from files on the command-line and >> have some of the files specify a rule that will override the rules of a later >> file (or earlier, depending on the order of the scan). Yes. Just a typical example. You have the following tree: dir1/ (+) dir2/ (+) dir21/ (-) dir22/ (+) file221 (+) file222 (-) file223 (+) dir23/ (-) dir3/ (+) Now consider that everything should be synchronized with dir2/* excluded except for dir2/dir22 and that dir2/dir22/file222 also should be not synchronized. With the current "first match" approach shuch a synchronization cannot be done at all in a generic way. The only possibility would be to perform multiple individual synchronizations, but this becomes impossible if at some intermediate level an abitrary number files or dirs exist. With the "last match" approach such a synchronization is trivial: + * - dir2/* + dir2/dir22/ - dir2/dir22/file222 That's especially also the reason why mostly all modern packet filters or similar access control mechanisms use the "last match" approach nowadays: it's a super-set of "first match" and with the "short circuiting" add-on feature it is also as convenient as "first match". >> [...] >> As for the order of the includes/excludes, the same logic can be implemented in >> either order, so we need some other reason besides adding a new short-circuit >> syntax to change it. For instance, your patch can be thought of as implementing >> first-match on the short-circuit rules, and falling back to last match on the >> rest of the rules (and .cvsignore files must get added at the bottom of the list >> of rules). In the current rsync order this would be implemented as a priority >> last-match of the short-circuit rules, followed by first-match of the normal >> rules (and .cvsignore files continue to be added at the top of the list of >> rules). So, a --last-match option should only be added if we wish to give the >> user the option of writing their rules in the opposite order, and I'm not sure >> we need that. >> >> As for the implementation, I'd prefer to see one that doesn't always match every >> name against every item in the list (if we can help it). We can do this by >> adding a "previous" pointer to the exclude_struct so that it can be scanned in >> either order. The code would then scan in one direction for just the >> short-circuit rules (if any exist), and then fall back to scanning in the >> opposite direction for normal rules. If the --last-match option was still >> desired, I would make its only affect be to change the order of how the user's >> items get put into the list (so that the same scanning code could be used for >> both modes). Hmmmm... I'm not sure whether I understand what you have in mind here. In general the "short-circuit" rules are not directly equal to the rules in the "first match" approach. It is correct that a ruleset consisting of "short-circuit" rules _only_ effectively degrades the "last match" approach to a "first match" approach. But in mostly all practical situations "short-circuit" and regular rules are _intermixed_ and the position of rules in the "last match" approach is _not_ arbitrary (neither with nor without "short-circuit" rules). Hence one cannot perform an arbitrary way to match the rules! I prefer to think about the rules in a "last match" approach as sub-terms of a left-associative mathematical expression (evaluated from left to right) on sets where the used operators are "union" (+) and "difference" (-) [not intersection!]. Here the expression cannot be just re-grouped to right-association without changing the resulting set. The same for the rules in the include/exclude lists. So, while in the "first match" approach all rules are more or less equal and can be most of the time matched against in an arbitrary order, in the "last match" approach the rules can be matched against in the specified order _only_. This ordering semantics is what makes the "last match" approach a lot more flexible and powerful. The "short-circuit" add-on feature is just the usual convenience solution added to "last match" approaches to make the rule list easier to read (because one do not have to assemble all "short-circuit" rules at the end of the rule list and instead can keep them together with other rules applying to similar elements). But perhaps I still not understand exactly how you think the two approaches can be mixed together. IMHO they can't and that was the reason why I used an explicit option --last-match to switch between the approaches. >> I'm currently considering some changes to the include/exclude code: namely a >> modified version of the current merge-exclude-file.diff in the patches dir, but >> with the syntax of the include-rule lines changed. Thus, the addition of a new >> "overriding" include/exclude idiom would go well with this. I've still not looked at this patch, but whatever you do to improve rsync, please just keep in mind that at the end what should be possible is to express the "left-associative mathematical expression on file sets where the used operators are union and difference". How the syntax looks, whether an explicit option has to enable it, etc is not important. But it's important to be able to perform such more powerful and flexible synchronizations. Ralf S. Engelschall rse@engelschall.com www.engelschall.com
Please explain why the following reversed list and the current first-match algorithm don't accomplish the same thing as the last-match example you cited: - dir2/dir22/file222 + dir2/dir22/ - dir2/* + * I don't see any difference.
Ops, yes, you're right. Sorry, my fault in assuming the whole time that we want to use the usual "out to in" way of describing the include/exclude list. Certainly a wrong assumption from my side. Yes, you're fully right: if we open our mind and also accept "in to out" descriptions (although I find them less intuitive) the "first match" approach is fully sufficient and as powerful, too. Remains just the questions how many people find "out to in" and how many "in to out" descriptions more "natural". I'm a clear fan of "out to in" thinking when it comes to rsync, because the whole rsync command is a "synchronize this tree" command plus a few include/exclude options covering the inside of the tree. And it's the way one is also used to from mostly all modern ACL implementations I know of. But that's just my point of view. I don't know what people prefer more. Perhaps that's the reason why we shouldn't try to merge "first match" and "last match" approaches in rsync, because one is more suitable for "in to out" and the other for "out to in" descriptions. Hence both have their right to exist IMHO. And by keeping "first match" as the default, rsync also doesn't have any backward compatibility problems.
Created attachment 985 [details] Add optional last-match parsing Here's a simple patch that allows any filter/include/exclude file to be parsed in a last-match-wins format. The file must contain this line (on its own): [last-match] This causes all the following rules to be reversed as they are read in, making them behave in a last-match manner. This shouldn't cause a problem in an include/exclude file because the above string would have been a bogus character range that nobody should have specified.
I considered the short-circuit operators (-! and +!) for filter commands, but I am not convinced of their utility. Do you have an example of a real-life situation that they improve? Note also that the suggested syntax for the new operators is now being used to specify a rule that triggers when the pattern fails to match (such as "-! */") so if this idiom is needed it would need a new syntax (perhaps "-*" and "+*").
I think we don't need this.