uwildmat, uwildmat_simple, uwildmat_poison - Perform wildmat matching
#include <inn/libinn.h>
bool uwildmat(const char *text, const char *pattern);
bool uwildmat_simple(const char *text, const char *pattern);
enum uwildmat uwildmat_poison(const char *text, const char *pattern);
uwildmat compares text against the wildmat expression pattern, returning true
if and only if the expression matches the text. "@" has no special meaning in
pattern when passed to uwildmat. Both text and pattern are assumed to be in
the UTF-8 character encoding, although malformed UTF-8 sequences are treated
in a way that attempts to be mostly compatible with single-octet character
sets like ISO 8859-1. (In other words, if you try to match ISO 8859-1 text
with these routines everything should work as expected unless the ISO 8859-1
text contains valid UTF-8 sequences, which thankfully is somewhat rare.)
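As a rough illustration of that fallback, the following C sketch decides how many bytes the next character occupies: a well-formed UTF-8 sequence counts as one character, and anything else is treated as a single octet. The function name char_length is hypothetical, and the validity check is deliberately simplified (it does not reject overlong encodings, for instance); the real routines may classify malformed sequences differently.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return the number of bytes occupied by the character starting at s.
 * Hypothetical helper, not part of libinn.  Assumes s points at a
 * character inside a NUL-terminated string, not at the terminator.
 */
size_t
char_length(const unsigned char *s)
{
    size_t expected, i;

    assert(s != NULL);
    if (s[0] < 0x80)
        return 1;                   /* plain ASCII */
    else if ((s[0] & 0xE0) == 0xC0)
        expected = 2;               /* lead byte of a two-byte sequence */
    else if ((s[0] & 0xF0) == 0xE0)
        expected = 3;               /* lead byte of a three-byte sequence */
    else if ((s[0] & 0xF8) == 0xF0)
        expected = 4;               /* lead byte of a four-byte sequence */
    else
        return 1;                   /* stray continuation or invalid byte */

    /* Every following byte must be a continuation byte (10xxxxxx).  The
       terminating NUL fails this check, so we never read past the end. */
    for (i = 1; i < expected; i++)
        if ((s[i] & 0xC0) != 0x80)
            return 1;               /* malformed: treat as a single octet */
    return expected;
}
```

With this rule, the ISO 8859-1 byte 0xE9 followed by an ASCII letter counts as one single-octet character, while the UTF-8 pair 0xC3 0xA9 counts as one two-byte character.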
uwildmat_simple is identical to uwildmat except that neither "!" nor "," has
any special meaning and pattern is always treated as a single pattern. This
function exists solely to support legacy interfaces like NNTP's XPAT command,
and should be avoided when implementing new features.
uwildmat_poison works similarly to uwildmat, except that "@" as the first
character of one of the patterns in the expression (see below) "poisons" the
match if it matches. uwildmat_poison returns UWILDMAT_MATCH if the expression
matches the text, UWILDMAT_FAIL if it doesn't, and UWILDMAT_POISON if the
expression doesn't match because a poisoned pattern matched the text. These
enumeration constants are defined in the inn/libinn.h header.
A wildmat expression follows rules similar to those of shell filename
wildcards but with some additions and changes. A wildmat expression is
composed of one or more wildmat patterns separated by commas. Each character
in the wildmat pattern matches a literal occurrence of that same character in
the text, with the exception of the following metacharacters:
- ?
  Matches any single character (including a single UTF-8 multibyte
  character, so "?" can match more than one byte).
- *
  Matches any sequence of zero or more characters.
- \
  Turns off any special meaning of the following character; the following
  character will match itself in the text. "\" will escape any character,
  including another backslash or a comma that otherwise would separate a
  pattern from the next pattern in an expression. Note that "\" is not
  special inside a character set (no metacharacters are).
- [...]
  A character set, which matches any single character that falls within that
  set. The presence of a character between the brackets adds that character
  to the set; for example, "[amv]" specifies the set containing the
  characters "a", "m", and "v". A range of characters may be specified using
  "-"; for example, "[0-5abc]" is equivalent to "[012345abc]". Characters
  are ordered by Unicode code point, and if the start character of a range
  falls after its ending character in that order, the results of attempting
  a match with that pattern are undefined.
  In order to include a literal "]" character in the set, it must be the
  first character of the set (possibly following "^"); for example, "[]a]"
  matches either "]" or "a". To include a literal "-" character in the set,
  it must be either the first or the last character of the set. Backslashes
  have no special meaning inside a character set, nor do any other of the
  wildmat metacharacters.
- [^...]
  A negated character set. Follows the same rules as a character set above,
  but matches any character not contained in the set. So, for example,
  "[^]-]" matches any character except "]" and "-".
In addition, "!" (and possibly "@") have special meaning as
the first character of a pattern; see below.
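The metacharacter rules can be sketched in C as a small recursive matcher for a single pattern. This is only an illustration under stated assumptions, not the libinn implementation: the names set_matches and match_pattern are hypothetical, matching is done byte by byte (so "?" and sets cover only single-octet characters), "!" , "@" and "," are not interpreted, and the simple backtracking on "*" can be slow on patterns with many stars.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * p points just past '['.  Returns 1 if c belongs to the (possibly negated)
 * set, 0 if not, and -1 if the set has no closing ']'.  On success, *end
 * is set to just past the closing ']'.
 */
int
set_matches(const unsigned char *p, unsigned char c, const unsigned char **end)
{
    bool negated = false, found = false, first = true;

    if (*p == '^') {
        negated = true;
        p++;
    }
    /* A ']' in first position is a literal member, not the terminator. */
    for (; *p != '\0' && (first || *p != ']'); first = false) {
        unsigned char lo = *p++;

        if (*p == '-' && p[1] != ']' && p[1] != '\0') {
            unsigned char hi = p[1];    /* lo-hi range */

            p += 2;
            if (lo <= c && c <= hi)
                found = true;
        } else if (lo == c) {
            found = true;               /* single member, including a
                                           leading or trailing '-' */
        }
    }
    if (*p != ']')
        return -1;
    *end = p + 1;
    return found != negated;
}

/* Match text against one pattern, anchored at both ends. */
bool
match_pattern(const char *text, const char *pattern)
{
    const unsigned char *t = (const unsigned char *) text;
    const unsigned char *p = (const unsigned char *) pattern;
    const unsigned char *end;

    assert(text != NULL && pattern != NULL);
    for (; *p != '\0'; t++, p++) {
        switch (*p) {
        case '*':
            while (p[1] == '*')
                p++;                    /* consecutive stars act as one */
            if (p[1] == '\0')
                return true;            /* trailing star takes the rest */
            for (; *t != '\0'; t++)     /* try each possible split point */
                if (match_pattern((const char *) t, (const char *) p + 1))
                    return true;
            return false;
        case '?':
            if (*t == '\0')
                return false;
            break;
        case '[':
            if (*t == '\0' || set_matches(p + 1, *t, &end) <= 0)
                return false;
            p = end - 1;                /* loop increment steps past ']' */
            break;
        case '\\':
            if (p[1] != '\0')
                p++;                    /* next character is literal */
            /* fall through */
        default:
            if (*t != *p)
                return false;
            break;
        }
    }
    return *t == '\0';
}
```

For instance, match_pattern("news.misc", "news.*") and match_pattern("]", "[]a]") both succeed under this sketch, while match_pattern("ab", "a") fails because the pattern must cover the whole text.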
When matching a wildmat expression against some text, each comma-separated
pattern is matched in order from left to right. In order to match, the pattern
must match the whole text; in regular expression terminology, it's implicitly
anchored at both the beginning and the end. For example, the pattern
"a" matches only the text "a"; it doesn't match
"ab" or "ba" or even "aa". If none of the
patterns match, the whole expression doesn't match. Otherwise, whether the
expression matches is determined entirely by the rightmost matching pattern;
the expression matches the text if and only if the rightmost matching pattern
is not negated.
For example, consider the text "news.misc". The expression
"*" matches this text, of course, as does "comp.*,news.*"
(because the second pattern matches). "news.*,!news.misc" does not
match this text because both patterns match, meaning that the rightmost takes
precedence, and the rightmost matching pattern is negated.
"news.*,!news.misc,*.misc" does match this text, since the rightmost
matching pattern is not negated.
Note that the expression "!news.misc" can't match anything. Either the
pattern doesn't match, in which case no patterns match and the expression
doesn't match, or the pattern does match, in which case because it's negated
the expression doesn't match. "*,!news.misc", on the other hand, is
a useful expression that matches anything except "news.misc".
"!" has significance only as the first character of a pattern;
anywhere else in the pattern, it matches a literal "!" in the text
like any other non-metacharacter.
If the uwildmat_poison interface is used, then "@" behaves the same as "!"
except that if an expression fails to match because the rightmost matching
pattern began with "@", UWILDMAT_POISON is returned instead of UWILDMAT_FAIL.
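The rightmost-match rule, including "!" and "@", can be sketched in C as follows. This is a minimal illustration under stated assumptions rather than the libinn code: the enum and function names are hypothetical stand-ins for the real UWILDMAT_* constants and uwildmat_poison, and the helper match_span is a reduced matcher that understands only "*", "?" and "\" over bytes (no character sets, no UTF-8 awareness).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical result type mirroring the three documented outcomes. */
enum match_result { MATCH_FAIL, MATCH_OK, MATCH_POISON };

/* Stand-in matcher for the pattern in [p, pend): '*', '?', '\' only. */
bool
match_span(const char *t, const char *p, const char *pend)
{
    for (; p < pend; t++, p++) {
        if (*p == '*') {
            /* Try the rest of the pattern at every position, including
               the end of the text. */
            do {
                if (match_span(t, p + 1, pend))
                    return true;
            } while (*t++ != '\0');
            return false;
        }
        if (*t == '\0')
            return false;
        if (*p == '?')
            continue;
        if (*p == '\\' && p + 1 < pend)
            p++;                        /* escaped character is literal */
        if (*t != *p)
            return false;
    }
    return *t == '\0';
}

/* Evaluate a comma-separated expression; the rightmost matching
   pattern decides the result. */
enum match_result
match_expression(const char *text, const char *expr)
{
    enum match_result result = MATCH_FAIL;  /* no pattern matched yet */
    const char *start = expr;

    assert(text != NULL && expr != NULL);
    for (;;) {
        const char *end = start;
        const char *p = start;
        enum match_result on_match = MATCH_OK;

        /* Find the next comma that is not escaped by a backslash. */
        while (*end != '\0' && *end != ',') {
            if (*end == '\\' && end[1] != '\0')
                end++;
            end++;
        }
        /* '!' and '@' are special only as the first character. */
        if (p < end && *p == '!') {
            on_match = MATCH_FAIL;
            p++;
        } else if (p < end && *p == '@') {
            on_match = MATCH_POISON;
            p++;
        }
        if (match_span(text, p, end))
            result = on_match;          /* later matches override earlier */
        if (*end == '\0')
            break;
        start = end + 1;
    }
    return result;
}
```

Under this sketch, matching "news.misc" against "news.*,!news.misc,*.misc" yields MATCH_OK, against "news.*,!news.misc" yields MATCH_FAIL, and against "*,@news.misc" yields MATCH_POISON.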
If the uwildmat_simple interface is used, the matching rules are the same as
above except that none of "!", "@", or "," have any special meaning at all
and only match those literal characters.
All of these functions internally convert the passed arguments to const unsigned
char pointers. The only reason why they take regular char pointers instead of
unsigned char is for the convenience of INN and other callers that may not be
using unsigned char everywhere they should. In a future revision, the public
interface should be changed to just take unsigned char pointers.
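The reason the conversion matters can be shown with a short C sketch. On platforms where plain char is signed, octets of 0x80 and above compare as negative values, so a range test written directly on char gives the wrong answer for non-ASCII bytes. The helper name in_range is hypothetical and is not part of libinn.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * True if octet c lies in the inclusive range [lo, hi].  All three values
 * are converted to unsigned char first so that the comparison follows
 * octet order even where plain char is signed.
 */
bool
in_range(char c, char lo, char hi)
{
    unsigned char uc = (unsigned char) c;
    unsigned char ulo = (unsigned char) lo;
    unsigned char uhi = (unsigned char) hi;

    assert(ulo <= uhi);     /* a reversed range is not meaningful here */
    return ulo <= uc && uc <= uhi;
}
```

Without the conversions, a call such as in_range('\xE9', 'a', '\xFF') would fail on a signed-char platform because both '\xE9' and '\xFF' would compare as less than 'a'.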
Written by Rich $alz in 1986, and posted to Usenet several times since then,
most notably in comp.sources.misc in March, 1991.
Lars Mathiesen enhanced the multi-asterisk failure mode in early 1991.
Rich and Lars increased the efficiency of star patterns and reposted it to
comp.sources.misc in April, 1991.
Robert Elz added minus sign and close bracket handling in June, 1991.
Russ Allbery added support for comma-separated patterns and the "!" and "@"
metacharacters to the core wildmat routines in July, 2000. He also added
support for UTF-8 characters, changed the default behavior to assume that
both the text and the pattern are in UTF-8, and largely rewrote this
documentation to expand and clarify the description of how a wildmat
expression matches.
Please note that the interfaces to these functions are named uwildmat and the
like rather than wildmat to distinguish them from the wildmat function
provided by Rich $alz's original implementation. While this code is heavily
based on Rich's original code, it has substantial differences, including the
extension to support UTF-8 characters, and has noticeable functionality
changes. Any bugs present in it aren't Rich's fault.
grep(1), fnmatch(3), regex(3), regexp(3).