
TITLE

Synopsis 5: Regexes and Rules

VERSION

    Created: 24 Jun 2002
    Last Modified: 12 May 2015
    Version: 180

This document summarizes Apocalypse 5, which is about the new regex syntax. We now try to call them regexes rather than "regular expressions" because they haven't been regular expressions for a long time, and we think the popular term "regex" is in the process of becoming a technical term with a precise meaning of: "something you do pattern matching with, kinda like a regular expression". On the other hand, one of the purposes of the redesign is to make portions of our patterns more amenable to analysis under traditional regular expression and parser semantics, and that involves making careful distinctions between which parts of our patterns and grammars are to be treated as declarative, and which parts as procedural.

In any case, when referring to recursive patterns within a grammar, the terms rule and token are generally preferred over regex.

Overview

In essence, Perl 6 natively implements Parsing Expression Grammars (PEGs) as an extension of regular expression notation. PEGs require that you provide a "pecking order" for ambiguous parses. Perl 6's pecking order is determined by a multi-level tie-breaking test:

    1) Most-derived only/proto hides less-derived of the same name
    2) Longest token matching: food\s+ beats foo by 2 or more positions
    3) Longest literal prefix: food\w* beats foo\w* by 1 position
    4) For a given proto, multis from a more-derived grammar win
    5) Within a given compilation unit, the earlier alternative or multi wins

Tiebreaker #3 will treat any initial sequence of literals as the longest literal prefix. If an alternation is embedded in the longest token match, its branches can extend the literal prefix, provided everything was literal up to the alternation. If all of the branches are totally literal, the literal prefix can also extend beyond the end of the alternation where the branches rejoin. Otherwise the end of the alternation terminates all longest literal prefixes, even for the branches that are totally literal. For example:

    / a [ 1 | 2 ] b /   # longest literals are 'a1b' and 'a2b'
    / a [ 1 | 2\w ] b / # longest literals are 'a1' and 'a2'
    / a <[ 1 2 ]> b /   # longest literal is only 'a'

Note that in this case, a character class is not treated the same as an alternation. All character classes are considered too generic to include in a longest literal string.

As with longest token matching, longest literal prefixes are transitive through subrules. If a subrule is a protoregex, it is treated just as alternation with | is, and follows the same rules about extending or terminating the longest literal prefix.

In addition to this pecking order, if any rule chosen under the pecking order backtracks, the next best rule is chosen. That is, the pecking order determines a candidate list; just because one candidate is chosen does not mean the rest are thrown away. They may, however, be explicitly thrown away by an appropriate backtracking control (sometimes called a "cut" operator, but Perl 6 has several of them, depending on how much you want to cut).
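
For instance, in this rough sketch the 'food' alternative wins the pecking order, but the following 'd' then fails; the engine falls back to the next best candidate, 'foo', rather than failing the whole match:

    say so "food" ~~ / [ foo | food ] d /;   # True -- matched as 'foo' plus 'd'
                                             #         after the 'food' candidate failed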

Also, any rule chosen to execute under #1 may choose to delegate to its ancestors; PEG backtracking has no control over this.

New match result and capture variables

The underlying match object is now available via the $/ variable, which is implicitly lexically scoped. All user access to the most recent match is through this variable, even when it doesn't look like it. The individual capture variables (such as $0, $1, etc.) are just elements of $/.

By the way, unlike in Perl 5, the numbered capture variables now start at $0 instead of $1. See below.
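
For example (a minimal sketch; the ~ prefix is used here only to print the captures as plain strings):

    if "2015-05-31" ~~ / (\d+) '-' (\d+) '-' (\d+) / {
        say ~$0;      # 2015  -- first capture, i.e. $/[0]
        say ~$/[1];   # 05    -- the same thing spelled via the match object
        say ~$2;      # 31
    }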

In order to detect accidental use of Perl 5's unrelated $/ variable, Perl 6's $/ variable may not be assigned to directly.

    $/ = $x;   # "Unsupported use of $/ variable as input record separator"
    $/ := $x;  # OK, binding
    $/ RR= $x; # OK, metaoperator
    ($/) = $x; # OK, list assignment

Unchanged syntactic features

The following regex features use the same syntax as in Perl 5:

While the syntax of | does not change, the default semantics do change slightly. We are attempting to concoct a pleasing mixture of declarative and procedural matching so that we can have the best of both. In short, you need not write your own tokener for a grammar because Perl will write one for you. See the section below on "Longest-token matching".

Simplified lexical parsing of patterns

Unlike traditional regular expressions, Perl 6 does not require you to memorize an arbitrary list of metacharacters. Instead it classifies characters by a simple rule. All glyphs (graphemes) whose base characters are either the underscore (_) or have a Unicode classification beginning with 'L' (i.e. letters) or 'N' (i.e. numbers) are always literal (i.e. self-matching) in regexes. They must be escaped with a \ to make them metasyntactic (in which case that single alphanumeric character is itself metasyntactic, but any immediately following alphanumeric character is not).

All other glyphs--including whitespace--are exactly the opposite: they are always considered metasyntactic (i.e. non-self-matching) and must be escaped or quoted to make them literal. As is traditional, they may be individually escaped with \, but in Perl 6 they may be also quoted as follows.

Sequences of one or more glyphs of either type (i.e. any glyphs at all) may be made literal by placing them inside single quotes. (Double quotes are also allowed, with the same interpolative semantics as the current language in which the regex is lexically embedded.) Quotes create a quantifiable atom, so while

    moose*

quantifies only the 'e' and matches "mooseee", saying

    'moose'*

quantifies the whole string and would match "moosemoose".
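
As a runnable sketch of that distinction (anchors are added here so the quantifier must account for the whole string):

    say so "mooseee"    ~~ / ^ moose*   $ /;   # True  -- '*' quantifies only the 'e'
    say so "moosemoose" ~~ / ^ 'moose'* $ /;   # True  -- quotes make one quantifiable atom
    say so "moosemoose" ~~ / ^ moose*   $ /;   # False -- 'moos' plus e's doesn't fit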

Here is a table that summarizes the distinctions:

                 Alphanumerics        Non-alphanumerics         Mixed
 Literal glyphs   a    1    _        \*  \$  \.   \\   \'       K\-9\!
 Metasyntax      \a   \1   \_         *   $   .    \    '      \K-\9!
 Quoted glyphs   'a'  '1'  '_'       '*' '$' '.' '\\' '\''     'K-9!'

In other words, identifier glyphs are literal (or metasyntactic when escaped), non-identifier glyphs are metasyntactic (or literal when escaped), and single quotes make everything inside them literal.

Note, however, that not all non-identifier glyphs are currently meaningful as metasyntax in Perl 6 regexes (e.g. \1 \_ - !). It is more accurate to say that all unescaped non-identifier glyphs are potential metasyntax, and reserved for future use. If you use such a sequence, a helpful compile-time error is issued indicating that you either need to quote the sequence or define a new operator to recognize it.

The semicolon character is specifically reserved as a non-meaningful metacharacter; if an unquoted semicolon is seen, the compiler will complain that the regex is missing its terminator.

Modifiers

Allowed modifiers

Some modifiers are allowed in all of the places where modifiers can occur; others are restricted to particular constructs.

In general, a modifier that affects the compilation of a regex (like :i) must be known at compile time. A modifier that affects only the calling behaviour, and not the regex itself (e.g. :pos, :overlap, :x(4)), may only appear on constructs that involve a call (like m// and s///), and not on rx//. Finally, overlapping is disallowed on substitutions, while adverbs that affect modifications (e.g. :samecase) are only allowed on substitutions.
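
As a rough illustration of these restrictions (the commented-out line is expected to be rejected, since calling-side adverbs make no sense on rx//):

    my $pat = rx:i/ abc /;               # :i affects compilation, so it is fine on rx//
    say so "xABC" ~~ m:i:pos(1)/ abc /;  # True -- :pos is a calling-side adverb, fine on m//
    # my $bad = rx:pos(1)/ abc /;        # error: calling-side adverbs are not allowed on rx//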

These principles result in the following rules:

Changed metacharacters

New metacharacters

Bracket rationalization

Variable (non-)interpolation

Extensible metasyntax (<...>)

Both < and > are metacharacters, and are usually (but not always) used in matched pairs. (Some combinations of metacharacters function as standalone tokens, and these may include angles. These are described below.) Most assertions are considered declarative; procedural assertions will be marked as exceptions.
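
As a brief sampler of some common angle forms (the ~ prefix just stringifies the resulting match):

    say ~("abc123" ~~ / <alpha>+ /);           # abc -- call to the predefined subrule <alpha>
    say ~("abc123" ~~ / <[ a..c ]>+ /);        # abc -- enumerated (user-defined) character class
    say ~("abc123" ~~ / abc <?before \d> /);   # abc -- zero-width lookahead assertion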

For matched pairs, the first character after < determines the nature of the assertion:

The following tokens include angles but are not required to balance:

Predefined Subrules

These are some of the predefined subrules for any grammar or regex:
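
As a small illustration, here are two of them in use:

    say so "foo_bar42" ~~ / ^ <ident> $ /;                 # True -- a valid identifier
    say so "3.14"      ~~ / ^ <digit>+ '.' <digit>+ $ /;   # True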

Backslash reform

Character class shortcuts

For historical and convenience reasons, the following character classes are available as backslash sequences:

   \d      <digit>    A digit
   \D      <-digit>   A nondigit
   \w      <alnum>    A word character
   \W      <-alnum>   A non-word character
   \s      <space>    A whitespace character
   \S      <-space>   A non-whitespace character
   \h                 A horizontal whitespace character
   \H                 A non-horizontal whitespace character
   \v                 A vertical whitespace character
   \V                 A non-vertical whitespace character
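
A brief illustration of these equivalences:

    say so "42"    ~~ / ^ \d+ $ /;        # True
    say so "42"    ~~ / ^ <digit>+ $ /;   # True -- the subrule spelling of the same class
    say so "a\tb"  ~~ / \h /;             # True -- tab is horizontal whitespace
    say so "a\nb"  ~~ / \v /;             # True -- newline is vertical whitespace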

Regexes constitute a first-class language, rather than just being strings

Backtracking control

Within those portions of a pattern that are considered procedural rather than declarative, you may control the backtracking behavior.
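
As one rough illustration, the :ratchet modifier (which the token and rule declarators imply) disables backtracking within the pattern:

    say so "aaab" ~~ m/ a+ ab /;           # True  -- a+ backtracks, giving back one 'a'
    say so "aaab" ~~ m:ratchet/ a+ ab /;   # False -- a+ keeps all three a's and
                                           #          may not give any of them back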

Regex Routines, Named and Anonymous

Nothing is illegal

Longest-token matching

Since "longest-token matching" is a long phrase, we will usually refer to this idea as LTM. The basic notion is that LTM is how people tend to parse text in their heads, so the computer ought to try to do the same. And parsing with LTM is all about how the computer decides which alternative of a set of alternatives is going to match.

Instead of representing temporal alternation as it does in Perl 5, in Perl 6 | represents logical alternation with declarative longest-token semantics. (You may now use || to indicate the old temporal alternation. That is, | and || now work within regex syntax much the same as they do outside of regex syntax, where they represent junctional and short-circuit OR. This includes the fact that | has tighter precedence than ||.)
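
A small sketch of the difference (the ~ prefix just stringifies the match):

    say ~("forever" ~~ / for |  forever /);   # forever -- | picks the longest alternative
    say ~("forever" ~~ / for || forever /);   # for     -- || tries alternatives in order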

Historically regex processing has proceeded in Perl via a backtracking NFA algorithm. This is quite powerful, but many parsers work more efficiently by processing rules in parallel rather than one after another, at least up to a point. If you look at something like a yacc grammar, you find a lot of pattern/action declarations where the patterns are considered in parallel, and eventually the grammar decides which action to fire off. While the default Perl view of parsing is essentially top-down (perhaps with a bottom-up "middle layer" to handle operator precedence), it is extremely useful for user understanding if at least the token processing proceeds deterministically. So for regex matching purposes we define token patterns as those patterns that can be matched without potential side effects or self-reference. (Since whitespace often has side effects at line transitions, it is usually excluded from such patterns, give or take a little lookahead.) Basically, Perl automatically derives a lexer from the grammar without you having to write one yourself.

To that end, every regex in Perl 6 is required to be able to distinguish its "pure" patterns from its actions, and return its list of initial token patterns (transitively including the token patterns of any subrule called by the "pure" part of that regex, but not including any subrule more than once, since that would involve self reference, which is not allowed in traditional regular expressions). A logical alternation using | then takes two or more of these lists and dispatches to the alternative that matches the longest token prefix. This may or may not be the alternative that comes first lexically.

However, if two alternatives match at the same length, the tie is broken first by specificity. The alternative that starts with the longest fixed string wins; that is, an exact match counts as closer than a match made using character classes. If that doesn't work, the tie is broken by one of two methods. If the alternatives are in different grammars, standard MRO (method resolution order) determines which one to try first. If the alternatives are in the same grammar file, the textually earlier alternative takes precedence. (If a grammar's rules are defined in more than one file, the order is undefined, and an explicit assertion must be used to force failure if the wrong one is tried first.)

This longest token prefix corresponds roughly to the notion of "token" in other parsing systems that use a lexer, but in the case of Perl this is largely an epiphenomenon derived automatically from the grammar definition. However, despite being automatically calculated, the set of tokens can be modified by the user; various constructs within a regex declaratively tell the grammar engine that it is finished with the pattern part and starting in on the side effects, so by inserting such constructs the user controls what is considered a token and what is not. The constructs deemed to terminate a token declaration and start the "action" part of the pattern include:

Subpatterns (captures) specifically do not terminate the token pattern, but may require a reparse of the token to find the location of the subpatterns. Likewise assertions may need to be checked out after the longest token is determined. (Alternately, if DFA semantics are simulated in any of various ways, such as by Thompson NFA, it may be possible to know when to fire off the assertions without backchecks.)

Greedy quantifiers and character classes do not terminate a token pattern. Zero-width assertions such as word boundaries are also okay.

Because such assertions can be part of the token, the lexer engine must be able to recover from the failure of such an assertion and backtrack to the next best token candidate, which might be the same length or shorter, but can never be longer than the current candidate.

For a pattern that contains a positive lookahead assertion such as <?foo> or <?before \s>, the assertion is assumed to be more specific than the subsequent pattern, so the lookahead's pattern is counted as the final part of the longest token; the longest-token matcher will be smart enough to treat the extra bit as 0-width, that is, to rematch any text traversed by the lookahead when (and if) it continues the match. (Indeed, if the entire lookahead is pure enough to participate in LTM, the rematcher may simply optimize away the rematching, since the lookahead already matched in the LTM engine.)

However, for a pattern that contains a negative lookahead assertion such as <!foo> or <!before \s>, just the opposite is true: the subsequent pattern is assumed to be more specific than the assertion's. So LTM completely ignores negative lookaheads, and continues to look for pure patterns in whatever follows the negative lookahead. You might say that positive lookaheads are opaque to LTM, but negative lookaheads are transparent to LTM. As a consequence, if you wish to write a positive lookahead that is transparent to LTM, you may indicate this with a double negation: <!!foo>. (The optimizer is free to remove the double negation, but not the transparency.)

Oddly enough, the token keyword specifically does not determine the scope of a token, except insofar as a token pattern usually doesn't do much matching of whitespace, and whitespace is the prototypical way of terminating tokens.

The initial token matcher must take into account case sensitivity (or any other canonicalization primitives) and do the right thing even when propagated up to rules that don't have the same canonicalization. That is, they must continue to represent the set of matches that the lower rule would match.

The || form has the old short-circuit semantics, and will not attempt to match its right side unless all possibilities (including all | possibilities) are exhausted on its left. The first || in a regex makes the token patterns on its left available to the outer longest-token matcher, but hides any subsequent tests from longest-token matching. Every || establishes a new longest-token matcher. That is, if you use | on the right side of ||, that right side establishes a new top level scope for longest-token processing for this subexpression and any called subrules. The right side's longest-token automaton is invisible to the left of the || or outside the regex containing the ||.

Return values from matches

Match objects

Subpattern captures

Accessing captured subpatterns

Nested subpattern captures

Quantified subpattern captures

Indirectly quantified subpattern captures

Subpattern numbering

Subrule captures

Accessing captured subrules

Repeated captures of the same subrule

Aliasing

Aliases can be named or numbered. They can be scalar-, array-, or hash-like. And they can be applied to either capturing or non-capturing constructs. The following sections highlight special features of the semantics of some of those combinations.
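
As a sketch of the most common case, a named scalar alias on a subpattern (compare the numbered captures shown earlier):

    if "2015-05-31" ~~ / $<year>=(\d ** 4) '-' $<month>=(\d\d) '-' $<day>=(\d\d) / {
        say ~$<year>;    # 2015
        say ~$<month>;   # 05
        say ~$/<day>;    # 31 -- $<day> is just a shorthand for $/<day>
    }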

Named scalar aliasing to subpatterns

Named scalar aliases applied to non-capturing brackets

Named scalar aliasing to subrules

Numbered scalar aliasing

Scalar aliases applied to quantified constructs

Array aliasing

Hash aliasing

External aliasing

Capturing from repeated matches

Grammars

Syntactic categories

For writing your own backslash and assertion subrules, you may augment (your copy of) the Regex sublanguage, using the following syntactic categories:

    augment slang Regex {
        token backslash:sym<y> { ... }   # define your own \y and \Y
        token assertion:sym<*> { ... }   # define your own <*stuff>
        token metachar:sym<,> { ... }    # define a new metacharacter
        multi method tweak (:$x) {...}   # define your own :x modifier
    }

Pragmas

Various pragmas may be used to control aspects of regex compilation and usage not otherwise provided for. These are tied to the particular declarator in question:

    use s :foo;         # control s defaults
    use m :foo;         # control m defaults
    use rx :foo;        # control rx defaults
    use regex :foo;     # control regex defaults
    use token :foo;     # control token defaults
    use rule :foo;      # control rule defaults

(It is a general policy in Perl 6 that any pragma designed to influence the surface behavior of a keyword is identical to the keyword itself, unless there is good reason to do otherwise. On the other hand, pragmas designed to influence deep semantics should not be named identically, though of course some similarity is good.)

Transliteration

Substitution

There are also method forms of m// and s///:

     $str.match(/pat/);
     $str.subst(/pat/, "replacement");
     $str.subst(/pat/, {"replacement"});
     $str.=subst(/pat/, "replacement");
     $str.=subst(/pat/, {"replacement"});

The .match and .subst methods support the adverbs of m// and s/// as named arguments, so you can write

    $str.match(/pat/, :g)

as an equivalent to

    $str.comb(/pat/, :match)

There is no syntactic sugar here, so in order to get deferred evaluation of the replacement you must put it into a closure. The syntactic sugar is provided only by the quotelike forms. First there is the standard "triple quote" form:

    s/pattern/replacement/

Only non-bracket characters may be used for the "triple quote". The right side is always evaluated as if it were a double-quoted string regardless of the quote chosen.

As with Perl 5, a bracketing form is also supported, but unlike Perl 5, Perl 6 uses the brackets only around the pattern. The replacement is then specified as if it were an ordinary item assignment, with ordinary quoting rules. To pick your own quotes on the right just use one of the q forms. The substitution above is equivalent to:

    s[pattern] = "replacement"

or

    s[pattern] = qq[replacement]

This is not a normal assignment, since the right side is evaluated each time the substitution matches (much like the pseudo-assignment to declarators can happen at strange times). It is therefore treated as a "thunk", that is, it will be called as a chunk of code that creates a dynamic scope but not a lexical scope. (You can also think of a thunk as a closure that uses the current lexical scope parasitically.) In fact, it makes no sense at all to say

    s[pattern] = { doit }

because that would try to substitute a closure into the string.
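
A short usage sketch of the bracketed form (the variable name is illustrative only):

    my $price = "10 EUR";
    $price ~~ s[ \d+ ] = "20";   # the right side is a thunk, evaluated when the match succeeds
    say $price;                  # 20 EUR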

Any scalar assignment operator may be used; the substitution macro knows how to turn

    $target ~~ s:g[pattern] op= expr

into something like:

    $target.=subst(rx[pattern], { $() op expr }, :g)

(The actual implementation of s/// must return a Match to make smartmatch work right. The rewrite above merely returns the changed string.)

So, for example, you can multiply every dollar amount by 2 with:

    s:g[\$ <( \d+ )>] *= 2

(Of course, the optimizer is free to do something faster than an actual method call.)

You'll note from the last example that substitutions only happen on the "official" string result of the match, that is, the portion of the string between the $/.from and $/.to positions. (Here we set those explicitly using the <(...)> pair; otherwise we would have had to use lookbehind to match the $.)
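
Here is a runnable rendering of that example, assuming an implementation that supports the op= substitution form described above:

    my $invoice = 'Total: $21, shipping: $4';
    $invoice ~~ s:g[ \$ <( \d+ )> ] *= 2;
    say $invoice;                # Total: $42, shipping: $8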

Please note that the :ii/:samecase and :mm/:samemark switches are really two different modifiers in one, and when the compiler desugars the quote-like forms, it distributes semantics to both the pattern and the replacement. That is, :ii on the replacement implies a :i on the pattern, and :mm implies :m. The proper method equivalents to:

    s:ii/foo/bar/
    s:mm/boo/far/

are not:

    .subst(/foo/, 'bar', :ii)   # WRONG
    .subst(/boo/, 'far', :mm)   # WRONG

but rather:

    .subst(rx:i/foo/, 'bar', :ii)   # okay
    .subst(rx:m/boo/, 'far', :mm)   # okay
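
A sketch of the resulting behavior, assuming an implementation that supports :samecase:

    my $shout = "FOO and foo";
    $shout ~~ s:g:ii/foo/bar/;
    say $shout;                  # BAR and bar -- each replacement copies the case
                                 #                of the text it replaced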

It is specifically not required of an implementation that it treat the regexes as generic with respect to case and mark. Retroactive recompilation is considered harmful. If an implementation does do lazy generic case and mark semantics, it is erroneous and non-portable for a program to depend on it.

One other difference between the s/// and .subst forms is that, while .subst returns the modified string (and cannot, therefore, be used as a smart matcher), the s/// form always returns either a Match object to indicate to smartmatch that it was successful, or a Nil value to indicate that it was not.
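
A brief illustration of that difference:

    my $s = "cat";
    say $s.subst(/c/, 'b');      # bat  -- returns the modified string; $s is unchanged
    say so ($s ~~ s/c/b/);       # True -- s/// returns a Match on success (Nil on failure)
    say $s;                      # bat  -- and s/// modified $s in place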

Likewise, for both m:g matches and s:g substitutions, there may be multiple matches found. These constructs must still continue to work under smartmatching while returning a list of matches. Fortunately, List is one of the distinguished types that a matcher may return to indicate success or failure. So these constructs simply return the list of successful matches, which will be empty (and hence false) if no matches occurred.

Positional matching, fixed width types

Matching against non-strings

When $/ is valid

To provide implementational freedom, the $/ variable is not guaranteed to be defined until the pattern reaches a sequence point that requires it (such as completing the match, or calling an embedded closure, or even evaluating a submatch that requires a Perl expression for its argument). Within regex code, $/ is officially undefined, and references to $0 or other capture variables may be compiled to produce the current value without reference to $/. Likewise a reference to $<foo> does not necessarily mean $/<foo> within the regex proper. During the execution of a match, the current match state is actually stored in a variable lexically scoped to an appropriate portion of the match, but that is not guaranteed to behave the same as the $/ object, because $/ is of type Match, while the match state is of a type derived from Cursor.

In any case this is all transparent to the user for simple matches; and outside of regex code (and inside closures within the regex) the $/ variable is guaranteed to represent the state of the match at that point. That is, normal Perl code can always depend on $<foo> meaning $/<foo>, and $0 meaning $/[0], whether that code is embedded in a closure within the regex or outside the regex after the match completes.

AUTHORS

    Damian Conway <damian@conway.org>
    Allison Randal <al@shadowed.net>
    Patrick Michaud <pmichaud@pobox.com>
    Larry Wall <larry@wall.org>
    Moritz Lenz <moritz@faui2k3.org>
    Tobias Leich <email@froggs.de>