PCREPATTERN(3)                                   PCREPATTERN(3)





NAME
       PCRE - Perl-compatible regular expressions

PCRE REGULAR EXPRESSION DETAILS

       The syntax and semantics of the regular expressions sup-
       ported by PCRE are described below. Regular  expressions
       are  also  described  in the Perl documentation and in a
       number of books, some of which  have  copious  examples.
       Jeffrey  Friedl's  "Mastering Regular Expressions", pub-
       lished by O'Reilly, covers regular expressions in  great
       detail.  This  description of PCRE's regular expressions
       is intended as reference material.

       The original operation of PCRE was on  strings  of  one-
       byte  characters. However, there is now also support for
       UTF-8 character strings. To use  this,  you  must  build
       PCRE  to  include UTF-8 support, and then call pcre_com-
       pile() with the PCRE_UTF8 option. How this affects  pat-
       tern  matching  is  mentioned  in  several places below.
       There is also a summary of UTF-8 features in the section
       on UTF-8 support in the main pcre page.

       The  remainder  of  this document discusses the patterns
       that are supported by PCRE when its main matching  func-
       tion,  pcre_exec(),  is  used.   From  release 6.0, PCRE
       offers  a  second  matching  function,  pcre_dfa_exec(),
       which  matches  using  a different algorithm that is not
       Perl-compatible. The advantages and disadvantages of the
       alternative function, and how it differs from the normal
       function, are discussed in the pcrematching page.

       A regular  expression  is  a  pattern  that  is  matched
       against  a subject string from left to right. Most char-
       acters stand for themselves in a pattern, and match  the
       corresponding  characters  in  the subject. As a trivial
       example, the pattern

         The quick brown fox

       matches a portion of a subject string that is  identical
       to  itself.  When  caseless  matching  is specified (the
       PCRE_CASELESS option), letters are matched independently
       of case. In UTF-8 mode, PCRE always understands the con-
       cept of case for characters whose values are  less  than
       128,  so caseless matching is always possible. For char-
       acters with higher values, the concept of case  is  sup-
       ported  if  PCRE  is compiled with Unicode property sup-
       port, but not otherwise.  If you want  to  use  caseless
       matching  for  characters 128 and above, you must ensure
       that PCRE is compiled with Unicode property  support  as
       well as with UTF-8 support.

       The  power of regular expressions comes from the ability
       to include alternatives and repetitions in the  pattern.
       These are encoded in the pattern by the use of metachar-
       acters, which do not stand for  themselves  but  instead
       are interpreted in some special way.

       There  are  two  different sets of metacharacters: those
       that are  recognized  anywhere  in  the  pattern  except
       within square brackets, and those that are recognized in
       square brackets. Outside square brackets, the  metachar-
       acters are as follows:

         \      general escape character with several uses
         ^       assert  start of string (or line, in multiline
       mode)
         $      assert end of string  (or  line,  in  multiline
       mode)
         .      match any character except newline (by default)
         [      start character class definition
         |      start of alternative branch
         (      start subpattern
         )      end subpattern
         ?      extends the meaning of (
                also 0 or 1 quantifier
                also quantifier minimizer
         *      0 or more quantifier
         +      1 or more quantifier
                also "possessive quantifier"
         {      start min/max quantifier

       Part of a pattern that is in square brackets is called a
       "character   class".  In  a  character  class  the  only
       metacharacters are:

         \      general escape character
         ^      negate the class, but only if the first charac-
       ter
         -      indicates character range
         [       POSIX  character  class  (only  if followed by
       POSIX
                  syntax)
         ]      terminates the character class

       The following sections describe the use of each  of  the
       metacharacters.

BACKSLASH

       The backslash character has several uses. Firstly, if it
       is followed by a non-alphanumeric  character,  it  takes
       away  any  special meaning that character may have. This
       use of backslash as an  escape  character  applies  both
       inside and outside character classes.

       For  example,  if  you  want to match a * character, you
       write \* in the pattern.  This escaping  action  applies
       whether  or  not the following character would otherwise
       be interpreted as a metacharacter, so it is always  safe
       to  precede a non-alphanumeric with backslash to specify
       that it stands for itself. In particular, if you want to
       match a backslash, you write \\.

       If  a pattern is compiled with the PCRE_EXTENDED option,
       whitespace in the pattern (other  than  in  a  character
       class)  and  characters  between a # outside a character
       class and the next newline  character  are  ignored.  An
       escaping  backslash  can be used to include a whitespace
       or # character as part of the pattern.

       If you  want  to  remove  the  special  meaning  from  a
       sequence  of  characters,  you can do so by putting them
       between \Q and \E. This is different from Perl in that $
       and  @  are  handled as literals in \Q...\E sequences in
       PCRE, whereas in Perl, $ and @ cause variable interpola-
       tion. Note the following examples:

         Pattern            PCRE matches   Perl matches

         \Qabc$xyz\E        abc$xyz        abc followed by the
                                             contents of $xyz
         \Qabc\$xyz\E       abc\$xyz       abc\$xyz
         \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz

       The  \Q...\E sequence is recognized both inside and out-
       side character classes.

   Non-printing characters

       A second use of backslash provides  a  way  of  encoding
       non-printing characters in patterns in a visible manner.
       There is no restriction on the appearance of  non-print-
       ing  characters,  apart from the binary zero that termi-
       nates a pattern, but when a pattern is being prepared by
       text  editing,  it  is  usually easier to use one of the
       following escape sequences than the binary character  it
       represents:

         \a        alarm, that is, the BEL character (hex 07)
         \cx       "control-x", where x is any character
         \e        escape (hex 1B)
         \f        formfeed (hex 0C)
         \n        newline (hex 0A)
         \r        carriage return (hex 0D)
         \t        tab (hex 09)
         \ddd      character with octal code ddd, or backrefer-
       ence
         \xhh      character with hex code hh
         \x{hhh..} character with hex code hhh...  (UTF-8  mode
       only)

       The precise effect of \cx is as follows: if x is a lower
       case letter, it is converted to upper case. Then  bit  6
       of the character (hex 40) is inverted.  Thus \cz becomes
       hex 1A, but \c{ becomes hex 3B, while  \c;  becomes  hex
       7B.

       After  \x,  from zero to two hexadecimal digits are read
       (letters can be in upper or lower case). In UTF-8  mode,
       any  number of hexadecimal digits may appear between \x{
       and }, but the value of the character code must be  less
       than  2**31  (that  is, the maximum hexadecimal value is
       7FFFFFFF). If characters other than  hexadecimal  digits
       appear  between \x{ and }, or if there is no terminating
       }, this form of escape is not recognized.  Instead,  the
       initial  \x  will  be interpreted as a basic hexadecimal
       escape, with no following  digits,  giving  a  character
       whose value is zero.

       Characters  whose  value is less than 256 can be defined
       by either of the two syntaxes for \x  when  PCRE  is  in
       UTF-8  mode.  There is no difference in the way they are
       handled. For  example,  \xdc  is  exactly  the  same  as
       \x{dc}.

       After  \0  up  to  two further octal digits are read. In
       both cases, if there are fewer  than  two  digits,  just
       those  that  are  present  are  used.  Thus the sequence
       \0\x\07 specifies two binary zeros  followed  by  a  BEL
       character  (code value 7). Make sure you supply two dig-
       its after the initial zero if the pattern character that
       follows is itself an octal digit.

       The  handling  of  a backslash followed by a digit other
       than 0 is complicated.  Outside a character class,  PCRE
       reads  it  and any following digits as a decimal number.
       If the number is less than 10, or if there have been  at
       least  that  many previous capturing left parentheses in
       the expression, the entire sequence is taken as  a  back
       reference.  A  description  of  how  this works is given
       later, following the discussion of parenthesized subpat-
       terns.

       Inside  a  character  class, or if the decimal number is
       greater than 9 and there have not been that many captur-
       ing  subpatterns, PCRE re-reads up to three octal digits
       following the backslash, and  generates  a  single  byte
       from the least significant 8 bits of the value. Any sub-
       sequent digits stand for themselves.  For example:

         \040   is another way of writing a space
         \40    is the same, provided there are fewer than 40
                   previous capturing subpatterns
         \7     is always a back reference
         \11    might be a back reference, or another way of
                   writing a tab
         \011   is always a tab
         \0113  is a tab followed by the character "3"
         \113   might be a back reference, otherwise the
                   character with octal code 113
         \377   might be a back reference, otherwise
                   the byte consisting entirely of 1 bits
         \81    is either a back reference, or a binary zero
                   followed by the two characters "8" and "1"

       Note that octal values of 100 or  greater  must  not  be
       introduced by a leading zero, because no more than three
       octal digits are ever read.

       All the sequences that define a single byte value  or  a
       single  UTF-8 character (in UTF-8 mode) can be used both
       inside  and  outside  character  classes.  In  addition,
       inside a character class, the sequence \b is interpreted
       as the backspace character (hex 08), and the sequence \X
       is interpreted as the character "X". Outside a character
       class, these  sequences  have  different  meanings  (see
       below).

   Generic character types

       The  third  use  of  backslash is for specifying generic
       character types. The following are always recognized:

         \d     any decimal digit
         \D     any character that is not a decimal digit
         \s     any whitespace character
         \S     any character that is not a whitespace  charac-
       ter
         \w     any "word" character
         \W     any "non-word" character

       Each  pair  of  escape sequences partitions the complete
       set of characters into  two  disjoint  sets.  Any  given
       character matches one, and only one, of each pair.

       These  character  type  sequences can appear both inside
       and outside character classes. They each match one char-
       acter  of  the appropriate type. If the current matching
       point is at the end of the subject string, all  of  them
       fail, since there is no character to match.

       For  compatibility  with  Perl, \s does not match the VT
       character (code 11).  This makes it different  from  the
       the  POSIX  "space" class. The \s characters are HT (9),
       LF (10), FF (12), CR (13), and space (32).

       A "word" character is an  underscore  or  any  character
       less  than 256 that is a letter or digit. The definition
       of letters and digits is controlled by PCRE's low-valued
       character tables, and may vary if locale-specific match-
       ing is taking place (see "Locale support" in the pcreapi
       page). For example, in the "fr_FR" (French) locale, some
       character codes greater than 128 are used  for  accented
       letters, and these are matched by \w.

       In  UTF-8  mode, characters with values greater than 128
       never match \d, \s, or \w, and always match \D, \S,  and
       \W.  This  is  true even when Unicode character property
       support is available.

   Unicode character properties

       When PCRE is built with Unicode character property  sup-
       port, three additional escape sequences to match generic
       character  types  are  available  when  UTF-8  mode   is
       selected. They are:

        \p{xx}   a character with the xx property
        \P{xx}   a character without the xx property
        \X       an extended Unicode sequence

       The  property  names represented by xx above are limited
       to the Unicode general category properties. Each charac-
       ter  has  exactly one such property, specified by a two-
       letter abbreviation. For compatibility with Perl,  nega-
       tion  can be specified by including a circumflex between
       the opening brace and the property  name.  For  example,
       \p{^Lu} is the same as \P{Lu}.

       If  only  one  letter  is  specified  with  \p or \P, it
       includes all the properties that start with that letter.
       In  this  case,  in  the  absence of negation, the curly
       brackets in the escape sequence are optional; these  two
       examples have the same effect:

         \p{L}
         \pL

       The following property codes are supported:

         C     Other
         Cc    Control
         Cf    Format
         Cn    Unassigned
         Co    Private use
         Cs    Surrogate

         L     Letter
         Ll    Lower case letter
         Lm    Modifier letter
         Lo    Other letter
         Lt    Title case letter
         Lu    Upper case letter

         M     Mark
         Mc    Spacing mark
         Me    Enclosing mark
         Mn    Non-spacing mark

         N     Number
         Nd    Decimal number
         Nl    Letter number
         No    Other number

         P     Punctuation
         Pc    Connector punctuation
         Pd    Dash punctuation
         Pe    Close punctuation
         Pf    Final punctuation
         Pi    Initial punctuation
         Po    Other punctuation
         Ps    Open punctuation

         S     Symbol
         Sc    Currency symbol
         Sk    Modifier symbol
         Sm    Mathematical symbol
         So    Other symbol

         Z     Separator
         Zl    Line separator
         Zp    Paragraph separator
         Zs    Space separator

       Extended  properties  such  as "Greek" or "InMusicalSym-
       bols" are not supported by PCRE.

       Specifying  caseless  matching  does  not  affect  these
       escape  sequences.  For  example,  \p{Lu} always matches
       only upper case letters.

       The \X escape matches any number of  Unicode  characters
       that form an extended Unicode sequence. \X is equivalent
       to

         (?>\PM\pM*)

       That is, it matches a character without the "mark" prop-
       erty,  followed  by  zero  or  more  characters with the
       "mark" property, and treats the sequence  as  an  atomic
       group  (see below).  Characters with the "mark" property
       are typically accents that affect the preceding  charac-
       ter.

       Matching  characters  by  Unicode  property is not fast,
       because PCRE has to search  a  structure  that  contains
       data  for  over fifteen thousand characters. That is why
       the traditional escape sequences such as \d  and  \w  do
       not use Unicode properties in PCRE.

   Simple assertions

       The fourth use of backslash is for certain simple asser-
       tions. An assertion specifies a condition that has to be
       met  at a particular point in a match, without consuming
       any characters from the subject string. The use of  sub-
       patterns  for  more  complicated assertions is described
       below.  The backslashed assertions are:

         \b     matches at a word boundary
         \B     matches when not at a word boundary
         \A     matches at start of subject
         \Z     matches at end of subject or before newline  at
       end
         \z     matches at end of subject
         \G     matches at first matching position in subject

       These  assertions  may  not  appear in character classes
       (but note that \b has a different  meaning,  namely  the
       backspace character, inside a character class).

       A  word  boundary  is  a  position in the subject string
       where the current character and the  previous  character
       do  not both match \w or \W (i.e. one matches \w and the
       other matches \W), or the start or end of the string  if
       the first or last character matches \w, respectively.

       The  \A,  \Z,  and  \z assertions differ from the tradi-
       tional circumflex and dollar (described in the next sec-
       tion) in that they only ever match at the very start and
       end of the subject string,  whatever  options  are  set.
       Thus,  they  are  independent  of  multiline mode. These
       three assertions are not affected by the PCRE_NOTBOL  or
       PCRE_NOTEOL  options, which affect only the behaviour of
       the circumflex and dollar  metacharacters.  However,  if
       the  startoffset  argument  of  pcre_exec() is non-zero,
       indicating that matching is to start at  a  point  other
       than  the  beginning of the subject, \A can never match.
       The difference between \Z and  \z  is  that  \Z  matches
       before  a  newline  that  is  the  last character of the
       string as well as at the end of the string,  whereas  \z
       matches only at the end.

       The  \G assertion is true only when the current matching
       position is at the start point of the match,  as  speci-
       fied by the startoffset argument of pcre_exec(). It dif-
       fers from \A when the value of startoffset is  non-zero.
       By  calling  pcre_exec() multiple times with appropriate
       arguments, you can mimic Perl's /g option, and it is  in
       this kind of implementation where \G can be useful.

       Note,  however, that PCRE's interpretation of \G, as the
       start of the current match,  is  subtly  different  from
       Perl's,  which  defines  it  as  the end of the previous
       match. In Perl, these can be different when  the  previ-
       ously  matched  string was empty. Because PCRE does just
       one match at a time, it cannot reproduce this behaviour.

       If  all the alternatives of a pattern begin with \G, the
       expression is anchored to the starting  match  position,
       and  the  "anchored" flag is set in the compiled regular
       expression.

CIRCUMFLEX AND DOLLAR

       Outside a character class, in the default matching mode,
       the  circumflex  character  is an assertion that is true
       only if the current matching point is at  the  start  of
       the  subject  string.  If  the  startoffset  argument of
       pcre_exec() is non-zero, circumflex can never  match  if
       the  PCRE_MULTILINE  option is unset. Inside a character
       class, circumflex has an entirely different meaning (see
       below).

       Circumflex  need  not be the first character of the pat-
       tern if a number of alternatives are  involved,  but  it
       should  be  the first thing in each alternative in which
       it appears if the pattern is ever to match that  branch.
       If  all  possible  alternatives start with a circumflex,
       that is, if the pattern is constrained to match only  at
       the start of the subject, it is said to be an "anchored"
       pattern. (There are also other constructs that can cause
       a pattern to be anchored.)

       A  dollar character is an assertion that is true only if
       the current matching point is at the end of the  subject
       string,  or  immediately before a newline character that
       is the last character in the string (by default). Dollar
       need  not be the last character of the pattern if a num-
       ber of alternatives are involved, but it should  be  the
       last item in any branch in which it appears.  Dollar has
       no special meaning in a character class.

       The meaning of dollar can be changed so that it  matches
       only  at  the  very  end  of  the string, by setting the
       PCRE_DOLLAR_ENDONLY option at compile  time.  This  does
       not affect the \Z assertion.

       The meanings of the circumflex and dollar characters are
       changed if the PCRE_MULTILINE option is set.  When  this
       is  the  case,  they match immediately after and immedi-
       ately   before   an    internal    newline    character,
       respectively,  in  addition to matching at the start and
       end of the subject  string.  For  example,  the  pattern
       /^abc$/  matches the subject string "def\nabc" (where \n
       represents a newline character) in multiline  mode,  but
       not otherwise.  Consequently, patterns that are anchored
       in single line mode because all branches  start  with  ^
       are not anchored in multiline mode, and a match for cir-
       cumflex is possible when  the  startoffset  argument  of
       pcre_exec()  is non-zero. The PCRE_DOLLAR_ENDONLY option
       is ignored if PCRE_MULTILINE is set.

       Note that the sequences \A, \Z, and \z can  be  used  to
       match  the  start  and end of the subject in both modes,
       and if all branches of a pattern start  with  \A  it  is
       always anchored, whether PCRE_MULTILINE is set or not.

FULL STOP (PERIOD, DOT)

       Outside  a character class, a dot in the pattern matches
       any one character in the subject, including a non-print-
       ing  character,  but not (by default) newline.  In UTF-8
       mode, a dot matches any UTF-8 character, which might  be
       more than one byte long, except (by default) newline. If
       the PCRE_DOTALL option is set, dots  match  newlines  as
       well. The handling of dot is entirely independent of the
       handling of circumflex and dollar, the only relationship
       being that they both involve newline characters. Dot has
       no special meaning in a character class.

MATCHING A SINGLE BYTE

       Outside  a  character  class,  the  escape  sequence  \C
       matches  any  one  byte,  both in and out of UTF-8 mode.
       Unlike a dot, it can match a  newline.  The  feature  is
       provided  in  Perl in order to match individual bytes in
       UTF-8 mode. Because it breaks up UTF-8  characters  into
       individual  bytes,  what  remains in the string may be a
       malformed UTF-8 string. For this reason, the  \C  escape
       sequence is best avoided.

       PCRE  does  not  allow \C to appear in lookbehind asser-
       tions (described below),  because  in  UTF-8  mode  this
       would  make it impossible to calculate the length of the
       lookbehind.

SQUARE BRACKETS AND CHARACTER CLASSES

       An opening square bracket introduces a character  class,
       terminated by a closing square bracket. A closing square
       bracket on its own is not special. If a  closing  square
       bracket  is required as a member of the class, it should
       be the first data character in the class (after an  ini-
       tial  circumflex,  if  present)  or escaped with a back-
       slash.

       A character class matches a single character in the sub-
       ject.  In UTF-8 mode, the character may occupy more than
       one byte. A matched character must  be  in  the  set  of
       characters  defined by the class, unless the first char-
       acter in the class definition is a circumflex, in  which
       case  the  subject  character  must  not  be  in the set
       defined by  the  class.  If  a  circumflex  is  actually
       required  as a member of the class, ensure it is not the
       first character, or escape it with a backslash.

       For example, the character  class  [aeiou]  matches  any
       lower  case  vowel, while [^aeiou] matches any character
       that is not a lower case vowel. Note that  a  circumflex
       is just a convenient notation for specifying the charac-
       ters that are in the class by enumerating those that are
       not.  A  class  that  starts with a circumflex is not an
       assertion: it still consumes a character from  the  sub-
       ject  string,  and  therefore  it  fails  if the current
       pointer is at the end of the string.

       In UTF-8 mode, characters with values greater  than  255
       can be included in a class as a literal string of bytes,
       or by using the \x{ escaping mechanism.

       When caseless matching is set, any letters  in  a  class
       represent both their upper case and lower case versions,
       so for example, a caseless [aeiou] matches "A"  as  well
       as  "a",  and  a  caseless  [^aeiou] does not match "A",
       whereas a caseful version would.  In  UTF-8  mode,  PCRE
       always  understands  the  concept of case for characters
       whose values are less than 128, so caseless matching  is
       always  possible. For characters with higher values, the
       concept of case is supported if PCRE  is  compiled  with
       Unicode  property  support,  but  not otherwise.  If you
       want to use caseless matching  for  characters  128  and
       above,  you  must ensure that PCRE is compiled with Uni-
       code property support as well as with UTF-8 support.

       The newline character is never treated  in  any  special
       way  in  character  classes, whatever the setting of the
       PCRE_DOTALL or PCRE_MULTILINE options is. A  class  such
       as [^a] will always match a newline.

       The  minus  (hyphen)  character can be used to specify a
       range of characters in a character class.  For  example,
       [d-m]  matches any letter between d and m, inclusive. If
       a minus character is required in a  class,  it  must  be
       escaped  with  a backslash or appear in a position where
       it cannot be interpreted as indicating  a  range,  typi-
       cally as the first or last character in the class.

       It  is not possible to have the literal character "]" as
       the end character of a range. A pattern such as  [W-]46]
       is  interpreted  as  a  class of two characters ("W" and
       "-") followed by a literal string  "46]",  so  it  would
       match  "W46]"  or "-46]". However, if the "]" is escaped
       with a backslash it is interpreted as the end of  range,
       so [W-\]46] is interpreted as a class containing a range
       followed by two other characters. The octal or hexadeci-
       mal  representation  of  "]"  can  also be used to end a
       range.

       Ranges operate in the collating  sequence  of  character
       values.  They  can also be used for characters specified
       numerically, for example  [\000-\037].  In  UTF-8  mode,
       ranges  can  include characters whose values are greater
       than 255, for example [\x{100}-\x{2ff}].

       If a range that includes letters is used  when  caseless
       matching  is set, it matches the letters in either case.
       For example, [W-c] is  equivalent  to  [][\\^_`wxyzabc],
       matched  caselessly, and in non-UTF-8 mode, if character
       tables for the "fr_FR" locale are  in  use,  [\xc8-\xcb]
       matches  accented  E  characters in both cases. In UTF-8
       mode, PCRE supports the concept of case  for  characters
       with  values  greater  than 128 only when it is compiled
       with Unicode property support.

       The character types \d, \D, \p, \P, \s, \S, \w,  and  \W
       may  also appear in a character class, and add the char-
       acters that  they  match  to  the  class.  For  example,
       [\dABCDEF]  matches  any hexadecimal digit. A circumflex
       can conveniently be used with the upper  case  character
       types  to  specify  a  more restricted set of characters
       than the matching lower  case  type.  For  example,  the
       class [^\W_] matches any letter or digit, but not under-
       score.

       The only metacharacters that are recognized in character
       classes  are  backslash,  hyphen  (only  where it can be
       interpreted as specifying a range), circumflex (only  at
       the  start), opening square bracket (only when it can be
       interpreted as introducing a POSIX class name - see  the
       next   section),  and  the  terminating  closing  square
       bracket. However, escaping other non-alphanumeric  char-
       acters does no harm.

POSIX CHARACTER CLASSES

       Perl  supports the POSIX notation for character classes.
       This uses names enclosed by [: and :] within the enclos-
       ing  square  brackets. PCRE also supports this notation.
       For example,

         [01[:alpha:]%]

       matches "0", "1", any alphabetic character, or "%".  The
       supported class names are

         alnum    letters and digits
         alpha    letters
         ascii    character codes 0 - 127
         blank    space or tab only
         cntrl    control characters
         digit    decimal digits (same as \d)
         graph    printing characters, excluding space
         lower    lower case letters
         print    printing characters, including space
         punct     printing  characters,  excluding letters and
       digits
         space    white space (not quite the same as \s)
         upper    upper case letters
         word     "word" characters (same as \w)
         xdigit   hexadecimal digits

       The "space" characters are HT (9), LF (10), VT (11),  FF
       (12),  CR  (13),  and  space (32). Notice that this list
       includes the VT character (code 11). This makes  "space"
       different  to  \s,  which  does not include VT (for Perl
       compatibility).

       The name "word" is a Perl extension, and  "blank"  is  a
       GNU  extension  from Perl 5.8. Another Perl extension is
       negation, which is indicated by a ^ character after  the
       colon. For example,

         [12[:^digit:]]

       matches "1", "2", or any non-digit. PCRE (and Perl) also
       recognize the POSIX syntax [.ch.] and [=ch=] where  "ch"
       is  a  "collating element", but these are not supported,
       and an error is given if they are encountered.

       In UTF-8 mode, characters with values greater  than  128
       do not match any of the POSIX character classes.

VERTICAL BAR

       Vertical bar characters are used to separate alternative
       patterns. For example, the pattern

         gilbert|sullivan

       matches either "gilbert" or "sullivan".  Any  number  of
       alternatives  may  appear,  and  an empty alternative is
       permitted (matching the  empty  string).   The  matching
       process  tries  each  alternative  in turn, from left to
       right, and the first one that succeeds is used.  If  the
       alternatives  are  within  a subpattern (defined below),
       "succeeds" means matching the rest of the  main  pattern
       as well as the alternative in the subpattern.

INTERNAL OPTION SETTING

       The   settings  of  the  PCRE_CASELESS,  PCRE_MULTILINE,
       PCRE_DOTALL, and PCRE_EXTENDED options  can  be  changed
       from  within  the  pattern  by a sequence of Perl option
       letters enclosed between "(?" and ")". The  option  let-
       ters are

         i  for PCRE_CASELESS
         m  for PCRE_MULTILINE
         s  for PCRE_DOTALL
         x  for PCRE_EXTENDED

       For example, (?im) sets caseless, multiline matching. It
       is also possible to unset these options by preceding the
       letter  with a hyphen, and a combined setting and unset-
       ting such as  (?im-sx),  which  sets  PCRE_CASELESS  and
       PCRE_MULTILINE    while    unsetting   PCRE_DOTALL   and
       PCRE_EXTENDED, is also permitted. If  a  letter  appears
       both before and after the hyphen, the option is unset.

       When  an option change occurs at top level (that is, not
       inside subpattern parentheses), the  change  applies  to
       the  remainder  of  the  pattern  that  follows.  If the
       change is placed right at the start of a  pattern,  PCRE
       extracts  it into the global options (and it will there-
       fore show up in data extracted  by  the  pcre_fullinfo()
       function).

       An  option  change within a subpattern affects only that
       part of the current pattern that follows it, so

         (a(?i)b)c

       matches abc and  aBc  and  no  other  strings  (assuming
       PCRE_CASELESS  is not used).  By this means, options can
       be made to have different settings in different parts of
       the  pattern.  Any  changes  made  in one alternative do
       carry on into subsequent branches within the  same  sub-
       pattern. For example,

         (a(?i)b|c)

       matches  "ab",  "aB",  "c",  and  "C",  even though when
       matching "C" the first branch is  abandoned  before  the
       option  setting.  This  is because the effects of option
       settings happen at compile time.  There  would  be  some
       very weird behaviour otherwise.

       The  PCRE-specific  options PCRE_UNGREEDY and PCRE_EXTRA
       can be changed in the same way  as  the  Perl-compatible
       options  by  using  the characters U and X respectively.
       The (?X) flag setting is special in that it must  always
       occur  earlier in the pattern than any of the additional
       features it turns on, even when it is at top  level.  It
       is best to put it at the start.

SUBPATTERNS

       Subpatterns  are  delimited by parentheses (round brack-
       ets), which can be nested.  Turning part  of  a  pattern
       into a subpattern does two things:

       1.  It localizes a set of alternatives. For example, the
       pattern

         cat(aract|erpillar|)

       matches one of the words "cat", "cataract",  or  "cater-
       pillar".   Without   the  parentheses,  it  would  match
       "cataract", "erpillar" or the empty string.

       2. It sets up the subpattern as a capturing  subpattern.
       This  means  that,  when the whole pattern matches, that
       portion of the subject string that matched  the  subpat-
       tern  is passed back to the caller via the ovector argu-
       ment of pcre_exec().  Opening  parentheses  are  counted
       from  left  to right (starting from 1) to obtain numbers
       for the capturing subpatterns.

       For example, if the string "the  red  king"  is  matched
       against the pattern

         the ((red|white) (king|queen))

       the  captured  substrings  are  "red  king",  "red", and
       "king", and are numbered 1, 2, and 3, respectively.

       The fact that plain parentheses fulfil two functions  is
       not always helpful.  There are often times when a group-
       ing subpattern is required without a capturing  require-
       ment.  If  an opening parenthesis is followed by a ques-
       tion mark and a colon, the subpattern does  not  do  any
       capturing,  and is not counted when computing the number
       of any subsequent capturing subpatterns. For example, if
       the string "the white queen" is matched against the pat-
       tern

         the ((?:red|white) (king|queen))

       the captured substrings are "white queen"  and  "queen",
       and  are numbered 1 and 2. The maximum number of captur-
       ing subpatterns is 65535, and the maximum depth of nest-
       ing  of  all subpatterns, both capturing and non-captur-
       ing, is 200.

       As a convenient shorthand, if any  option  settings  are
       required at the start of a non-capturing subpattern, the
       option letters may appear between the "?" and  the  ":".
       Thus the two patterns

         (?i:saturday|sunday)
         (?:(?i)saturday|sunday)

       match  exactly the same set of strings. Because alterna-
       tive branches are tried from left to right, and  options
       are  not  reset  until  the  end  of  the  subpattern is
       reached, an option setting in  one  branch  does  affect
       subsequent  branches,  so the above patterns match "SUN-
       DAY" as well as "Saturday".

NAMED SUBPATTERNS

       Identifying capturing parentheses by number  is  simple,
       but  it can be very hard to keep track of the numbers in
       complicated  regular  expressions.  Furthermore,  if  an
       expression  is modified, the numbers may change. To help
       with this difficulty, PCRE supports the naming  of  sub-
       patterns,  something  that  Perl  does  not provide. The
       Python syntax (?P<name>...) is used.  Names  consist  of
       alphanumeric  characters  and  underscores,  and must be
       unique within a pattern.

       Named capturing parentheses are still allocated  numbers
       as  well  as names. The PCRE API provides function calls
       for extracting the name-to-number translation table from
       a compiled pattern. There is also a convenience function
       for extracting a captured substring by name. For further
       details see the pcreapi documentation.

REPETITION

       Repetition is specified by quantifiers, which can follow
       any of the following items:

         a literal data character
         the . metacharacter
         the \C escape sequence
         the \X escape sequence (in  UTF-8  mode  with  Unicode
       properties)
         an escape such as \d that matches a single character
         a character class
         a back reference (see next section)
         a parenthesized subpattern (unless it is an assertion)

       The general repetition quantifier  specifies  a  minimum
       and  maximum  number of permitted matches, by giving the
       two numbers in curly brackets (braces), separated  by  a
       comma.  The  numbers  must  be  less than 65536, and the
       first must be less than or  equal  to  the  second.  For
       example:

         z{2,4}

       matches  "zz",  "zzz", or "zzzz". A closing brace on its
       own is not a special character. If the second number  is
       omitted,  but  the  comma  is present, there is no upper
       limit; if the second number and the comma are both omit-
       ted,   the  quantifier  specifies  an  exact  number  of
       required matches. Thus

         [aeiou]{3,}

       matches at least 3 successive vowels, but may match many
       more, while

         \d{8}

       matches  exactly 8 digits. An opening curly bracket that
       appears in a position where a quantifier is not allowed,
       or  one  that does not match the syntax of a quantifier,
       is taken as a literal character. For  example,  {,6}  is
       not  a  quantifier, but a literal string of four charac-
       ters.

       In UTF-8 mode, quantifiers  apply  to  UTF-8  characters
       rather  than  to  individual  bytes.  Thus, for example,
       \x{100}{2} matches two UTF-8 characters, each  of  which
       is  represented  by a two-byte sequence. Similarly, when
       Unicode property support  is  available,  \X{3}  matches
       three  Unicode  extended sequences, each of which may be
       several  bytes  long  (and  they  may  be  of  different
       lengths).

       The  quantifier {0} is permitted, causing the expression
       to behave as if the previous  item  and  the  quantifier
       were not present.

       For convenience (and historical compatibility) the three
       most common quantifiers have single-character  abbrevia-
       tions:

         *    is equivalent to {0,}
         +    is equivalent to {1,}
         ?    is equivalent to {0,1}

       It  is possible to construct infinite loops by following
       a subpattern that can match no characters with a quanti-
       fier that has no upper limit, for example:

         (a?)*

       Earlier  versions of Perl and PCRE used to give an error
       at compile time  for  such  patterns.  However,  because
       there  are cases where this can be useful, such patterns
       are now accepted, but if any repetition of  the  subpat-
       tern  does  in  fact  match  no  characters, the loop is
       forcibly broken.

       By default, the quantifiers are "greedy", that is,  they
       match  as  much as possible (up to the maximum number of
       permitted times), without causing the rest of  the  pat-
       tern  to  fail.  The classic example of where this gives
       problems is in trying to match comments in  C  programs.
       These  appear  between /* and */ and within the comment,
       individual * and / characters may appear. An attempt  to
       match C comments by applying the pattern

         /\*.*\*/

       to the string

         /* first comment */  not comment  /* second comment */

       fails, because it matches the entire string owing to the
       greediness of the .*  item.

       However, if a quantifier is followed by a question mark,
       it ceases to be greedy, and instead matches the  minimum
       number of times possible, so the pattern

         /\*.*?\*/

       does the right thing with the C comments. The meaning of
       the various quantifiers is not otherwise  changed,  just
       the  preferred  number  of matches.  Do not confuse this
       use of question mark with its use as a quantifier in its
       own  right.  Because  it  has two uses, it can sometimes
       appear doubled, as in

         \d??\d

       which matches one digit by preference, but can match two
       if that is the only way the rest of the pattern matches.

       If the PCRE_UNGREEDY option is set (an option  which  is
       not  available  in Perl), the quantifiers are not greedy
       by default, but individual ones can be  made  greedy  by
       following  them with a question mark. In other words, it
       inverts the default behaviour.

       When a parenthesized subpattern  is  quantified  with  a
       minimum  repeat  count  that is greater than 1 or with a
       limited maximum, more memory is required  for  the  com-
       piled  pattern, in proportion to the size of the minimum
       or maximum.

       If a pattern starts with .* or .{0,} and the PCRE_DOTALL
       option  (equivalent  to Perl's /s) is set, thus allowing
       the . to  match  newlines,  the  pattern  is  implicitly
       anchored, because whatever follows will be tried against
       every character position in the subject string, so there
       is  no  point  in  retrying  the  overall  match  at any
       position after the first. PCRE normally  treats  such  a
       pattern as though it were preceded by \A.

       In  cases where it is known that the subject string con-
       tains no newlines, it is worth  setting  PCRE_DOTALL  in
       order  to  obtain  this  optimization,  or alternatively
       using ^ to indicate anchoring explicitly.

       However, there is one situation where  the  optimization
       cannot be used. When .*  is inside capturing parentheses
       that are the subject of a backreference elsewhere in the
       pattern,  a match at the start may fail, and a later one
       succeed. Consider, for example:

         (.*)abc\1

       If the subject is "xyz123abc123" the match point is  the
       fourth character. For this reason, such a pattern is not
       implicitly anchored.

       When a capturing subpattern is repeated, the value  cap-
       tured is the substring that matched the final iteration.
       For example, after

         (tweedle[dume]{3}\s*)+

       has matched "tweedledum tweedledee"  the  value  of  the
       captured  substring  is  "tweedledee". However, if there
       are nested capturing subpatterns, the corresponding cap-
       tured  values  may have been set in previous iterations.
       For example, after

         /(a|(b))+/

       matches "aba" the value of the second captured substring
       is "b".

ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS

       With  both maximizing and minimizing repetition, failure
       of what follows normally causes the repeated item to  be
       re-evaluated  to  see  if  a different number of repeats
       allows the rest of the pattern to match. Sometimes it is
       useful  to  prevent this, either to change the nature of
       the match, or to cause it fail earlier than it otherwise
       might,  when the author of the pattern knows there is no
       point in carrying on.

       Consider, for example, the pattern \d+foo  when  applied
       to the subject line

         123456bar

       After  matching  all  6 digits and then failing to match
       "foo", the normal action of the matcher is to try  again
       with  only 5 digits matching the \d+ item, and then with
       4, and so on, before ultimately failing. "Atomic  group-
       ing"  (a term taken from Jeffrey Friedl's book) provides
       the means for specifying  that  once  a  subpattern  has
       matched, it is not to be re-evaluated in this way.

       If  we use atomic grouping for the previous example, the
       matcher would give up immediately on  failing  to  match
       "foo"  the first time. The notation is a kind of special
       parenthesis, starting with (?> as in this example:

         (?>\d+)foo

       This kind of parenthesis "locks up"  the   part  of  the
       pattern  it  contains once it has matched, and a failure
       further into the pattern is prevented from  backtracking
       into  it.  Backtracking  past it to previous items, how-
       ever, works as normal.

       An alternative description is that a subpattern of  this
       type  matches the string of characters that an identical
       standalone pattern would match, if anchored at the  cur-
       rent point in the subject string.

       Atomic  grouping  subpatterns  are not capturing subpat-
       terns. Simple cases such as the  above  example  can  be
       thought  of  as  a  maximizing  repeat that must swallow
       everything it can. So, while both \d+ and \d+? are  pre-
       pared to adjust the number of digits they match in order
       to make the rest of the pattern match, (?>\d+) can  only
       match an entire sequence of digits.

       Atomic groups in general can of course contain arbitrar-
       ily complicated subpatterns, and can be nested. However,
       when the subpattern for an atomic group is just a single
       repeated item, as in the example above, a simpler  nota-
       tion, called a "possessive quantifier" can be used. This
       consists of an additional + character following a  quan-
       tifier. Using this notation, the previous example can be
       rewritten as

         \d++foo

       Possessive quantifiers are always greedy; the setting of
       the  PCRE_UNGREEDY  option is ignored. They are a conve-
       nient notation for the simpler forms  of  atomic  group.
       However,  there  is no difference in the meaning or pro-
       cessing of a possessive quantifier  and  the  equivalent
       atomic group.

       The  possessive quantifier syntax is an extension to the
       Perl syntax. It originates in Sun's Java package.

       When a pattern contains an  unlimited  repeat  inside  a
       subpattern that can itself be repeated an unlimited num-
       ber of times, the use of an atomic group is the only way
       to  avoid  some  failing matches taking a very long time
       indeed. The pattern

         (\D+|<\d+>)*[!?]

       matches an unlimited number of  substrings  that  either
       consist  of  non-digits,  or digits enclosed in <>, fol-
       lowed by either  !  or  ?.  When  it  matches,  it  runs
       quickly. However, if it is applied to

         aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

       it  takes  a long time before reporting failure. This is
       because the string can be divided between  the  internal
       \D+  repeat  and the external * repeat in a large number
       of ways, and all have to be  tried.  (The  example  uses
       [!?]  rather than a single character at the end, because
       both PCRE and Perl have an optimization that allows  for
       fast  failure  when  a  single  character  is used. They
       remember the last single character that is required  for
       a  match,  and  fail  early  if it is not present in the
       string.) If the pattern is changed so that  it  uses  an
       atomic group, like this:

         ((?>\D+)|<\d+>)*[!?]

       sequences  of  non-digits  cannot be broken, and failure
       happens quickly.

BACK REFERENCES

       Outside a character class, a  backslash  followed  by  a
       digit  greater than 0 (and possibly further digits) is a
       back reference to a capturing subpattern  earlier  (that
       is,  to  its  left)  in the pattern, provided there have
       been that many previous capturing left parentheses.

       However, if the decimal number following  the  backslash
       is less than 10, it is always taken as a back reference,
       and causes an error only if there are not that many cap-
       turing  left parentheses in the entire pattern. In other
       words, the parentheses that are referenced need  not  be
       to  the  left of the reference for numbers less than 10.
       See the subsection  entitled  "Non-printing  characters"
       above for further details of the handling of digits fol-
       lowing a backslash.

       A back reference matches whatever actually  matched  the
       capturing  subpattern  in  the  current  subject string,
       rather than anything matching the subpattern itself (see
       "Subpatterns  as  subroutines"  below for a way of doing
       that). So the pattern

         (sens|respons)e and \1ibility

       matches  "sense  and  sensibility"  and  "response   and
       responsibility",  but not "sense and responsibility". If
       caseful matching is in force at the  time  of  the  back
       reference, the case of letters is relevant. For example,

         ((?i)rah)\s+\1

       matches "rah rah" and "RAH RAH", but not "RAH rah", even
       though  the  original  capturing  subpattern  is matched
       caselessly.

       Back references to named subpatterns use the Python syn-
       tax  (?P=name).  We  could  rewrite the above example as
       follows:

         (?<p1>(?i)rah)\s+(?P=p1)

       There may be more than one back reference  to  the  same
       subpattern.  If  a subpattern has not actually been used
       in a particular match, any back references to it  always
       fail. For example, the pattern

         (a|(bc))\2

       always fails if it starts to match "a" rather than "bc".
       Because there may be many  capturing  parentheses  in  a
       pattern, all digits following the backslash are taken as
       part of a potential back reference number. If  the  pat-
       tern  continues  with  a digit character, some delimiter
       must be used to terminate the  back  reference.  If  the
       PCRE_EXTENDED  option  is  set,  this can be whitespace.
       Otherwise an empty comment (see "Comments" below) can be
       used.

       A  back  reference that occurs inside the parentheses to
       which it refers fails when the subpattern is first used,
       so,  for  example,  (a\1)  never matches.  However, such
       references can be useful  inside  repeated  subpatterns.
       For example, the pattern

         (a|b\1)+

       matches  any  number  of  "a"s and also "aba", "ababbaa"
       etc. At each  iteration  of  the  subpattern,  the  back
       reference  matches the character string corresponding to
       the previous iteration. In order for this to  work,  the
       pattern  must  be such that the first iteration does not
       need to match the back reference. This can be done using
       alternation, as in the example above, or by a quantifier
       with a minimum of zero.

ASSERTIONS

       An assertion is a test on the  characters  following  or
       preceding the current matching point that does not actu-
       ally consume any characters. The simple assertions coded
       as  \b, \B, \A, \G, \Z, \z, ^ and $ are described above.

       More complicated assertions are  coded  as  subpatterns.
       There  are  two kinds: those that look ahead of the cur-
       rent position in the subject string, and those that look
       behind  it.  An  assertion  subpattern is matched in the
       normal way, except that it does not  cause  the  current
       matching position to be changed.

       Assertion subpatterns are not capturing subpatterns, and
       may not be repeated, because it makes no sense to assert
       the  same  thing several times. If any kind of assertion
       contains capturing  subpatterns  within  it,  these  are
       counted for the purposes of numbering the capturing sub-
       patterns in the whole pattern.  However, substring  cap-
       turing  is  carried  out  only  for positive assertions,
       because it does not make sense for negative  assertions.

   Lookahead assertions

       Lookahead  assertions start with (?= for positive asser-
       tions and (?! for negative assertions. For example,

         \w+(?=;)

       matches a word followed by a  semicolon,  but  does  not
       include the semicolon in the match, and

         foo(?!bar)

       matches  any occurrence of "foo" that is not followed by
       "bar". Note that the apparently similar pattern

         (?!foo)bar

       does not find an occurrence of "bar" that is preceded by
       something  other  than "foo"; it finds any occurrence of
       "bar"  whatsoever,  because  the  assertion  (?!foo)  is
       always  true when the next three characters are "bar". A
       lookbehind assertion is  needed  to  achieve  the  other
       effect.

       If you want to force a matching failure at some point in
       a pattern, the most convenient way to do it is with (?!)
       because  an empty string always matches, so an assertion
       that requires there not  to  be  an  empty  string  must
       always fail.

   Lookbehind assertions

       Lookbehind  assertions  start  with  (?<=  for  positive
       assertions and (?<! for negative assertions.  For  exam-
       ple,

         (?<!foo)bar

       does find an occurrence of "bar" that is not preceded by
       "foo".  The  contents  of  a  lookbehind  assertion  are
       restricted  such  that  all  the strings it matches must
       have a fixed  length.  However,  if  there  are  several
       alternatives,  they  do  not  all  have to have the same
       fixed length. Thus

         (?<=bullock|donkey)

       is permitted, but

         (?<!dogs?|cats?)

       causes an error at compile  time.  Branches  that  match
       different  length  strings are permitted only at the top
       level of a lookbehind assertion. This  is  an  extension
       compared  with  Perl  (at least for 5.8), which requires
       all branches to match the  same  length  of  string.  An
       assertion such as

         (?<=ab(c|de))

       is  not  permitted,  because its single top-level branch
       can match two different lengths, but it is acceptable if
       rewritten to use two top-level branches:

         (?<=abc|abde)

       The implementation of lookbehind assertions is, for each
       alternative, to temporarily move  the  current  position
       back  by the fixed width and then try to match. If there
       are insufficient characters before the current position,
       the match is deemed to fail.

       PCRE  does not allow the \C escape (which matches a sin-
       gle byte in UTF-8 mode) to appear in  lookbehind  asser-
       tions,  because  it makes it impossible to calculate the
       length of the lookbehind. The \X escape, which can match
       different numbers of bytes, is also not permitted.

       Atomic groups can be used in conjunction with lookbehind
       assertions to specify efficient matching at the  end  of
       the subject string. Consider a simple pattern such as

         abcd$

       when  applied  to  a  long  string  that does not match.
       Because matching proceeds from left to right, PCRE  will
       look  for  each  "a" in the subject and then see if what
       follows matches the rest of the pattern. If the  pattern
       is specified as

         ^.*abcd$

       the  initial  .* matches the entire string at first, but
       when this fails (because there is no following "a"),  it
       backtracks to match all but the last character, then all
       but the last two characters, and so on. Once  again  the
       search  for  "a" covers the entire string, from right to
       left, so we are no better off. However, if  the  pattern
       is written as

         ^(?>.*)(?<=abcd)

       or,  equivalently,  using the possessive quantifier syn-
       tax,

         ^.*+(?<=abcd)

       there can be no backtracking for the  .*  item;  it  can
       match  only the entire string. The subsequent lookbehind
       assertion  does  a  single  test  on   the   last   four
       characters.  If  it  fails, the match fails immediately.
       For long strings, this approach makes a significant dif-
       ference to the processing time.

   Using multiple assertions

       Several  assertions  (of  any sort) may occur in succes-
       sion. For example,

         (?<=\d{3})(?<!999)foo

       matches "foo" preceded by  three  digits  that  are  not
       "999".  Notice  that  each  of the assertions is applied
       independently at the same point in the  subject  string.
       First  there  is a check that the previous three charac-
       ters are all digits, and then there is a check that  the
       same  three characters are not "999".  This pattern does
       not match "foo" preceded by six characters, the first of
       which  are  digits  and  the last three of which are not
       "999". For example, it doesn't match "123abcfoo". A pat-
       tern to do that is

         (?<=\d{3}...)(?<!999)foo

       This time the first assertion looks at the preceding six
       characters, checking that the first  three  are  digits,
       and  then the second assertion checks that the preceding
       three characters are not "999".

       Assertions can be nested in any combination.  For  exam-
       ple,

         (?<=(?<!foo)bar)baz

       matches an occurrence of "baz" that is preceded by "bar"
       which in turn is not preceded by "foo", while

         (?<=\d{3}(?!999)...)foo

       is another pattern that matches "foo" preceded by  three
       digits and any three characters that are not "999".

CONDITIONAL SUBPATTERNS

       It  is  possible to cause the matching process to obey a
       subpattern conditionally or to choose between two alter-
       native subpatterns, depending on the result of an asser-
       tion, or whether a previous capturing subpattern matched
       or not. The two possible forms of conditional subpattern
       are

         (?(condition)yes-pattern)
         (?(condition)yes-pattern|no-pattern)

       If the condition is satisfied, the yes-pattern is  used;
       otherwise  the no-pattern (if present) is used. If there
       are more than two alternatives in the subpattern, a com-
       pile-time error occurs.

       There  are three kinds of condition. If the text between
       the parentheses consists of a sequence  of  digits,  the
       condition  is  satisfied  if the capturing subpattern of
       that number has previously matched. The number  must  be
       greater than zero. Consider the following pattern, which
       contains non-significant white space  to  make  it  more
       readable (assume the PCRE_EXTENDED option) and to divide
       it into three parts for ease of discussion:

         ( \( )?    [^()]+    (?(1) \) )

       The first part matches an optional opening  parenthesis,
       and  if  that character is present, sets it as the first
       captured substring. The second part matches one or  more
       characters that are not parentheses. The third part is a
       conditional subpattern that tests whether the first  set
       of  parentheses matched or not. If they did, that is, if
       subject started with an opening parenthesis, the  condi-
       tion  is  true, and so the yes-pattern is executed and a
       closing parenthesis is required.  Otherwise,  since  no-
       pattern  is not present, the subpattern matches nothing.
       In other words, this pattern matches a sequence of  non-
       parentheses, optionally enclosed in parentheses.

       If the condition is the string (R), it is satisfied if a
       recursive call to the pattern  or  subpattern  has  been
       made. At "top level", the condition is false.  This is a
       PCRE extension. Recursive patterns are described in  the
       next section.

       If  the condition is not a sequence of digits or (R), it
       must be an assertion.  This may be a positive  or  nega-
       tive  lookahead  or  lookbehind assertion. Consider this
       pattern, again containing non-significant  white  space,
       and with the two alternatives on the second line:

         (?(?=[^a-z]*[a-z])
         \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )

       The  condition  is  a  positive lookahead assertion that
       matches an optional sequence of non-letters followed  by
       a  letter.  In other words, it tests for the presence of
       at least one letter in  the  subject.  If  a  letter  is
       found, the subject is matched against the first alterna-
       tive; otherwise it is matched against the  second.  This
       pattern  matches strings in one of the two forms dd-aaa-
       dd or dd-dd-dd, where aaa are letters and dd are digits.

COMMENTS

       The  sequence (?# marks the start of a comment that con-
       tinues up to the next closing parenthesis. Nested paren-
       theses  are not permitted. The characters that make up a
       comment play no part in the pattern matching at all.

       If the PCRE_EXTENDED option is set, an unescaped # char-
       acter  outside  a  character  class introduces a comment
       that continues up to the next newline character  in  the
       pattern.

RECURSIVE PATTERNS

       Consider  the  problem of matching a string in parenthe-
       ses, allowing for unlimited nested parentheses.  Without
       the  use  of  recursion, the best that can be done is to
       use a pattern that matches up to  some  fixed  depth  of
       nesting. It is not possible to handle an arbitrary nest-
       ing depth. Perl provides a facility that allows  regular
       expressions  to  recurse (amongst other things). It does
       this by interpolating Perl code in the expression at run
       time, and the code can refer to the expression itself. A
       Perl pattern to solve the  parentheses  problem  can  be
       created like this:

         $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;

       The  (?p{...})  item interpolates Perl code at run time,
       and in this case refers recursively to  the  pattern  in
       which  it  appears.  Obviously,  PCRE cannot support the
       interpolation of Perl code. Instead,  it  supports  some
       special  syntax for recursion of the entire pattern, and
       also for individual subpattern recursion.

       The special item that consists of (? followed by a  num-
       ber  greater  than  zero  and a closing parenthesis is a
       recursive call of the subpattern of  the  given  number,
       provided that it occurs inside that subpattern. (If not,
       it is a "subroutine" call, which  is  described  in  the
       next section.) The special item (?R) is a recursive call
       of the entire regular expression.

       For example, this PCRE pattern solves the nested  paren-
       theses  problem  (assume the PCRE_EXTENDED option is set
       so that white space is ignored):

         \( ( (?>[^()]+) | (?R) )* \)

       First it matches an opening parenthesis. Then it matches
       any  number of substrings which can either be a sequence
       of non-parentheses, or a recursive match of the  pattern
       itself  (that  is  a correctly parenthesized substring).
       Finally there is a closing parenthesis.

       If this were part of a larger  pattern,  you  would  not
       want to recurse the entire pattern, so instead you could
       use this:

         ( \( ( (?>[^()]+) | (?1) )* \) )

       We have put the pattern into parentheses, and caused the
       recursion to refer to them instead of the whole pattern.
       In a larger pattern, keeping track of  parenthesis  num-
       bers  can  be  tricky.  It may be more convenient to use
       named  parentheses  instead.   For   this,   PCRE   uses
       (?P>name),  which  is  an extension to the Python syntax
       that PCRE uses for named parentheses (Perl does not pro-
       vide  named  parentheses).  We  could  rewrite the above
       example as follows:

         (?P<pn> \( ( (?>[^()]+) | (?P>pn) )* \) )

       This particular example pattern contains  nested  unlim-
       ited  repeats,  and  so  the  use of atomic grouping for
       matching strings of non-parentheses  is  important  when
       applying  the  pattern to strings that do not match. For
       example, when this pattern is applied to

         (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()

       it  yields "no match" quickly. However, if atomic group-
       ing is not used, the match runs for  a  very  long  time
       indeed  because  there  are so many different ways the +
       and * repeats can carve up the subject, and all have  to
       be tested before failure can be reported.

       At  the end of a match, the values set for any capturing
       subpatterns are those from the outermost  level  of  the
       recursion  at which the subpattern value is set.  If you
       want to obtain intermediate values, a  callout  function
       can  be  used  (see the next section and the pcrecallout
       documentation). If the pattern above is matched against

         (ab(cd)ef)

       the value for the capturing parentheses is  "ef",  which
       is  the  last  value taken on at the top level. If addi-
       tional parentheses are added, giving

         \( ( ( (?>[^()]+) | (?R) )* ) \)
            ^                        ^
            ^                        ^

       the string they capture is "ab(cd)ef", the  contents  of
       the  top  level  parentheses.  If there are more than 15
       capturing parentheses in a pattern, PCRE has  to  obtain
       extra  memory to store data during a recursion, which it
       does by using  pcre_malloc,  freeing  it  via  pcre_free
       afterwards.  If  no  memory  can  be obtained, the match
       fails with the PCRE_ERROR_NOMEMORY error.

       Do not confuse the (?R) item  with  the  condition  (R),
       which tests for recursion.  Consider this pattern, which
       matches text in angle brackets, allowing  for  arbitrary
       nesting.  Only  digits  are  allowed  in nested brackets
       (that is, when recursing), whereas  any  characters  are
       permitted at the outer level.

         < (?: (?(R) \d++  | [^<>]*+) | (?R)) * >

       In  this  pattern,  (?(R)  is the start of a conditional
       subpattern, with  two  different  alternatives  for  the
       recursive  and non-recursive cases. The (?R) item is the
       actual recursive call.

SUBPATTERNS AS SUBROUTINES

       If the  syntax  for  a  recursive  subpattern  reference
       (either by number or by name) is used outside the paren-
       theses to which it refers, it operates like a subroutine
       in  a  programming  language. An earlier example pointed
       out that the pattern

         (sens|respons)e and \1ibility

       matches  "sense  and  sensibility"  and  "response   and
       responsibility",  but not "sense and responsibility". If
       instead the pattern

         (sens|respons)e and (?1)ibility

       is used, it does match  "sense  and  responsibility"  as
       well  as  the  other  two strings. Such references must,
       however, follow the subpattern to which they refer.

CALLOUTS

       Perl has a feature whereby using the  sequence  (?{...})
       causes arbitrary Perl code to be obeyed in the middle of
       matching a regular expression. This makes  it  possible,
       amongst  other  things,  to extract different substrings
       that match the same pair of parentheses when there is  a
       repetition.

       PCRE provides a similar feature, but of course it cannot
       obey arbitrary Perl code. The feature is  called  "call-
       out".  The  caller of PCRE provides an external function
       by putting  its  entry  point  in  the  global  variable
       pcre_callout.   By default, this variable contains NULL,
       which disables all calling out.

       Within a regular expression, (?C) indicates  the  points
       at  which  the external function is to be called. If you
       want to identify different callout points, you can put a
       number  less  than  256  after the letter C. The default
       value is zero.  For example, this pattern has two  call-
       out points:

         (?C1)abc(?C2)def

       If  the  PCRE_AUTO_CALLOUT  flag  is passed to pcre_com-
       pile(), callouts are automatically installed before each
       item in the pattern. They are all numbered 255.

       During  matching, when PCRE reaches a callout point (and
       pcre_callout is set), the external function  is  called.
       It is provided with the number of the callout, the posi-
       tion in the pattern, and, optionally, one item  of  data
       originally  supplied  by  the caller of pcre_exec(). The
       callout function may cause matching to proceed, to back-
       track,  or to fail altogether. A complete description of
       the interface to the callout function is  given  in  the
       pcrecallout documentation.

Last updated: 28 February 2005
Copyright (c) 1997-2005 University of Cambridge.



                                                 PCREPATTERN(3)
