Linux cli command pcrepattern
NAME
pcrepattern - Perl-compatible regular expressions
PCRE REGULAR EXPRESSION DETAILS
The syntax and semantics of the regular expressions that are supported by PCRE are described in detail below. There is a quick-reference syntax summary in the pcresyntax page. PCRE tries to match Perl syntax and semantics as closely as it can. PCRE also supports some alternative regular expression syntax (which does not conflict with the Perl syntax) in order to provide some compatibility with regular expressions in Python, .NET, and Oniguruma.
Perl’s regular expressions are described in its own documentation, and regular expressions in general are covered in a number of books, some of which have copious examples. Jeffrey Friedl’s “Mastering Regular Expressions”, published by O’Reilly, covers regular expressions in great detail. This description of PCRE’s regular expressions is intended as reference material.
This document discusses the patterns that are supported by PCRE when one of its main matching functions, pcre_exec() (8-bit) or pcre[16|32]_exec() (16- or 32-bit), is used. PCRE also has alternative matching functions, pcre_dfa_exec() and pcre[16|32]_dfa_exec(), which match using a different algorithm that is not Perl-compatible. Some of the features discussed below are not available when DFA matching is used. The advantages and disadvantages of the alternative functions, and how they differ from the normal functions, are discussed in the pcrematching page.
SPECIAL START-OF-PATTERN ITEMS
A number of options that can be passed to pcre_compile() can also be set by special items at the start of a pattern. These are not Perl-compatible, but are provided to make these options accessible to pattern writers who are not able to change the program that processes the pattern. Any number of these items may appear, but they must all be together right at the start of the pattern string, and the letters must be in upper case.
UTF support
The original operation of PCRE was on strings of one-byte characters. However, there is now also support for UTF-8 strings in the original library, an extra library that supports 16-bit and UTF-16 character strings, and a third library that supports 32-bit and UTF-32 character strings. To use these features, PCRE must be built to include appropriate support. When using UTF strings you must either call the compiling function with the PCRE_UTF8, PCRE_UTF16, or PCRE_UTF32 option, or the pattern must start with one of these special sequences:
  (*UTF8)
  (*UTF16)
  (*UTF32)
  (*UTF)
(*UTF) is a generic sequence that can be used with any of the libraries. Starting a pattern with such a sequence is equivalent to setting the relevant option. How setting a UTF mode affects pattern matching is mentioned in several places below. There is also a summary of features in the pcreunicode page.
Some applications that allow their users to supply patterns may wish to restrict them to non-UTF data for security reasons. If the PCRE_NEVER_UTF option is set at compile time, (*UTF) etc. are not allowed, and their appearance causes an error.
Unicode property support
Another special sequence that may appear at the start of a pattern is (*UCP). This has the same effect as setting the PCRE_UCP option: it causes sequences such as \d and \w to use Unicode properties to determine character types, instead of recognizing only characters with codes less than 128 via a lookup table.
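The difference between the two behaviours can be illustrated with an analogue in Python's re module. This is not PCRE itself: it assumes that Python's re.ASCII flag roughly corresponds to PCRE's default table-based behaviour, and that Python's default handling of str patterns roughly corresponds to (*UCP). The names below are chosen for this sketch only.

```python
import re

# U+0663 ARABIC-INDIC DIGIT THREE: a decimal digit with a code above 127.
ARABIC_INDIC_THREE = "\u0663"

# Analogue of PCRE without UCP: only ASCII characters count as digits/word chars.
ascii_digit = re.compile(r"\d", re.ASCII)
ascii_word = re.compile(r"\w", re.ASCII)

# Analogue of (*UCP): character types are decided by Unicode properties.
unicode_digit = re.compile(r"\d")
unicode_word = re.compile(r"\w")

for label, digit_re in (("ASCII tables", ascii_digit), ("Unicode properties", unicode_digit)):
    matched = digit_re.fullmatch(ARABIC_INDIC_THREE) is not None
    print(f"\\d with {label}: matches U+0663? {matched}")
```

In PCRE the equivalent switch is made with the PCRE_UCP option or a leading (*UCP), and requires a build with Unicode property support.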
Disabling auto-possessification
If a pattern starts with (*NO_AUTO_POSSESS), it has the same effect as setting the PCRE_NO_AUTO_POSSESS option at compile time. This stops PCRE from making quantifiers possessive when what follows cannot match the repeated item. For example, by default a+b is treated as a++b. For more details, see the pcreapi documentation.
Disabling start-up optimizations
If a pattern starts with (*NO_START_OPT), it has the same effect as setting the PCRE_NO_START_OPTIMIZE option either at compile or matching time. This disables several optimizations for quickly reaching “no match” results. For more details, see the pcreapi documentation.
Newline conventions
PCRE supports five different conventions for indicating line breaks in strings: a single CR (carriage return) character, a single LF (linefeed) character, the two-character sequence CRLF, any of the three preceding, or any Unicode newline sequence. The pcreapi page has further discussion about newlines, and shows how to set the newline convention in the options arguments for the compiling and matching functions.
It is also possible to specify a newline convention by starting a pattern string with one of the following five sequences:
  (*CR)        carriage return
  (*LF)        linefeed
  (*CRLF)      carriage return, followed by linefeed
  (*ANYCRLF)   any of the three above
  (*ANY)       all Unicode newline sequences
These override the default and the options given to the compiling function. For example, on a Unix system where LF is the default newline sequence, the pattern
(*CR)a.b
changes the convention to CR. That pattern matches “a\nb” because LF is no longer a newline. If more than one of these settings is present, the last one is used.
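The effect on the dot metacharacter can be modelled without PCRE. The sketch below uses Python's re module, which does not understand (*CR), so the dot is written out by hand as a negated class for each convention. The table DOT_FOR_CONVENTION and the helper a_dot_b() are hypothetical names for this illustration, and only the single-character CR and LF conventions are modelled.

```python
import re

# What "." excludes under each single-character newline convention.
# This is a simplified model; CRLF, ANYCRLF and ANY are not covered.
DOT_FOR_CONVENTION = {
    "LF": r"[^\n]",  # default on Unix-like systems
    "CR": r"[^\r]",  # what (*CR) selects
}

def a_dot_b(convention):
    """Build a Python regex that behaves like a.b under the given convention."""
    return re.compile("a" + DOT_FOR_CONVENTION[convention] + "b")

subject = "a\nb"
for convention in ("LF", "CR"):
    matched = a_dot_b(convention).fullmatch(subject) is not None
    print(f"{convention}: a.b matches 'a\\nb'? {matched}")
```

Under the CR model the linefeed is an ordinary character, so the dot can consume it; under the LF model it cannot.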
The newline convention affects where the circumflex and dollar assertions are true. It also affects the interpretation of the dot metacharacter when PCRE_DOTALL is not set, and the behaviour of \N. However, it does not affect what the \R escape sequence matches. By default, this is any Unicode newline sequence, for Perl compatibility. However, this can be changed; see the description of \R in the section entitled “Newline sequences” below. A change of \R setting can be combined with a change of newline convention.
Setting match and recursion limits
The caller of pcre_exec() can set a limit on the number of times the internal match() function is called and on the maximum depth of recursive calls. These facilities are provided to catch runaway matches that are provoked by patterns with huge matching trees (a typical example is a pattern with nested unlimited repeats) and to avoid running out of system stack by too much recursion. When one of these limits is reached, pcre_exec() gives an error return. The limits can also be set by items at the start of the pattern of the form
  (*LIMIT_MATCH=d)
  (*LIMIT_RECURSION=d)
where d is any number of decimal digits. However, the value of the setting must be less than the value set (or defaulted) by the caller of pcre_exec() for it to have any effect. In other words, the pattern writer can lower the limits set by the programmer, but not raise them. If there is more than one setting of one of these limits, the lower value is used.
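The "lower, never raise" rule can be expressed as a small calculation. The following is a minimal sketch of that rule only, not PCRE's implementation: effective_limits() is a hypothetical helper, and for simplicity it assumes the limit items are the only items at the start of the pattern.

```python
import re

# Matches one leading limit item such as (*LIMIT_MATCH=1000).
_LIMIT_ITEM = re.compile(r"\(\*(LIMIT_MATCH|LIMIT_RECURSION)=(\d+)\)")

def effective_limits(pattern, caller_match, caller_recursion):
    """Return (match_limit, recursion_limit, remaining_pattern).

    A setting in the pattern only takes effect when it is lower than the
    current value, so repeated settings resolve to the lowest one and the
    caller's limit can never be exceeded.
    """
    limits = {"LIMIT_MATCH": caller_match, "LIMIT_RECURSION": caller_recursion}
    pos = 0
    while True:
        item = _LIMIT_ITEM.match(pattern, pos)
        if item is None:
            break
        name, value = item.group(1), int(item.group(2))
        limits[name] = min(limits[name], value)
        pos = item.end()
    return limits["LIMIT_MATCH"], limits["LIMIT_RECURSION"], pattern[pos:]

print(effective_limits("(*LIMIT_MATCH=500)(*LIMIT_MATCH=200)a+", 1000, 1000))
print(effective_limits("(*LIMIT_RECURSION=5000)x", 1000, 1000))
```

The first call lowers the match limit to 200; the second leaves the recursion limit at the caller's 1000, because a pattern cannot raise it.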
EBCDIC CHARACTER CODES
PCRE can be compiled to run in an environment that uses EBCDIC as its character code rather than ASCII or Unicode (typically a mainframe system). In the sections below, character code values are ASCII or Unicode; in an EBCDIC environment these characters may have different code values, and there are no code points greater than 255.
CHARACTERS AND METACHARACTERS
A regular expression is a pattern that is matched against a subject string from left to right. Most characters stand for themselves in a pattern, and match the corresponding characters in the subject. As a trivial example, the pattern
The quick brown fox
matches a portion of a subject string that is identical to itself. When caseless matching is specified (the PCRE_CASELESS option), letters are matched independently of case. In a UTF mode, PCRE always understands the concept of case for characters whose values are less than 128, so caseless matching is always possible. For characters with higher values, the concept of case is supported if PCRE is compiled with Unicode property support, but not otherwise. If you want to use caseless matching for characters 128 and above, you must ensure that PCRE is compiled with Unicode property support as well as with UTF support.
The power of regular expressions comes from the ability to include alternatives and repetitions in the pattern. These are encoded in the pattern by the use of metacharacters, which do not stand for themselves but instead are interpreted in some special way.
There are two different sets of metacharacters: those that are recognized anywhere in the pattern except within square brackets, and those that are recognized within square brackets. Outside square brackets, the metacharacters are as follows:
  \      general escape character with several uses
  ^      assert start of string (or line, in multiline mode)
  $      assert end of string (or line, in multiline mode)
  .      match any character except newline (by default)
  [      start character class definition
  |      start of alternative branch
  (      start subpattern
  )      end subpattern
  ?      extends the meaning of (
         also 0 or 1 quantifier
         also quantifier minimizer
  *      0 or more quantifier
  +      1 or more quantifier
         also “possessive quantifier”
  {      start min/max quantifier
Part of a pattern that is in square brackets is called a “character class”. In a character class the only metacharacters are:
  \      general escape character
  ^      negate the class, but only if the first character
  -      indicates character range
  [      POSIX character class (only if followed by POSIX syntax)
  ]      terminates the character class
The following sections describe the use of each of the metacharacters.
BACKSLASH
The backslash character has several uses. Firstly, if it is followed by a character that is not a number or a letter, it takes away any special meaning that character may have. This use of backslash as an escape character applies both inside and outside character classes.
For example, if you want to match a * character, you write \* in the pattern. This escaping action applies whether or not the following character would otherwise be interpreted as a metacharacter, so it is always safe to precede a non-alphanumeric with backslash to specify that it stands for itself. In particular, if you want to match a backslash, you write \\.
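Python's re module follows the same convention for ASCII non-alphanumeric characters, so it can serve as a quick stand-in to see the rule in action. This is an analogue, not a PCRE call; raw strings are used so that Python's own string escaping does not interfere with the pattern text.

```python
import re

# \* matches a literal asterisk instead of acting as a quantifier.
star = re.compile(r"\*")

# \\ matches a single literal backslash.
backslash = re.compile(r"\\")

# Escaping a character that is not a metacharacter is harmless.
percent = re.compile(r"\%")

print(star.search("2*3").group())           # the matched asterisk
print(backslash.fullmatch("\\") is not None)
print(percent.fullmatch("%") is not None)
```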
In a UTF mode, only ASCII numbers and letters have any special meaning after a backslash. All other characters (in particular, those whose codepoints are greater than 127) are treated as literals.
If a pattern is compiled with the PCRE_EXTENDED option, most white space in the pattern (other than in a character class), and characters between a # outside a character class and the next newline, inclusive, are ignored. An escaping backslash can be used to include a white space or # character as part of the pattern.
If you want to remove the special meaning from a sequence of characters, you can do so by putting them between \Q and \E. This is different from Perl in that $ and @ are handled as literals in \Q...\E sequences in PCRE, whereas in Perl, $ and @ cause variable interpolation. Note the following examples:
  Pattern            PCRE matches   Perl matches
  \Qabc$xyz\E        abc$xyz        abc followed by the contents of $xyz
  \Qabc\$xyz\E       abc\$xyz       abc\$xyz
  \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz
The \Q...\E sequence is recognized both inside and outside character classes. An isolated \E that is not preceded by \Q is ignored. If \Q is not followed by \E later in the pattern, the literal interpretation continues to the end of the pattern (that is, \E is assumed at the end). If the isolated \Q is inside a character class, this causes an error, because the character class is not terminated.
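The quoting rules above can be sketched as a small preprocessing step. Python's re module has no \Q...\E, so the hypothetical helper expand_quotes() below rewrites a PCRE-style pattern into an escaped equivalent using re.escape(). It is an illustration of the stated rules under simplifying assumptions (character classes are not treated specially), not a description of how PCRE processes the sequence internally.

```python
import re

def expand_quotes(pattern):
    """Rewrite \\Q...\\E spans as escaped literals (simplified model)."""
    out = []
    i, n = 0, len(pattern)
    while i < n:
        two = pattern[i:i + 2]
        if two == r"\Q":
            end = pattern.find(r"\E", i + 2)
            if end == -1:
                # No closing \E: the literal run extends to the end of the pattern.
                out.append(re.escape(pattern[i + 2:]))
                break
            out.append(re.escape(pattern[i + 2:end]))
            i = end + 2
        elif two == r"\E":
            # An isolated \E that is not preceded by \Q is ignored.
            i += 2
        elif pattern[i] == "\\" and i + 1 < n:
            # Copy any other escape pair unchanged, so "\\Q" is not mistaken for \Q.
            out.append(two)
            i += 2
        else:
            out.append(pattern[i])
            i += 1
    return "".join(out)

# The three rows of the table, checked against the "PCRE matches" column.
for pcre_pattern, subject in (
    (r"\Qabc$xyz\E", "abc$xyz"),
    (r"\Qabc\$xyz\E", r"abc\$xyz"),
    (r"\Qabc\E\$\Qxyz\E", "abc$xyz"),
):
    matched = re.fullmatch(expand_quotes(pcre_pattern), subject) is not None
    print(pcre_pattern, "->", matched)
```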
Non-printing characters
A second use of backslash provides a way of encoding non-printing characters in patterns in a visible manner. There is no restriction on the appearance of non-printing characters, apart from the binary zero that terminates a pattern, but when a pattern is being prepared by text editing, it is often easier to use one of the following escape sequences than the binary character it represents. In an ASCII or Unicode environment, these escapes are as follows:
  \a     alarm, that is, the BEL character (hex 07)