Regular expression syntax

Regular expression matching patterns

The regular expression syntax used in the Internet Utilities is a subset of the regular expression syntax specified in the POSIX standards:

.
Any single character.
?
Zero or one repetition.
+
One or more repetitions.
*
Zero or more repetitions.
[]
Encloses a character set.
()
Encloses a subexpression.
|
Separates alternatives.
\
Quote the next character. This is used to turn metacharacters into literals.
^
Match the beginning of the line.
$
Match the end of the line.
\<
Match the beginning of a word.
\>
Match the end of a word.
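
The operators above can be tried out with a short sketch. It uses Python's re module as a stand-in for the Internet Utilities matcher, which is an assumption: the two engines are not the same, and Python has no \< or \> operators, so the sketch uses \b (a word boundary at either end) in their place. The helper name matches is hypothetical.

```python
import re

# Assumption: Python's re module approximates the operators listed above.
# It is NOT the Internet Utilities engine; \< and \> are replaced by \b.

def matches(pattern, text):
    """Return True if the pattern matches anywhere in text."""
    return re.search(pattern, text) is not None

# ?  zero or one repetition of the preceding atom
optional_u = r"colou?r"
# +  one or more repetitions; *  zero or more repetitions
one_or_more_b = r"ab+c"
zero_or_more_b = r"ab*c"
# ^ and $ anchor to the line; () groups; | separates alternatives
whole_line_pet = r"^(cat|dog)$"
# \  quotes the next character, turning the + metacharacter into a literal
literal_plus = r"1\+1"
# \b stands in for both \< and \>
whole_word_the = r"\bthe\b"
```

With these definitions, matches(optional_u, "color") and matches(optional_u, "colour") both succeed, while matches(one_or_more_b, "ac") fails because + requires at least one 'b'.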

A pattern comprises a list of branches, separated by | if there is more than one. Each branch comprises a series of consecutive pieces, all of which must match in the given order for the branch as a whole to match. Each piece is either ^, $, or an atom followed by an optional suffix operator: *, ?, or +. An atom is a character set, a subexpression enclosed in parentheses, ., or a literal character.
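
The structure described above can be sketched as a small recursive-descent parser. This is an illustration only, not the Internet Utilities implementation: the function names (parse, parse_pattern, parse_atom) and the tuple shapes are hypothetical, and the \ quoting operator and the \< and \> anchors are left out to keep the sketch short.

```python
# Hypothetical sketch: pattern -> branches -> pieces -> atoms.
# Not handled: backslash quoting, \< and \> (a backslash is read as a literal).

def parse(text):
    """Parse a whole pattern and return its list of branches."""
    branches, pos = parse_pattern(text, 0)
    if pos != len(text):
        raise ValueError("unmatched ')' at offset %d" % pos)
    return branches

def parse_pattern(text, pos):
    """Return (branches, next_pos); each branch is a list of pieces."""
    branches = [[]]
    while pos < len(text) and text[pos] != ")":
        ch = text[pos]
        if ch == "|":                      # start the next alternative
            branches.append([])
            pos += 1
        elif ch in "^$":                   # anchors are pieces on their own
            branches[-1].append(("anchor", ch))
            pos += 1
        else:                              # an atom plus an optional suffix
            atom, pos = parse_atom(text, pos)
            suffix = None
            if pos < len(text) and text[pos] in "*?+":
                suffix = text[pos]
                pos += 1
            branches[-1].append(("piece", atom, suffix))
    return branches, pos

def parse_atom(text, pos):
    """Return (atom, next_pos) for the atom starting at text[pos]."""
    ch = text[pos]
    if ch in "*?+":
        raise ValueError("suffix operator with no atom at offset %d" % pos)
    if ch == "(":
        inner, pos = parse_pattern(text, pos + 1)
        if pos >= len(text) or text[pos] != ")":
            raise ValueError("unclosed subexpression")
        return ("group", inner), pos + 1
    if ch == "[":
        end = pos + 1
        if end < len(text) and text[end] == "^":
            end += 1                       # negation prefix
        if end < len(text) and text[end] == "]":
            end += 1                       # a leading ']' is a set member
        while end < len(text) and text[end] != "]":
            if text.startswith("[:", end):  # skip a class such as [:alpha:]
                close = text.find(":]", end + 2)
                if close < 0:
                    raise ValueError("unclosed character class")
                end = close + 2
            else:
                end += 1
        if end >= len(text):
            raise ValueError("unclosed character set")
        return ("set", text[pos:end + 1]), end + 1
    if ch == ".":
        return ("any",), pos + 1
    return ("literal", ch), pos + 1
```

For example, parse("a|b|c") yields three branches of one piece each, and the last piece of parse("^REM(ARK)?") is a group atom carrying the ? suffix.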

Character sets comprise a list of characters, character ranges, and character classes, enclosed within '[' and ']'. A match occurs if the target character is in the set, unless the entire set is prefixed by the '^' character (which is not part of the set itself) in which case a match occurs if the target character is not in the set. A character range is two characters separated by a '-', and includes all characters lexically from the first to the second. A character class is a class name ("alpha", "digit", "alnum", "xdigit", "graph", "space", "print", "upper", "lower", "cntrl", or "punct") enclosed within "[:" and ":]", and denotes all characters in that class. To include the ']' character as part of the set, list it before all other characters. To include the '-' character as part of the set, list it before all other characters except ']'.
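
A minimal sketch of the character set rules follows. Python's re module does not understand class names such as [:alpha:], so the sketch first rewrites each class into explicit ASCII ranges. The table POSIX_CLASSES, the ASCII-only ranges in it, and the helper names translate_classes and in_set are all assumptions made for illustration; the Internet Utilities may define the classes differently.

```python
import re

# Assumed ASCII-only expansions of the class names listed above.
POSIX_CLASSES = {
    "alpha": "a-zA-Z",
    "digit": "0-9",
    "alnum": "a-zA-Z0-9",
    "xdigit": "0-9A-Fa-f",
    "upper": "A-Z",
    "lower": "a-z",
    "space": r" \t\n\r\f\v",
    "cntrl": r"\x00-\x1f\x7f",
    "graph": r"\x21-\x7e",
    "print": r"\x20-\x7e",
    "punct": r"\x21-\x2f\x3a-\x40\x5b-\x60\x7b-\x7e",
}

def translate_classes(pattern):
    """Replace every [:name:] with its assumed ASCII range list."""
    return re.sub(r"\[:([a-z]+):\]",
                  lambda m: POSIX_CLASSES[m.group(1)],
                  pattern)

def in_set(set_text, ch):
    """Return True if the single character ch matches the character set."""
    return re.fullmatch(translate_classes(set_text), ch) is not None
```

Under these assumptions, in_set("[[:alnum:][:space:]$]", "$") succeeds, in_set("[^;]", ";") fails because the leading ^ negates the set, and in_set("[]a]", "]") succeeds because ']' is listed first. Combining a leading ']' with a following '-' is avoided here, since Python would read that as a range.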

Examples:

.*
Zero or more characters, i.e. any string.
.+
One or more characters, i.e. any non-empty string.
c..
Three characters, the first of which is 'c'.
.a...n.
Seven characters, the second of which is 'a' and the sixth of which is 'n' (useful for solving crosswords!).
[0-9]+
A sequence of digits that is at least one character long.
[[:alpha:]][[:alnum:]]*
Any single alphabetic character, followed by zero or more alphanumeric characters.
[[:alnum:][:space:]$]
An alphanumeric character, a whitespace character, or the '$' character.
^REM(ARK)?
Any line beginning with REM or REMARK.
[^;]$
Any non-empty line whose last character is not a semicolon.
\<(their|they're|there)\>
One of the three words "their", "they're", or "there".
\+[0-9]+[- ][- 0-9]+
A telephone number in standard internationalised form (with either spaces or dashes).
\<[A-Za-z][A-Za-z] [0-9]+\>
A two-letter state abbreviation followed by a U.S.-style "zip" (i.e. postal) code.
([0-9]+:)?[0-9]+/[0-9]+(\.[0-9]+)?
A Fidonet address, with optional zone number and point number.
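
Several of the examples above can be checked with a short sketch, again using Python's re module as an assumed stand-in for the Internet Utilities matcher. The helper to_python, which rewrites \< and \> as Python's \b, and the helper found are hypothetical, and the sample strings are made up for illustration.

```python
import re

def to_python(pattern):
    """Rewrite \\< and \\> as \\b, Python's word-boundary operator.

    Assumption: a plain textual replacement is good enough here; it would
    mistranslate a quoted backslash that happens to precede '<' or '>'.
    """
    return pattern.replace(r"\<", r"\b").replace(r"\>", r"\b")

def found(pattern, text):
    """Return True if the translated pattern matches anywhere in text."""
    return re.search(to_python(pattern), text) is not None

CROSSWORD = r".a...n."
REMARK = r"^REM(ARK)?"
NO_SEMICOLON = r"[^;]$"
HOMOPHONES = r"\<(their|they're|there)\>"
TELEPHONE = r"\+[0-9]+[- ][- 0-9]+"
STATE_ZIP = r"\<[A-Za-z][A-Za-z] [0-9]+\>"
FIDONET = r"([0-9]+:)?[0-9]+/[0-9]+(\.[0-9]+)?"
```

For instance, found(HOMOPHONES, "over there now") succeeds but found(HOMOPHONES, "therefore") does not, because "there" is not a whole word in the second string. Note that found searches within the text; to test a whole string against the Fidonet pattern, re.fullmatch can be used instead.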

Regular expression substitutions


The Internet Utilities are © Copyright Jonathan de Boyne Pollard. "Moral" rights are asserted.