Regular expression syntax

Regular expression matching patterns

The regular expression syntax used in the Internet Utilities is a subset of the regular expression syntax specified in the POSIX standards:

.      Any character.
?      Zero or one repetition.
+      One or more repetitions.
*      Zero or more repetitions.
[ ]    Encloses a character set.
( )    Encloses a subexpression.
|      Separates alternatives.
\      Quotes the next character. This is used to turn metacharacters into literals.
^      Matches the beginning of the line.
$      Matches the end of the line.
\<     Matches the beginning of a word.
\>     Matches the end of a word.

A pattern comprises a list of branches, separated by | if there is more than one. Each branch comprises a series of consecutive pieces, all of which must match in the given order for the branch as a whole to match. Each piece is either ^, $, or an atom followed by an optional suffix operator: *, ?, or +. An atom is a character set, a subexpression enclosed in parentheses, ., or a literal character.
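To make the grammar concrete, here is a minimal sketch in Python of a parser that splits a pattern into branches, pieces, and atoms as described above. It is not the parser used by the Internet Utilities: the names parse_pattern, _parse_branches and _parse_atom, and the tuple representation, are inventions of this sketch, and treating \< and \> as anchors alongside ^ and $ is an assumption.

```python
SUFFIXES = "*?+"
ANCHORS = ("^", "$", "\\<", "\\>")  # assumption: \< and \> behave like ^ and $


def parse_pattern(pattern):
    """Return a list of branches; each branch is a list of (atom, suffix) pieces."""
    branches, _ = _parse_branches(pattern, 0, nested=False)
    return branches


def _parse_branches(p, pos, nested):
    branches = [[]]
    while pos < len(p):
        if p[pos] == ")":
            if not nested:
                raise ValueError("unmatched ')'")
            return branches, pos + 1
        if p[pos] == "|":  # start the next alternative
            branches.append([])
            pos += 1
            continue
        anchor = next((a for a in ANCHORS if p.startswith(a, pos)), None)
        if anchor is not None:  # anchors take no suffix operator
            branches[-1].append((("anchor", anchor), ""))
            pos += len(anchor)
            continue
        atom, pos = _parse_atom(p, pos)
        suffix = ""
        if pos < len(p) and p[pos] in SUFFIXES:
            suffix = p[pos]
            pos += 1
        branches[-1].append((atom, suffix))
    if nested:
        raise ValueError("missing ')'")
    return branches, pos


def _parse_atom(p, pos):
    ch = p[pos]
    if ch == "(":  # subexpression: a nested list of branches
        inner, pos = _parse_branches(p, pos + 1, nested=True)
        return ("group", inner), pos
    if ch == "[":  # character set, kept as raw text
        end = pos + 1
        if end < len(p) and p[end] == "^":
            end += 1
        if end < len(p) and p[end] == "]":  # a leading ']' is a literal
            end += 1
        while end < len(p) and p[end] != "]":
            if p.startswith("[:", end):  # skip over a [:class:]
                close = p.find(":]", end + 2)
                if close == -1:
                    raise ValueError("unterminated character class")
                end = close + 2
            else:
                end += 1
        if end >= len(p):
            raise ValueError("missing ']'")
        return ("set", p[pos:end + 1]), end + 1
    if ch == "\\":  # quoted character becomes a literal
        if pos + 1 >= len(p):
            raise ValueError("trailing backslash")
        return ("literal", p[pos + 1]), pos + 2
    if ch == ".":
        return ("any", "."), pos + 1
    if ch in SUFFIXES:
        raise ValueError("suffix operator with no atom before it")
    return ("literal", ch), pos + 1
```

For instance, parse_pattern("a|b*") yields two branches, the second of which holds the single piece (("literal", "b"), "*").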

Character sets comprise a list of characters, character ranges, and character classes, enclosed within '[' and ']'. A match occurs if the target character is in the set, unless the entire set is prefixed by the '^' character (which is not part of the set itself) in which case a match occurs if the target character is not in the set. A character range is two characters separated by a '-', and includes all characters lexically from the first to the second. A character class is a class name ("alpha", "digit", "alnum", "xdigit", "graph", "space", "print", "upper", "lower", "cntrl", or "punct") enclosed within "[:" and ":]", and denotes all characters in that class. To include the ']' character as part of the set, list it before all other characters. To include the '-' character as part of the set, list it before all other characters except ']'.
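The rules for character sets can be sketched as a small Python function that turns a bracketed set into a membership test. This is an illustration only, not the implementation used by the Internet Utilities: the name compile_set is hypothetical, and the class tables assume plain ASCII (the "C" locale), which the text above does not specify.

```python
import string

# Assumption: ASCII-only definitions of the character classes.
CLASSES = {
    "alpha": string.ascii_letters,
    "digit": string.digits,
    "alnum": string.ascii_letters + string.digits,
    "xdigit": string.hexdigits,
    "upper": string.ascii_uppercase,
    "lower": string.ascii_lowercase,
    "space": string.whitespace,
    "punct": string.punctuation,
    "graph": string.ascii_letters + string.digits + string.punctuation,
    "print": string.ascii_letters + string.digits + string.punctuation + " ",
    "cntrl": "".join(map(chr, range(32))) + "\x7f",
}


def compile_set(spec):
    """Return a predicate for a bracketed set such as '[^a-z[:digit:]]'."""
    if len(spec) < 3 or not (spec.startswith("[") and spec.endswith("]")):
        raise ValueError("a set must be enclosed within '[' and ']'")
    body = spec[1:-1]
    negate = body.startswith("^")  # the '^' prefix is not part of the set
    if negate:
        body = body[1:]
    members = set()
    i = 0
    if body.startswith("]", i):  # ']' listed first is a literal
        members.add("]")
        i += 1
    if body.startswith("-", i):  # '-' listed first (after any ']') is a literal
        members.add("-")
        i += 1
    while i < len(body):
        if body.startswith("[:", i):  # character class
            end = body.find(":]", i + 2)
            if end == -1:
                raise ValueError("unterminated character class")
            name = body[i + 2:end]
            if name not in CLASSES:
                raise ValueError("unknown character class: " + name)
            members.update(CLASSES[name])
            i = end + 2
        elif i + 2 < len(body) and body[i + 1] == "-":  # character range
            members.update(chr(c) for c in range(ord(body[i]), ord(body[i + 2]) + 1))
            i += 3
        else:  # single character
            members.add(body[i])
            i += 1
    return lambda ch: (ch in members) != negate
```

Calling compile_set("[^0-9]") gives a test that accepts "x" and rejects "5"; compile_set("[]-a]") accepts exactly ']', '-', and 'a'.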


.*
Zero or more characters, i.e. any string.
.+
One or more characters, i.e. any non-empty string.
c..
Three characters, the first of which is 'c'.
.a...n.
Seven characters, the second of which is 'a' and the sixth of which is 'n' (useful for solving crosswords!)
[0-9]+
A sequence of digits that is at least one character long.
[[:alpha:]][[:alnum:]]*
Any single alphabetic character, followed by zero or more alphanumeric characters.
[[:alnum:][:space:]$]
An alphanumeric character, a whitespace character, or the '$' character.
^REM(ARK)?
Any line beginning with REM or REMARK.
[^;]$
Any line not ending with a semi-colon.
the(ir|y're|re)
One of the three words "their", "they're", or "there".
\+[0-9]+[- ][- 0-9]+
A telephone number in standard internationalised form (with either spaces or dashes).
\<[A-Za-z][A-Za-z] [0-9]+\>
A U.S.-style "zip" (i.e. postal) code.
([0-9]+:)?[0-9]+/[0-9]+(\.[0-9]+)?
A Fidonet address, with optional zone number and point number.
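The descriptions above can be checked mechanically. The following is a minimal sketch using Python's re module, which is not the matcher used by the Internet Utilities and differs from it in places: re has no [:class:] names and no \< or \>, so this sketch substitutes explicit ranges, \s, and \b. Apart from the telephone number and zip code patterns, which are adapted from the list above, the patterns and all of the sample strings are this sketch's own assumptions.

```python
import re

# Each entry: (Python pattern, match the whole string?, should match, should not match)
EXAMPLES = [
    (r".+", True, "x", ""),
    (r"c..", True, "cat", "dog"),
    (r".a...n.", True, "patient", "magenta"),
    (r"[0-9]+", True, "2024", "20x4"),
    (r"[A-Za-z][A-Za-z0-9]*", True, "x86", "86x"),
    (r"[A-Za-z0-9\s$]", True, "$", "-"),
    (r"^REM(ARK)?", False, "REMARK old code", "  REM indented"),
    (r"[^;]$", False, "x = 1", "x = 1;"),
    (r"the(ir|y're|re)", True, "they're", "then"),
    (r"\+[0-9]+[- ][- 0-9]+", True, "+44 20 7946 0000", "44 20 7946 0000"),
    (r"\b[A-Za-z][A-Za-z] [0-9]+\b", False, "Beverly Hills, CA 90210", "CA-90210"),
    (r"([0-9]+:)?[0-9]+/[0-9]+(\.[0-9]+)?", True, "2:257/609.5", "2:257"),
]


def matches(pattern, whole, text):
    """Match against the whole string, or anywhere within it."""
    finder = re.fullmatch if whole else re.search
    return finder(pattern, text) is not None


def failures():
    """Return the patterns whose sample strings do not behave as labelled."""
    bad = []
    for pattern, whole, good, poor in EXAMPLES:
        if not matches(pattern, whole, good) or matches(pattern, whole, poor):
            bad.append(pattern)
    return bad
```

An empty list from failures() means every pattern accepted its positive sample and rejected its negative one.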

Regular expression substitutions

The Internet Utilities are © Copyright Jonathan de Boyne Pollard. "Moral" rights are asserted.