Regular Expressions reference

J. Welsh, August 2019

Theory

An RE is a pattern that specifies a (possibly infinite) set of matching strings. The rules of interpretation answer the primary question, "does this string match this pattern?" A related question is what substring is matched by a certain part of an RE.

Rules:

Self-matching: most characters match exactly one instance of themselves.
Concatenation: if R1 matches s1 and R2 matches s2, then R1 R2 matches s1 s2.
Alternation ("or"): R1 | R2 matches those strings matched by either R1 or R2 (or both).
Kleene star ("zero or more"): R* matches a sequence of zero or more substrings, each matching R; in particular, it matches the empty string.
Grouping: normally the star has precedence over concatenation, which in turn has precedence over alternation; parentheses specify an explicit precedence.
- abc|def is equivalent to (abc)|(def) but not ab(c|d)ef;
- abc* is equivalent to ab(c*) but not (abc)*.
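
The precedence rules above can be checked mechanically. The following is a minimal sketch using Python's re module, whose syntax for these operators happens to coincide with the notation used here; the helper name matches is made up for illustration, and re.fullmatch is used so that the whole string must match the pattern.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True if the whole of text matches pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

samples = ("abc", "def", "abcef", "abdef")

# Alternation binds loosest: abc|def behaves like (abc)|(def).
alternation = [matches("abc|def", s) for s in samples]
explicit_alt = [matches("(abc)|(def)", s) for s in samples]

# Parentheses change the meaning: ab(c|d)ef accepts different strings.
grouped_alt = [matches("ab(c|d)ef", s) for s in samples]

# The star binds tightest: abc* behaves like ab(c*), not (abc)*.
star = [matches("abc*", s) for s in ("ab", "abccc", "abcabc")]
grouped_star = [matches("(abc)*", s) for s in ("", "abcabc", "abccc")]

print(alternation, grouped_alt, star, grouped_star)
```

Here abc|def accepts only "abc" and "def", ab(c|d)ef accepts only "abcef" and "abdef", and "abcabc" is accepted by (abc)* but rejected by abc*.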

Practice

In the UNIX utilities there are two major "dialects" of RE: basic (BRE) and extended (ERE). The main difference is in the default meaning of metacharacters. In BREs, more characters are self-matching and must be preceded by a \ (backslash-escaped) to access their special meaning, whereas in EREs some of these characters have special meaning by default and must be backslash-escaped to match literally. These include the | ( ) from the theory above (but not *) and those marked (ext) below; regex(7) has the details.

Matching is case-sensitive unless otherwise specified.

Various practical shorthands are available:

Wildcard: . matches any one character.
Classes
- [abc] matches any one of the enclosed characters (equivalent to (a|b|c));
- Complement: [^abc] matches any one character other than the enclosed;
- Ranges: [A-Z], [A-Za-z0-9]
Line anchoring: ^ matches at the start of a line and $ at the end, as in ^start ... end$
Quantifiers (variants on *)
- "One or more": + (ext)
- "Zero or one": ? (ext)
- "Between N and M": {N,M} (and variants) (ext)
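
As an illustration of these shorthands, here is a small sketch in Python. Python's re module is not a UNIX BRE or ERE engine, but for the constructs listed here its syntax agrees with the ERE forms (no backslash needed before + ? { }). The helper name full and the sample strings are invented for the example.

```python
import re

def full(pattern: str, text: str) -> bool:
    """Return True if the whole of text matches pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

# Wildcard: . stands for exactly one character.
wildcard = [full("c.t", s) for s in ("cat", "c-t", "ct")]

# Classes, complements and ranges each match exactly one character.
char_class = [full("[abc]", s) for s in ("a", "d")]
complement = [full("[^abc]", s) for s in ("d", "a")]
ranges = [full("[A-Za-z0-9]", s) for s in ("q", "7", "_")]

# Anchors: with re.MULTILINE, ^ and $ apply to each line, as in grep.
text = "start here\nno start\nthe end"
at_line_start = re.findall("^start", text, flags=re.MULTILINE)
at_line_end = re.findall("end$", text, flags=re.MULTILINE)

# Quantifiers: one or more, zero or one, between N and M.
one_or_more = [full("ab+", s) for s in ("a", "ab", "abbb")]
zero_or_one = [full("colou?r", s) for s in ("color", "colour", "colouur")]
bounded = [full("a{2,3}", s) for s in ("a", "aa", "aaa", "aaaa")]

print(wildcard, char_class, complement, ranges)
print(at_line_start, at_line_end)
print(one_or_more, zero_or_one, bounded)
```

Note that "^start" finds only the occurrence that begins a line; the "start" inside "no start" is not reported.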

Programs supporting RE include:

grep
- grep -i (case insensitive)
- egrep or grep -E (ERE mode)
The search command / in less and vim
The "stream editor" sed
- sed -r, or the more portable sed -E (ERE mode)
awk

Additionally, many programming languages come with RE matching engines, e.g. C, Python and Perl, the last of these with its own extended dialect.
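
As one example of a language-level engine, the sketch below uses Python's re module to answer both questions from the Theory section: whether a string matches, and which substring is matched by a given part of the RE. The log line and the variable names are invented for illustration.

```python
import re

# A made-up input line, used only for this example.
line = "Error: disk sda1 is 93% full"

# Matching is case-sensitive by default, so "error" does not find "Error".
case_sensitive = re.search("error", line)

# re.IGNORECASE plays the same role as grep -i.
case_insensitive = re.search("error", line, flags=re.IGNORECASE)

# Parenthesised groups record which substring each part of the RE matched.
m = re.search(r"disk ([a-z]+[0-9]+) is ([0-9]+)% full", line)
device, percent = m.group(1), m.group(2)

print(case_sensitive is None, case_insensitive.group(0), device, percent)
```

re.search looks for a match anywhere in the string, much as grep looks for a match anywhere in a line; re.fullmatch would instead require the whole string to match.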