Regular Expressions reference
J. Welsh, August 2019
Theory
An RE is a pattern that specifies a (possibly infinite) set of matching strings.
The rules of interpretation answer the primary question, "does this string match this pattern?"
A related question is what substring is matched by a certain part of an RE.
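Both questions can be tried out in Python's re module. This is only a sketch: the pattern and the test string are made up for illustration, and Python's dialect is not identical to the UNIX ones described later, though it agrees with them on the constructs used here.

```python
import re

# Hypothetical pattern built only from alternation, star and grouping.
pattern = re.compile(r"(a|b)(c*)")

m = pattern.fullmatch("bccc")
is_match = m is not None   # primary question: does the whole string match?
first = m.group(1)         # related question: what did (a|b) match?
second = m.group(2)        # ... and what did (c*) match?
print(is_match, first, second)
```

Here fullmatch answers the yes/no question for the whole string, and group(n) reports the substring matched by the n-th parenthesised part.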
Rules:
- Self-matching: most characters match exactly one instance of themselves.
- Concatenation: if R1 matches s1 and R2 matches s2, then R1 R2 matches s1 s2.
- Alternation ("or"): R1 | R2 matches those strings matched by either R1 or R2 (or both).
- Kleene star ("zero or more"): R* matches any sequence of zero or more substrings, each matching R; in particular, it matches the empty string.
- Grouping: normally the star has precedence over concatenation, which in turn has precedence over alternation; parentheses specify an explicit precedence.
  - abc|def is equivalent to (abc)|(def) but not ab(c|d)ef;
  - abc* is equivalent to ab(c*) but not (abc)*.
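The precedence rules above can be checked mechanically. The following is a minimal sketch using Python's re module, assuming its behaviour matches the theory for these operators; the helper name matches is an illustrative choice, not part of any library.

```python
import re

def matches(pattern: str, s: str) -> bool:
    """Return True if the whole string s is matched by pattern."""
    return re.fullmatch(pattern, s) is not None

# Alternation binds more loosely than concatenation.
print(matches("abc|def", "abc"))     # True: abc|def behaves like (abc)|(def)
print(matches("abc|def", "abcef"))   # False: it does not behave like ab(c|d)ef
print(matches("ab(c|d)ef", "abcef")) # True

# The star binds more tightly than concatenation.
print(matches("abc*", "abccc"))      # True: abc* behaves like ab(c*)
print(matches("abc*", "abcabc"))     # False: it does not behave like (abc)*
print(matches("(abc)*", "abcabc"))   # True
print(matches("(abc)*", ""))         # True: zero repetitions are allowed
```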
Practice
In the UNIX utilities there are two major "dialects" of RE: basic (BRE) and extended (ERE). The main difference is in the default meaning of metacharacters. In BREs, more characters are self-matching and must be preceded by a \ (backslash-escaped) to access their special meaning, whereas in EREs those same characters have special meaning by default and must be backslash-escaped to match literally. They include the | ( ) from the theory above (but not *) and those marked (ext) below; regex(7) has the details.
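The ERE-style convention (special by default, backslash to match literally) can be illustrated with Python's re module, which follows the same convention for | ( ) and *. This is a sketch only: Python's dialect is neither BRE nor ERE, and the BRE behaviour itself is not shown here.

```python
import re

# Unescaped, | means alternation: the pattern matches "a" or "b".
print(re.fullmatch(r"a|b", "a") is not None)     # True
print(re.fullmatch(r"a|b", "a|b") is not None)   # False

# Backslash-escaped, \| is a literal vertical bar.
print(re.fullmatch(r"a\|b", "a|b") is not None)  # True

# re.escape backslash-escapes every metacharacter in a string,
# producing a pattern that matches that string literally.
literal = re.escape("(a|b)*")
print(re.fullmatch(literal, "(a|b)*") is not None)  # True
```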
Matching is case-sensitive unless otherwise specified.
Various practical shorthands are available:
- Wildcard: . matches any one character.
- Classes
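  - [abc] matches any one of the enclosed characters (equivalent to (a|b|c));
  - Complement: [^abc] matches any one character other than those enclosed;
  - Ranges: [A-Z] matches any one uppercase letter, [A-Za-z0-9] any one letter or digit.
- Line anchoring: ^ matches at the start of a line and $ at the end, as in ^start ... end$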
- Quantifiers (variants on *)
  - "One or more": + (ext)
  - "Zero or one": ? (ext)
  - "Between N and M": {N,M} (and variants) (ext)
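The shorthands listed above can be exercised one by one. The sketch below uses Python's re module, where all of them are available without backslashes (as in ERE); the test strings are made up for illustration, and note that in Python the wildcard . does not match a newline by default.

```python
import re

def full(pattern: str, s: str) -> bool:
    """True if pattern matches the whole of s."""
    return re.fullmatch(pattern, s) is not None

def found(pattern: str, s: str) -> bool:
    """True if pattern matches somewhere in s, as grep would on one line."""
    return re.search(pattern, s) is not None

results = {
    "wildcard":    full(r"c.t", "cut"),               # . stands for the "u"
    "class":       full(r"gr[ae]y", "grey"),          # [ae] accepts a or e
    "complement":  full(r"[^abc]", "a"),              # a is excluded
    "range":       full(r"[A-Za-z0-9]", "7"),         # one letter or digit
    "anchored":    found(r"^start.*end$", "start ... end"),
    "unanchored":  found(r"^start", "restart"),       # not at line start
    "one_or_more": full(r"ab+", "a"),                 # + needs at least one b
    "zero_or_one": full(r"colou?r", "color"),         # the u is optional
    "bounded":     full(r"a{2,3}", "aaaa"),           # four a's is too many
}
for name, value in results.items():
    print(name, value)
```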
Programs supporting RE include:
- grep
  - grep -i (case insensitive)
  - egrep or grep -E (ERE mode)
- The search command / in less and vim
- The "stream editor" sed
  - sed -E, or sed -r in GNU sed (ERE mode)
- awk
Additionally, many programming languages come with RE matching engines, e.g. C, Python, Perl; the last of these has its own extended dialect.
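As an illustration of a language-level engine, here is a minimal grep-like filter written with Python's re module. The function name grep, its ignore_case parameter (mimicking grep -i), and the sample lines are all assumptions made for this sketch; it does not reproduce grep's actual option handling.

```python
import re

def grep(pattern, lines, ignore_case=False):
    """Return the lines in which pattern matches somewhere."""
    flags = re.IGNORECASE if ignore_case else 0
    regex = re.compile(pattern, flags)
    return [line for line in lines if regex.search(line)]

# Hypothetical input lines.
log = ["Error: disk full", "warning: low memory", "error: timeout"]

print(grep(r"^error", log))                    # ['error: timeout']
print(grep(r"^error", log, ignore_case=True))  # ['Error: disk full', 'error: timeout']
```

The first call shows the default case-sensitive matching; the second shows the effect of switching it off.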