Regular Expressions reference

J. Welsh, August 2019

Theory

A regular expression (RE) is a pattern that specifies a (possibly infinite) set of matching strings. The rules of interpretation answer the primary question, "does this string match this pattern?" A related question is which substring is matched by a given part of an RE.
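
Both questions can be sketched with Python's re module, used here only as a convenient engine. The pattern, the test strings and the helper names matches and parts are made up for illustration and are not part of any UNIX tool.

```python
import re

# Hypothetical pattern: "ab" or "cd", followed by any number of "x".
PATTERN = re.compile(r"(ab|cd)(x*)")

def matches(s):
    """Primary question: does the whole string match the pattern?"""
    return PATTERN.fullmatch(s) is not None

def parts(s):
    """Related question: which substring did each parenthesised part match?"""
    m = PATTERN.fullmatch(s)
    return m.groups() if m else None

print(matches("abxxx"))  # True
print(matches("abc"))    # False
print(parts("cdx"))      # ('cd', 'x')
```

fullmatch is used rather than search so that the answer concerns the whole string, which is the sense of "match" in the theory.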

Rules:

  1. Self-matching: most characters match exactly one instance of themselves.
  2. Concatenation: if R1 matches s1 and R2 matches s2, then R1 R2 matches s1 s2.
  3. Alternation ("or"): R1 | R2 matches those strings matched by either R1 or R2 (or both).
  4. Kleene star ("zero or more"): R* matches any string formed by concatenating zero or more substrings, each matching R; in particular, it matches the empty string.
  5. Grouping: normally the star has precedence over concatenation, which in turn has precedence over alternation; parentheses specify an explicit precedence.
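
The five rules can be checked one by one with a short Python sketch. It assumes Python's re syntax, which agrees with the theory for these operators; the helper accepts is a name introduced here for illustration.

```python
import re

def accepts(pattern, s):
    """Return True if the entire string s is matched by pattern."""
    return re.fullmatch(pattern, s) is not None

checks = [
    ("a",       "a"),      # 1. self-matching
    ("ab",      "ab"),     # 2. concatenation
    ("a|b",     "b"),      # 3. alternation
    ("a*",      ""),       # 4. star: zero repetitions
    ("a*",      "aaa"),    # 4. star: several repetitions
    ("ab*",     "abbb"),   # 5. star binds tighter than concatenation
    ("(ab)*",   "abab"),   # 5. parentheses make the star cover "ab"
    ("ab|cd",   "cd"),     # 5. concatenation binds tighter than alternation
    ("a(b|c)d", "acd"),    # 5. parentheses limit the alternation to b or c
]

for pattern, s in checks:
    print(repr(pattern), repr(s), accepts(pattern, s))  # each line ends in True

# Counter-examples showing the default precedence:
print(accepts("ab*", "abab"))   # False: the star applies to b only
print(accepts("ab|cd", "acd"))  # False: the alternatives are "ab" and "cd"
```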

Practice

In the UNIX utilities there are two major "dialects" of RE: basic (BRE) and extended (ERE). The main difference is in the default meaning of metacharacters. In BREs, more characters are self-matching and must be preceded by a \ (backslash-escaped) to access their special meaning, whereas in EREs some of these characters have special meaning by default and must be backslash-escaped to match literally. These include the | ( ) from the theory above (but not *) and those marked (ext) below; regex(7) has the details.
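
The swap of escaped and unescaped meanings can be sketched as a small Python translator. This is a simplified illustration, not how any utility is implemented: the function bre_to_ere is hypothetical, it handles only the three characters | ( ) named above, and it ignores bracket expressions and every other dialect difference. The result is tested with Python's re module, on the assumption that it treats these three characters the way an ERE does.

```python
import re

SWAPPED = "|()"  # only the characters named in the text; all else is copied as-is

def bre_to_ere(bre):
    """Swap the escaped/unescaped roles of | ( ) in a BRE-style pattern."""
    out = []
    i = 0
    while i < len(bre):
        c = bre[i]
        if c == "\\" and i + 1 < len(bre):
            nxt = bre[i + 1]
            # \( is an operator in a BRE; the ERE spelling is a bare (.
            out.append(nxt if nxt in SWAPPED else c + nxt)
            i += 2
        elif c in SWAPPED:
            # A bare ( is literal in a BRE; the ERE spelling is \(.
            out.append("\\" + c)
            i += 1
        else:
            out.append(c)
            i += 1
    return "".join(out)

print(bre_to_ere(r"\(ab\|cd\)*"))  # (ab|cd)*
print(bre_to_ere(r"f(x)"))         # f\(x\)

# The translated patterns behave as expected in an ERE-like engine.
print(re.fullmatch(bre_to_ere(r"\(ab\|cd\)*"), "abcdab") is not None)  # True
print(re.fullmatch(bre_to_ere(r"f(x)"), "f(x)") is not None)           # True
```

The star is copied through unchanged because it is a metacharacter in both dialects.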

Matching is case-sensitive unless otherwise specified.
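
As an illustration in Python, where the override is the re.IGNORECASE flag (the UNIX utilities have their own options for this, such as grep -i); the helper names below are made up for the example.

```python
import re

def matches_exact_case(pattern, s):
    """Default behaviour: letters must agree in case."""
    return re.fullmatch(pattern, s) is not None

def matches_any_case(pattern, s):
    """Case-insensitive matching, requested explicitly with a flag."""
    return re.fullmatch(pattern, s, re.IGNORECASE) is not None

print(matches_exact_case("unix", "UNIX"))  # False
print(matches_any_case("unix", "UNIX"))    # True
```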

Various practical shorthands are available:

Programs supporting RE include:

Additionally, many programming languages come with RE matching engines, e.g. C, Python and Perl, the last with its own extended dialect.