Google RE2

Regular expressions (regex) provide a means for matching text, whether characters, words, or patterns. They can be infinitely expressive and yet typically compact. An example is "[hc]at", which would match "hat" and "cat".

The implementation of regular expressions in almost all programming languages today allow for backreferences. Backreferences allow you to reuse part of the regex match later in the regular expression. For example "<([A-Z][A-Z0-9]*)\b[^>]*>.*?</\1>" would match any opening and closing html tag and the text in-between. The "\1" is used to reference the part of the regex in parenthesis, in this case, the type of tag we're trying to match. Thus, we can ensure that the we get the same type of beginning and ending tags.

Backtracking sounds really cool, but we run into a problem. Regular expressions containing backreferences have a potential for exponential run time and unbounded stack (memory) usage. Unbounded stack usage in typical regular expression implementations leads to stack overflows and server crashes and all sorts of terrible things. That is to say, under certain circumstances, evaluating a regular expression over a large enough input could hang forever, use up all memory allocated to it, or both. This is a very important problem in Google's case.

Enter bright computer scientists. Basing their approach on automata and computational theory (I took that class!), Google researchers have created RE2, a "principled approach to regular expressions". It claims the implementation guarantees that searches complete in linear time with respect to the size of the input and won't overflow the stack space. They have limited the worst case runtime by removing support for backexpressions, however.

In his paper "Regular Expression Matching Can Be Simple And Fast", Russ Cox, the Google engineer who also authored the press release, writes, "Given a choice between an implementation with a predictable, consistent, fast running time on all inputs or one that usually runs quickly but can take years of CPU time (or more) on some inputs, the decision should be easy." For Google, yes. For those who like fancy syntactic sugar, it's still great to have another option.

Sources and Further Reading:
Google RE2 Press Release
The Command Line
Code Repository: RE2 on Google Code