2011-04-28

Perl / Python / Ruby / PHP RegEx implementation VS Go / Golang's implementation


Introduction
This is a tale of two approaches to regular expression matching. One of them is in widespread use in the standard interpreters for many languages, including Perl. The other is used only in a few places, notably most implementations of awk and grep. The two approaches have wildly different performance characteristics:
Perl graph Thompson NFA graph
 
Time to match a?nan against an
Let's use superscripts to denote string repetition, so that a?3a3 is shorthand for a?a?a?aaa. The two graphs plot the time required by each approach to match the regular expression a?nan against the string an.
Notice that Perl requires over sixty seconds to match a 29-character string. The other approach, labeled Thompson NFA for reasons that will be explained later, requires twenty microseconds to match the string. That's not a typo. The Perl graph plots time in seconds, while the Thompson NFA graph plots time in microseconds: the Thompson NFA implementation is a million times faster than Perl when running on a miniscule 29-character string. The trends shown in the graph continue: the Thompson NFA handles a 100-character string in under 200 microseconds, while Perl would require over 1015 years. (Perl is only the most conspicuous example of a large number of popular programs that use the same algorithm; the above graph could have been Python, or PHP, or Ruby, or many other languages. A more detailed graph later in this article presents data for other implementations.) [...]
--Russ Cox
...the right graph shows the behavior of Go's RegEx implementation.