Regular expressions

I searched for how to implement capturing groups online, and couldn't easily find the algorithm anywhere in a human-readable form, so I worked it out for myself and decided to post it here, so others can find it. If you want to make comments on it, send me an e-mail.

Regular expressions are fundamental to computer science. They provide a convenient way to search for simple patterns in text. On their own, they only say whether some text (a string) matches the expression or not; they say nothing about which parts of the string correspond to which parts of the regular expression. Matching works by converting the expression into a DFA (deterministic finite automaton) and feeding the string to it one character at a time. The DFA is in an accepting state whenever the string read so far matches. Once the whole string has been read, it matches if and only if the DFA is in an accepting state.
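As a minimal sketch of that matching loop, here is a hand-built DFA for the expression "ab*". The function name dfa_matches and the table layout (a start state, a set of accepting states, and a transition dictionary) are assumptions made for illustration; a real implementation would generate the table from the expression.

```python
def dfa_matches(start, accepting, delta, text):
    """Feed text to the DFA one character at a time; report whether it accepts."""
    state = start
    for ch in text:
        # A missing transition means the DFA has entered a dead state.
        state = delta.get((state, ch))
        if state is None:
            return False
    return state in accepting


# Hand-built DFA for "ab*": state 0 is the start, state 1 is accepting.
START = 0
ACCEPTING = {1}
DELTA = {(0, "a"): 1, (1, "b"): 1}

print(dfa_matches(START, ACCEPTING, DELTA, "abbb"))
print(dfa_matches(START, ACCEPTING, DELTA, "ba"))
```

The result is only a yes or no answer for the whole string, which is exactly the limitation discussed next.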

Of course, it is useful to know which parts of the regular expression correspond to which parts of the string. This is where capturing groups come in. Capturing groups treat the pattern as a sequence of regular expressions and return a consistent set of divisions between them, allowing you to chop the string up into sections. With this ability, you can easily extract fields from a document, or substitute text with other text, for example.

To make it more concrete, say you have strings of the form:

Hello, my name is Mike.

This could be divided into 3 regular expressions as:

A="Hello, my name is "
B=".*"   # Matches every string (the name)
C="[.]"  # Matches ending period 

Concatenating the regular expressions A, B, and C together, we would find that the above string matches, but we would not be able to figure out what the name of the person is. To find the name, we would need to know where the division between A and B and the division between B and C are. Then, we could extract the name as the part matching regular expression B. With capturing groups, it would look like this, using "|" to denote the divisions:

|        A         | B  |C|
|Hello, my name is |Mike|.|

Many modern regular expression systems work by backtracking, which can be very slow (exponential time in the worst case), instead of compiling to a DFA or something similar. However, using backtracking makes implementing capturing groups straightforward, and also allows you to add features that are not possible in true regular expressions, such as back references.

Capturing groups algorithm

The basic question is: given 2 regular expressions, how can we find a suitable place to put a division between them? Let's call the first regular expression A, and the second B. If we compile A and run it over the string, we will obtain a list of all of the places A could end, but we still don't know where B could start. The trick here is to reverse B to get Br. The reverse of a regular expression is still a regular expression. So, Br can be run backwards from the end of the string, and it will find the places where B can start. Taking the intersection of the two lists, we have the places where A can end and B can start, that is, the possible divisions between A and B. Exactly what we were looking for! If it turns out that there are no such places, the list will be empty, and we can promptly return no match.
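The idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are made up, and Python's backtracking re module stands in for the two DFA passes (testing every prefix against A and every suffix against B), so it shows what is computed, not the linear-time way of computing it.

```python
import re


def prefix_ends(a, text):
    """Positions i where text[:i] matches A, i.e. where A could end."""
    return {i for i in range(len(text) + 1) if re.fullmatch(a, text[:i])}


def suffix_starts(b, text):
    """Positions i where text[i:] matches B, i.e. where B could start.

    A real implementation would get these by running Br backwards from the end.
    """
    return {i for i in range(len(text) + 1) if re.fullmatch(b, text[i:])}


def divisions(a, b, text):
    """All places where a division between A and B can be put."""
    return sorted(prefix_ends(a, text) & suffix_starts(b, text))


print(divisions("Hello, my name is ", ".*[.]", "Hello, my name is Mike."))
print(divisions("a*", "a*b", "aaab"))
```

In the first call there is only one possible division (after the greeting); in the second, the a's can be shared between the two expressions in several ways, so several positions are returned.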

For capturing groups, there are usually many such divisions to find. This can be done by concatenating all of the regular expressions on each side of a division, finding that division, then repeating for the divisions on each side until none are left. For example, say there are 4 regular expressions: ABCD. We can use AB and (CD)r to find the middle division. Then, on the left, we can use A and Br to find the first division. On the right, we can use C and Dr to find the last division. This yields all 3 divisions. Note that once the first division is found, the remaining divisions are guaranteed to exist, because of what has already been matched. For example, the whole string matched ABCD, so after finding the middle division, the left part must match AB, meaning that it consists of an A followed by a B. Therefore, there is a place to put the division between them.
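A rough sketch of this divide-and-repeat scheme follows. The names split_groups, concat, and pick are hypothetical, and Python's re module is again used as a stand-in for the forward and reversed DFA passes. The pick argument chooses one position when several divisions are possible.

```python
import re


def divisions(a, b, text):
    """Positions where A can end and B can start (re stands in for the DFAs)."""
    ends = {i for i in range(len(text) + 1) if re.fullmatch(a, text[:i])}
    starts = {i for i in range(len(text) + 1) if re.fullmatch(b, text[i:])}
    return sorted(ends & starts)


def concat(parts):
    """Concatenate expressions, wrapping each so that "|" inside one stays local."""
    return "".join(f"(?:{p})" for p in parts)


def split_groups(parts, text, pick=max):
    """Return one substring per expression in parts, or None if there is no match."""
    if len(parts) == 1:
        return [text] if re.fullmatch(parts[0], text) else None
    mid = len(parts) // 2
    cuts = divisions(concat(parts[:mid]), concat(parts[mid:]), text)
    if not cuts:
        return None  # no division exists, so the whole thing does not match
    cut = pick(cuts)
    # Both halves are now known to match, so these calls cannot fail.
    left = split_groups(parts[:mid], text[:cut], pick)
    right = split_groups(parts[mid:], text[cut:], pick)
    return left + right


print(split_groups(["Hello, my name is ", ".*", "[.]"], "Hello, my name is Mike."))
print(split_groups(["a*", "b*", "c*", "d*"], "aabccd"))
```

The second call follows the ABCD example: the middle division is found first, then the divisions inside each half.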

For capturing groups, there is a concept called greedy matching. This means that a repetition (denoted with "*") will match as many times as it can. Similarly, an ungreedy repetition (denoted with "*?") will match as few times as possible. The algorithm above finds every possible position for a division. So, to make the expression on the left of the division greedy, take the last possible position. To make it ungreedy, take the first position.
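To illustrate the choice, here is a small sketch with a hypothetical divide function and a greedy flag. As before, it assumes Python's re module as a stand-in for the two DFA passes.

```python
import re


def divide(a, b, text, greedy=True):
    """Split text into the part matching A and the part matching B, or None."""
    ends = {i for i in range(len(text) + 1) if re.fullmatch(a, text[:i])}
    starts = {i for i in range(len(text) + 1) if re.fullmatch(b, text[i:])}
    cuts = sorted(ends & starts)
    if not cuts:
        return None
    # Greedy: A takes as much as possible. Ungreedy: A takes as little as possible.
    cut = cuts[-1] if greedy else cuts[0]
    return text[:cut], text[cut:]


print(divide("a*", "a*b", "aaab", greedy=True))
print(divide("a*", "a*b", "aaab", greedy=False))
```

With greedy=True the first expression takes all three a's; with greedy=False it takes none and leaves them to the second expression.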

Combining all of this gives a lot of flexibility. You can find the divisions in any order, and can do greedy, ungreedy, or anything in between for each. How you go about it is determined by what you are aiming for. The most human-friendly way would probably be to go left-to-right. This way you can specify min, max, etc. for each division, have it match that way, and easily follow what it should do. For min and max, you can abort the search as soon as the first suitable position is found. More complex rules, such as taking the median position, might require the full list of possibilities to be found.

Running time

Let N be the length of the string being searched and D be the number of divisions to be found. It is assumed that the regular expressions are already compiled. If we use the strategy of always searching for the median division, we will search through the whole string up to ceil(lg(D+1)) times in both directions. (lg is the logarithm base 2.) This is because the number of divisions left to find on each side is cut in half at each step, but between the two sides, the whole string has to be searched again. If a bit array is used to store the matching positions from the first pass, and we are restricted to max and min, the second pass can exit as soon as it is in an accepting state at a position that the first pass marked. This means that it takes between N and 2N character reads to complete both passes and find 1 division. So, the total running time is O(N*ceil(lg(D+1))). Furthermore, if there is no match, the first search finds no division, so it fails fast and takes just O(N) time.
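The two passes with a bit array and an early exit might look like the sketch below, for the max (greedy) case. Everything here is an assumption for illustration: the DFAs are hand-built tuples of (start state, accepting states, transition dictionary), the reversed DFA for B is written out by hand, and the names run_dfa and greedy_division are made up. For min, the passes would be done in the opposite order, so that the forward pass is the one that exits early.

```python
def run_dfa(dfa, symbols):
    """Yield, after 0, 1, 2, ... symbols, whether the DFA is in an accepting state.

    Stops early if the DFA enters a dead state (no transition defined).
    """
    start, accepting, delta = dfa
    state = start
    yield state in accepting
    for ch in symbols:
        state = delta.get((state, ch))
        if state is None:
            return
        yield state in accepting


def greedy_division(dfa_a, dfa_b_rev, text):
    """Last position where A can end and B can start, or None."""
    n = len(text)
    marks = [False] * (n + 1)  # bit array: marks[i] is True if A can end at i
    for i, ok in enumerate(run_dfa(dfa_a, text)):
        marks[i] = ok
    if not any(marks):
        return None  # A cannot end anywhere, so fail after one pass
    # Second pass: feed the string backwards to Br. After k symbols we are at
    # position n - k, so the first hit is the last possible division.
    for k, ok in enumerate(run_dfa(dfa_b_rev, reversed(text))):
        if ok and marks[n - k]:
            return n - k
    return None


# A = "a*"; B = "a*b", so Br = "ba*" (both DFAs written by hand).
DFA_A = (0, {0}, {(0, "a"): 0})
DFA_B_REV = (0, {1}, {(0, "b"): 1, (1, "a"): 1})

print(greedy_division(DFA_A, DFA_B_REV, "aaab"))
```

On "aaab" the forward pass reads the whole string, while the backward pass stops after a single character, which is the N to 2N range of character reads described above.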

If, instead, you were finding the divisions from left-to-right, you would have to search for the last regular expression each time. In the worst case, this is the majority of the string, making the search take O(N*D) time.

Given this large difference in the time needed, the simplest thing to do would be to use left-to-right semantics, so that it is easy for a human to understand, but in addition to min and max, allow a third option, "any". This would let the implementation pick a position arbitrarily when one choice is faster, and thus group the regular expressions together as in the faster approach, to the extent that doing so does not change the meaning of the regular expressions. This should give a good compromise between the 2 approaches.