Applied Code Reading: Debugging FreeBSD Regex

 

When the code we're trying to read is inscrutable, inserting print statements and running various test cases can be two invaluable tools. Earlier today I fixed a tricky problem in the FreeBSD regular expression library. The code, originally written by Henry Spencer in the early 1990s, is by far the most complex I've ever encountered. It implements sophisticated algorithms with minimal commenting. Also, to avoid code repetition and increase efficiency, the 1200 line long main part of the regular expression execution engine is included in the compiled C code three times after modifying various macros to adjust the code's behavior: the first time the code targets small expressions and operates with bit masks on long integers, the second time the code handles larger expressions by storing its data in arrays, and the third time the code is also adjusted to handle multibyte characters. Here is how I used test data and print statements to locate and fix the problem.

The Bug

The problem described in the original bug report is simple. In the following command

echo "ab" | sed -E "s/(()|.)(b)/[&]/"
the extended regular expression engine fails to capture the "a" character. In contrast, if we change the order of the two choices the result is correct.
echo "ab" | sed -E "s/(.|())(b)/[&]/"
This shouldn't happen. Interestingly, the author first encountered the bug on Mac OS X; this shows how widespread a bug can become.

To experiment with the code without dragging the complete implementation of sed with me, I first wrote a small test harness that takes as arguments a regular expression and a string and prints the matching part.

#include <stdio.h>
#include <regex.h>

int
main(int argc, char *argv[])
{
        regex_t r;
        regmatch_t m[10];
        int i;

        if (argc != 3) {
                fprintf(stderr, "usage: %s regex string\n", argv[0]);
                return (1);
        }
        if (regcomp(&r, argv[1], REG_EXTENDED) != 0) {
                fprintf(stderr, "%s: regcomp of %s failed\n", argv[0], argv[1]);
                return (1);
        }
        if (regexec(&r, argv[2], 10, m, REG_TRACE - REG_TRACE) != 0) {
                fprintf(stderr, "%s: string %s doesn't match RE %s\n", argv[0], argv[2], argv[1]);
                return (1);
        }
        for (i = m[0].rm_so; i < m[0].rm_eo; i++)
                putchar(argv[2][i]);
        return (0);
}
With that harness I could easily replicate the bug in a contained environment.
$ cc -g -o harness -I ../locale  regcomp.c regerror.c regexec.c regfree.c harness.c
$ ./harness '(()|.)(b)' ab
b
$ ./harness '(.|())(b)' ab
ab

Debug Print Statements

I guessed that running a debugger on code that has been included multiple times and macro-defined to death wouldn't be very productive, so I decided to trace the code through print statements. I used to think that debug print statements are for people who can't use a debugger, but over the years I've realized that debug print statements are a valuable tool. I searched through the code for the word print guessing that the original author would have used the same technique, and I was right. I quickly found code like the following.

#ifdef REDEBUG
static void print(struct match *m, char *caption, states st, int ch, FILE *d);
#endif

#ifdef REDEBUG
#define	SP(t, s, c)	print(m, t, s, c, stdout)
#define	AT(t, p1, p2, s1, s2)	at(m, t, p1, p2, s1, s2)
#define	NOTE(str)	{ if (m->eflags&REG_TRACE) printf("=%s\n", (str)); }
Knowing the magic name REDEBUG I immediately recompiled the code with the corresponding macro defined.
$ cc -g -DREDEBUG -o harness -I ../locale  regcomp.c regerror.c regexec.c regfree.c harness.c

The Wrong Character

Sadly, the debugging output appeared cryptic to my eyes.
saft b  8, 9, 10, 11, 12, 13
sboweow \37777777577    8, 9, 10, 11, 12, 13

=dissecting
diss b-\0 1-13
slow b-\0 2-9
sstart b        2
saft b  8, 9
sboweow \37777777577    8, 9
slow \0-\0 9-13
sstart  9
But I had on my hands two very similar test cases, one of which worked correctly. So I decided to compare the output of the two.
CorrectWrong
=dissecting
diss a-\0 1-13
slow a-\0 2-9
sstart a        2
=dissecting
diss b-\0 1-13
slow b-\0 2-9
sstart b        2
As you can see above in the erroneous case the dissection starts from the character b, instead of a. This made me realize that a so-called fast matching routine was refusing to match the a character in the erroneous case.

The Fast Matcher

By adding some additional debugging output to the fast matching routine I guessed that it worked by checking in a loop whether the states reachable from a given input character are different from the states reachable from the previous one. If the two state sets are the same, this means that no progress is being made and the character pointer coldp advances to ignore that character.

	if (EQ(st, fresh))
		coldp = p;
Again I found that the two cases that should have been the same differed.
CorrectWrong
start a         1, 2, 3, 5, 6, 7, 8, 9, 10, 11
Make cold point to a (same states before and after)
boweow a        1, 2, 3, 5, 6, 7, 8, 9, 10, 11
aft a           1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11

aft b           1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
start a         1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
Make cold point to a (same states before and after)
boweow a        1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
aft a           1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
Make cold point to b (same states, i.e. no progress)
aft b           1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
The above indicates that an additional state (maybe 4?) is incorrectly added in the erroneous case.

The State Transition Bug

So who adds those states? Another routine, step, is, according to its comment, responsible for mapping a set of states reachable before a character to the set reachable after it. This routine works by following the operators of the parsed regular expression and adding the corresponding states to a set which initially contains only the starting state. I added some debugging output, but initially all I got were very large meaningless numbers.

OP=1744830464	1,
OP=2013268920	1, 2
Switching the output to hexadecimal showed me that I was looking at some small bit values located on the number's most significant bits.
OP=68000000 	1,
OP=78000000 	1, 2,
and a quick search for the constants used in the corresponding case labels found me the code that defined them.
#define OPSHIFT ((unsigned)27)
// [...]
#define OANY    (5L<<OPSHIFT)   /* .            -                       */
// [...]
#define OLPAREN (13L<<OPSHIFT)  /* (            fwd to )                */
#define ORPAREN (14L<<OPSHIFT)  /* )            back to (               */
#define OCH_    (15L<<OPSHIFT)  /* begin choice fwd to OOR2             */
#define OOR1    (16L<<OPSHIFT)  /* | pt. 1      back to OOR1 or OCH_    */
#define OOR2    (17L<<OPSHIFT)  /* | pt. 2      fwd to OOR2 or O_CH     */
#define O_CH    (18L<<OPSHIFT)  /* end choice   back to OOR1            */
This (and a programmer's calculator) allowed me to hand-annotate the debugging output with the symbolic names of the corresponding operator symbols.
CorrectWrong
OP=68000000 O_LPAREN	1,
OP=78000000 O_CH_	1, 2,
OP=28000000 OANY	1, 2, 3, 5,
OP=80000000 OOR1	1, 2, 3, 5,
OP=88000000 OOR2	1, 2, 3, 5,
OP=68000000 OLPAREN	1, 2, 3, 5, 6,
OP=70000000 ORPAREN	1, 2, 3, 5, 6, 7,
OP=90000000 O_CH	1, 2, 3, 5, 6, 7, 8,
OP=70000000 O_RPAREN 	1, 2, 3, 5, 6, 7, 8, 9,

OP=68000000 		1, 2, 3, 5, 6, 7, 8, 9, 10,
OP=10000000 		1, 2, 3, 5, 6, 7, 8, 9, 10, 11,
OP=70000000 		1, 2, 3, 5, 6, 7, 8, 9, 10, 11,
OP=68000000 O_LPAREN	1,
OP=78000000 O_CH_	1, 2,
OP=68000000 OLPAREN	1, 2, 3, 6,
OP=70000000 ORPAREN	1, 2, 3, 4, 6,
OP=80000000 OOR1	1, 2, 3, 4, 5, 6,
OP=88000000 OOR2	1, 2, 3, 4, 5, 6, 8,
OP=28000000 OANY	1, 2, 3, 4, 5, 6, 7, 8,
OP=90000000 O_CH 	1, 2, 3, 4, 5, 6, 7, 8,
OP=70000000 O_RPAREN	1, 2, 3, 4, 5, 6, 7, 8, 9,

OP=68000000 		1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
OP=10000000 		1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
OP=70000000 		1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
In the above output O_CH_ stands for the start of a choice subexpression, OOR1 and OOR2 for the points immediately before and after the operator, and O_CH for the end of the choice subexpression. From the behavior of fast I already knew I was chasing a rogue state. So I noticed in the above that in the erroneous case an additional state was added after OOR1.

The code handling this case certainly seemed tricky:

if (ISSTATEIN(aft, here)) {
	for (look = 1; OP(s = g->strip[pc+look]) != O_CH; look += OPND(s))
		assert(OP(s) == OOR2);
	FWD(aft, aft, look);
}
I assumed that the FWD macro set the state corresponding to O_CH, but I found the value of look fishy for that purpose; the start value of 1 ringed to me an off-by-one alarm bell. For the small-sized regular expressions FWD is implemented with bit twiddling operators
#define FWD(dst, src, n) ((dst) |= ((unsigned long)(src)&(here)) << (n))
so I added corresponding debug statements in the OOR1 and O_CH cases. The output was enlightening.
CorrectWrong
OP=68000000 O_LPAREN	1,
OP=78000000 O_CH_	1, 2,
OP=28000000 OANY	1, 2, 3, 5,
OP=80000000 OOR1	1, 2, 3, 5,


OP=88000000 OOR2	1, 2, 3, 5,
OP=68000000 OLPAREN	1, 2, 3, 5, 6,
OP=70000000 ORPAREN	1, 2, 3, 5, 6, 7,
OP=90000000 O_CH	1, 2, 3, 5, 6, 7, 8,
OCH FWD set=100 << 1
OP=70000000 O_RPAREN 	1, 2, 3, 5, 6, 7, 8, 9,

OP=68000000 		1, 2, 3, 5, 6, 7, 8, 9, 10,
OP=10000000 		1, 2, 3, 5, 6, 7, 8, 9, 10, 11,
OP=70000000 		1, 2, 3, 5, 6, 7, 8, 9, 10, 11,
OP=68000000 O_LPAREN	1,
OP=78000000 O_CH_	1, 2,
OP=68000000 OLPAREN	1, 2, 3, 6,
OP=70000000 ORPAREN	1, 2, 3, 4, 6,
OP=80000000 OOR1	1, 2, 3, 4, 5, 6,
look for O_CH
OOR1 FWD set=20 << 3
OP=88000000 OOR2	1, 2, 3, 4, 5, 6, 8,
OP=28000000 OANY	1, 2, 3, 4, 5, 6, 7, 8,
OP=90000000 O_CH 	1, 2, 3, 4, 5, 6, 7, 8,
OCH FWD set=100 << 1
OP=70000000 O_RPAREN	1, 2, 3, 4, 5, 6, 7, 8, 9,

OP=68000000 		1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
OP=10000000 		1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
OP=70000000 		1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
After some head scratching I noticed that 20 << 3 == 100 whereas 100 << 1 == 200, so an additional wrong state was being introduced.

Fix and Test

In most cases, once you've found a bug the fix is easy. I simply added a + 1 to the code

if (ISSTATEIN(aft, here)) {
	for (look = 1; OP(s = g->strip[pc+look]) != O_CH; look += OPND(s))
		assert(OP(s) == OOR2);
	FWD(aft, aft, look + 1);
}
to make it set the state corresponding to O_CH. This indeed corrected the behavior for the specific problem.

But how could I be sure that my fix did not break something else? Fortunately, this regex implementation contains a regression test suite with a few hundreds of test cases. So, in the test specification file I added two lines corresponding to the two test cases of the bug report

# PR 130504
(.|())(b)       -       ab      ab
(()|.)(b)       -       ab      ab
and then removed my fix to verify that the test would indeed fail.
$ make
sh mkh  -i _REGEX_H_ ../regex2.h ../reg*.c >regex.tmp
[...]
./re < tests
477: ERE matched `b' instead
*** Error code 1
Finally, I reintroduced the fix and was happy to see that all the tests passed without a problem.
$ make
sh mkh  -i _REGEX_H_ ../regex2.h ../reg*.c >regex.tmp
[...]
./re < tests
./re -el < tests
./re -er < tests

In the end I guess I hadn't read more than 10% of the 3400 lines of the code. My essential problem-solving tools were running the code, adding debug print statements, and differential debugging.

Comments   Toot! Share


Last modified: Wednesday, September 16, 2009 9:44 am

Creative Commons Licence BY NC

Unless otherwise expressly stated, all original material on this page created by Diomidis Spinellis is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.