The Treacherous Power of Extended Regular Expressions

 

I wanted to filter out lines containing the word "line" or a double quote from a 1GB file. This can be easily specified as an extended regular expression, but it turns out that I got more than I bargained for.

What I wanted to do was to see all the unique error messages in a huge SQL error log. The log contained lines like

"INSERT INTO FILECOPIES VALUES(2,5127)"
Integrity constraint violation - no parent SYS_FK_123 table: FILES
SQL Error at 'stdin' line 15276525:
To see only the types of error messages included in the file (like "Integrity constraint violation") I wanted to filter out the specific error messages (which contain a double quote character), and the line information (which contains the word line). I therefore used the following pipeline.
egrep -v '"|line' sql.err |
sort -u |
more
More than an hour passed and the command appeared stuck. Running top revealed that grep (on GNU/Linux systems egrep, grep, and fgrep are the same program using different matching algorithms) had already consumed 80 CPU minutes. Running strace on the grep process showed me that it was reading less than 10K every second. This was going to take more than a day to complete.

I reasoned that the problem was the slowness introduced by the | regular expression operator. This is typically implemented using a nondeterministic finite state machine, which uses inefficient backtracking and can even exhibit exponential running times on certain expressions. I therefore rewrote the pipeline using two fgrep invocations:

fgrep -v line sql.err |
fgrep -v \" |
sort -u |
more
This command finished in less than a minute. Knowing your computer science theory can be a time saver.

Comments   Toot! Share


Last modified: Tuesday, August 28, 2007 10:37 am

Creative Commons Licence BY NC

Unless otherwise expressly stated, all original material on this page created by Diomidis Spinellis is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.