blog dds: 2007-08-28 — The Treacherous Power of Extended Regular Expressions

I wanted to filter out lines containing the word "line" or a double quote from a 1GB file. This can be easily specified as an extended regular expression, but it turns out that I got more than I bargained for.

What I wanted to do was to see all the unique error messages in a huge SQL error log. The log contained lines like these:

"INSERT INTO FILECOPIES VALUES(2,5127)"
Integrity constraint violation - no parent SYS_FK_123 table: FILES
SQL Error at 'stdin' line 15276525:

To see only the types of error messages included in the file (like "Integrity constraint violation") I wanted to filter out the offending SQL statements (which are enclosed in double quote characters) and the location information (which contains the word "line"). I therefore used the following pipeline:


egrep -v '"|line' sql.err |
sort -u |
more
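As a small sanity check of the filter itself, the following sketch runs the same expression over the three sample lines shown earlier. The file name sample.err is made up for the illustration, and grep -E is used as the POSIX spelling of egrep.

```shell
# sample.err is a hypothetical three-line file built from the log excerpt above
cat > sample.err <<'EOF'
"INSERT INTO FILECOPIES VALUES(2,5127)"
Integrity constraint violation - no parent SYS_FK_123 table: FILES
SQL Error at 'stdin' line 15276525:
EOF

# -v inverts the match: keep only lines that contain
# neither a double quote nor the string "line"
filtered=$(grep -E -v '"|line' sample.err | sort -u)
printf '%s\n' "$filtered"
```

Only the "Integrity constraint violation" line survives, which is the kind of output the pipeline was meant to produce for the whole log.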

More than an hour passed and the command appeared stuck. Running top revealed that grep (on GNU/Linux systems egrep, grep, and fgrep are the same program using different matching algorithms) had already consumed 80 CPU minutes. Running strace on the grep process showed me that it was reading less than 10K every second. This was going to take more than a day to complete.

I reasoned that the problem was the slowness introduced by the | (alternation) regular expression operator. Alternation is typically implemented using a nondeterministic finite state machine; implementations that simulate it by backtracking are inefficient and can even exhibit exponential running times on certain expressions. I therefore rewrote the pipeline using two fgrep invocations, each of which searches for a single fixed string:


fgrep -v line sql.err |
fgrep -v \" |
sort -u |
more

This command finished in less than a minute. Knowing your computer science theory can be a time saver.
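POSIX spells egrep and fgrep as grep -E and grep -F, and recent GNU grep releases warn that the old names are obsolescent. The sketch below shows the same two-filter idea with the modern spelling, together with a single-process variant that passes both fixed strings through repeated -e options. The file sample.err is a made-up three-line stand-in for the log; neither form has been timed against the original 1GB file here, so no speed claim is implied.

```shell
# sample.err is a hypothetical three-line stand-in for the real log
cat > sample.err <<'EOF'
"INSERT INTO FILECOPIES VALUES(2,5127)"
Integrity constraint violation - no parent SYS_FK_123 table: FILES
SQL Error at 'stdin' line 15276525:
EOF

# Two chained fixed-string filters, as in the pipeline above
two_pass=$(grep -F -v line sample.err | grep -F -v '"' | sort -u)

# One grep process with two fixed-string patterns:
# a line is dropped if either pattern occurs in it
one_pass=$(grep -F -v -e line -e '"' sample.err | sort -u)

printf '%s\n' "$two_pass"
printf '%s\n' "$one_pass"
```

Both forms keep the same lines. Which one is faster on a large input depends on the grep version and the locale, so it is worth measuring with time on a slice of the real file before committing to either.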
