The Treacherous Power of Extended Regular Expressions
I wanted to filter out lines containing the word "line" or a double quote from a 1GB file. This can be easily specified as an extended regular expression, but it turns out that I got more than I bargained for.
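As a minimal sketch of what such a filter can look like, and not necessarily the exact command used on the real log, the alternation operator of extended regular expressions covers both conditions in one pattern. The function name drop_noise and the sample input lines are made up for illustration; grep -E is the POSIX spelling of egrep.

```shell
# Hypothetical helper: drop every line that contains the word "line"
# or a double quote character. -E selects extended regular expressions,
# -v inverts the match so that matching lines are removed.
drop_noise() {
    grep -E -v 'line|"'
}

# Made-up sample input; the real input was a 1GB log file.
printf '%s\n' \
    'error at line 12' \
    'bad "quoted" value' \
    'disk full' \
    'disk full' |
drop_noise |
sort -u
```

On this sample the first two lines are removed by the filter, and sort -u collapses the two remaining identical lines into one.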
What I wanted to do was to see all the unique error messages in a huge SQL error log. The log contained lines like
"INSERT INTO FILECOPIES VALUES(2,5127)"
Integrity constraint violation - no parent SYS_FK_123 table: FILES
SQL Error at 'stdin' line 15276525:
To see only the types of error messages included in the file (like "Integrity constraint violation") I wanted to filter out the specific error messages (which contain a double quote character), and the line information (which contains the word line). I therefore used the following pipeline.
sort -u |
I reasoned that the problem was the slowness introduced by the | regular expression operator. This is typically implemented using a nondeterministic finite state machine, which uses inefficient backtracking and can even exhibit exponential running times on certain expressions. I therefore rewrote the pipeline using two fgrep invocations:
fgrep -v \" |
sort -u |