blog dds: 2009-01-25 — A Well-Tempered Pipeline

I am studying the use of open source software in industry. One way to obtain empirical data is to look at the operating systems and browsers used by the Fortune 1000 companies by examining browser logs. I obtained a list of the Fortune 1000 domains and wrote a pipeline to summarize results by going through this site's access logs.

My first pipeline was the following.


fgrep -h -f fortune-1000-domains.txt access_log |
sed 's/^\([^ ]*\).*"\(.[^"]*\)"$/\1 \2/' |
sort |
uniq -c >results.txt

The Unix fgrep command is optimized for searching its input for (a possibly large number of) fixed strings. To perform the search fgrep builds a data structure that looks at each character of its input at most once, even when matching hundreds of patterns. The sed command I used takes each matching log line and prints from it only the domain name and the client details (browser and operating system). A typical web browser log line looks as follows.

lj512272.crawl.yahoo.net - - [01/Jan/2008:00:00:15 +0200]
"GET /blog/20040121/index.html HTTP/1.0" 200 4953 "-"
"Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"

The sed substitution command works by matching the space-terminated name of the domain, and then matching a sequence of characters within double quotes at the end of the line.

I felt a bit uneasy about the regular expression pattern I used, because you may need to try various solutions until you determine that you have found the pattern at the end of the line. This requires backtracking and can be expensive.

Running top to see where my system was spending its time confirmed my suspicion.

  PID USERNAME    THR PRI NICE   SIZE    RES STATE    TIME   WCPU COMMAND
63910 dds           1 132    0  1328K   728K RUN     15:01 86.18% sed
63913 dds           1  -8    0  1892K  1160K pipewr   0:09  0.10% fgrep

As you can see, fgrep, which was supposed to be doing the difficult work was spending just 0.1% of the system's time, while the sed's trivial post-processing step was taking 86% of the time. Obviously the pipeline was not properly balanced, and this would surely affect its performance.

I therefore rewrote the sed substitution command in a more imperative style.


sed 's/"$//;s/^\([^ ]*\).*"/\1 /'

This removes the final quotation mark, and then greedily replaces the client's address and all text up to the final quotation mark with just the client's address.

This time the balance between the two processes was a lot more even.

  PID USERNAME    THR PRI NICE   SIZE    RES STATE    TIME   WCPU COMMAND
63709 dds           1  -8    0  1328K   728K piperd   0:14 46.55% sed
63708 dds           1  -4    0  1892K  1156K getblk   0:03 11.34% fgrep

The performance difference between the two options was startling. The first took 1004 s to complete, while the second one took just 114 s: an order of magnitude improvement. This demonstrates the performance advantage of a well-tempered pipeline.

Comments Post Toot! Tweet Share