A Well-Tempered Pipeline
I am studying the use of open source software in industry. One way to obtain empirical data is to examine web server logs and see which operating systems and browsers the Fortune 1000 companies use. I obtained a list of the Fortune 1000 domains and wrote a pipeline that goes through this site's access logs and summarizes the results.
My first pipeline was the following.
fgrep -h -f fortune-1000-domains.txt access_log | sed 's/^\([^ ]*\).*"\(.[^"]*\)"$/\1 \2/' | sort | uniq -c >results.txt
An entry in the access log looks like this.
lj512272.crawl.yahoo.net - - [01/Jan/2008:00:00:15 +0200] "GET /blog/20040121/index.html HTTP/1.0" 200 4953 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
The sed substitution command works by matching the space-terminated host name at the start of the line, and then matching the sequence of characters within double quotes at the end of the line (the browser's user agent string).
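To see what the substitution produces, here is a minimal sketch that feeds the sample log entry above through the same sed command. The variable names `line` and `first` are mine, chosen only for illustration.

```shell
# The sample access log entry shown above.
line='lj512272.crawl.yahoo.net - - [01/Jan/2008:00:00:15 +0200] "GET /blog/20040121/index.html HTTP/1.0" 200 4953 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"'

# \1 captures the host name; \2 captures the last quoted string on the line.
first=$(printf '%s\n' "$line" | sed 's/^\([^ ]*\).*"\(.[^"]*\)"$/\1 \2/')

printf '%s\n' "$first"
```

The result is the host name followed by the user agent string, which is what sort and uniq -c then count.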
I felt a bit uneasy about the regular expression pattern I used, because the regular expression engine may need to try many candidate matches before it determines that it has found the quoted string at the end of the line. This requires backtracking and can be expensive.
Running top to see where my system was spending its time confirmed my suspicion.
  PID USERNAME THR PRI NICE  SIZE   RES STATE   TIME   WCPU COMMAND
63910 dds        1 132    0 1328K  728K RUN    15:01 86.18% sed
63913 dds        1  -8    0 1892K 1160K pipewr  0:09  0.10% fgrep
As you can see, fgrep, which was supposed to be doing the difficult work, was using just 0.1% of the CPU, while sed's trivial post-processing step was taking 86%. Obviously the pipeline was not properly balanced, and this would surely affect its performance.
I therefore rewrote the sed substitution command in a more imperative style.
sed 's/"$//;s/^\([^ ]*\).*"/\1 /'
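The rewritten command can be followed one step at a time. The sketch below applies the two substitutions separately to the sample log entry and then checks that the combined command gives the same result. The variable names (`line`, `step1`, `step2`, `both`) are illustrative only.

```shell
# The sample access log entry shown earlier.
line='lj512272.crawl.yahoo.net - - [01/Jan/2008:00:00:15 +0200] "GET /blog/20040121/index.html HTTP/1.0" 200 4953 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"'

# Step 1: remove the double quote that ends the line.
step1=$(printf '%s\n' "$line" | sed 's/"$//')

# Step 2: keep the host name and drop everything up to the last remaining
# double quote, which is now the one that opens the user agent string.
step2=$(printf '%s\n' "$step1" | sed 's/^\([^ ]*\).*"/\1 /')

# The two substitutions combined in a single sed invocation.
both=$(printf '%s\n' "$line" | sed 's/"$//;s/^\([^ ]*\).*"/\1 /')

printf '%s\n' "$step2"
[ "$both" = "$step2" ] && echo "combined command gives the same result"
```

Because the closing quote is already gone, the second pattern no longer has to be anchored at the end of the line, so there is only one place where the greedy `.*"` can end.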
This time the balance between the two processes was a lot more even.
  PID USERNAME THR PRI NICE  SIZE   RES STATE   TIME   WCPU COMMAND
63709 dds        1  -8    0 1328K  728K piperd  0:14 46.55% sed
63708 dds        1  -4    0 1892K 1156K getblk  0:03 11.34% fgrep
The performance difference between the two options was startling. The first took 1004 s to complete, while the second one took just 114 s: almost an order of magnitude improvement. This demonstrates the performance advantage of a well-tempered pipeline.
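Before trusting the faster command, it is worth checking that it yields the same summary as the slower one. The following sketch does this on a small synthetic log built by repeating the sample entry. The temporary file, the count of 1000 lines, and the variable names are assumptions made for the illustration and are not the data behind the timings above.

```shell
# The sample access log entry shown earlier.
line='lj512272.crawl.yahoo.net - - [01/Jan/2008:00:00:15 +0200] "GET /blog/20040121/index.html HTTP/1.0" 200 4953 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"'

# Build a synthetic log of 1000 identical entries (an arbitrary size).
log=$(mktemp)
i=0
while [ "$i" -lt 1000 ]; do
    printf '%s\n' "$line"
    i=$((i + 1))
done >"$log"

# Summarize with the original and with the rewritten sed command.
slow=$(sed 's/^\([^ ]*\).*"\(.[^"]*\)"$/\1 \2/' "$log" | sort | uniq -c)
fast=$(sed 's/"$//;s/^\([^ ]*\).*"/\1 /' "$log" | sort | uniq -c)
rm -f "$log"

[ "$slow" = "$fast" ] && echo "summaries match"
```

To compare running times on your own logs, you could run each sed pipeline under the shell's time facility. The figures will depend on the sed implementation and on the log contents.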