blog dds: 2006-04-03 — Efficiency Will Always Matter

Many claim that today's fast CPUs and large memory capacities have made the time-proven technologies that efficiently harness a computer's power irrelevant. I beg to differ: my experience over the last three days showed that technologies that originated in the 70s still have their place today.

In computing, what we can do is often defined and limited by the available resources. Although we could run spreadsheet and word processing programs very effectively on the original IBM PC with its 4.77MHz CPU and 360KB floppy disks, other applications are only available to us thanks to the GHz processors and GB hard disks that we have today. Photorealistic first person shooter games, MPEG movies, Google, and Wikipedia simply couldn't exist 30 years ago. The corollary of this observation is that no matter how much computing power we have on our hands, there will always be applications that can be realized only by making the best and most efficient use of the available resources.

I realized this fact when I set out to analyze the complete dump of Wikipedia. This file is 20GB in its compressed format, and many times that size uncompressed. Uncompressed, it won't fit on the 270GB hard disk of the machine I'm using, it can't be loaded into a screen editor or an IDE, and its XML representation won't fit in the machine's main memory. Therefore, I performed all the processing using Unix pipelines and tools that work on streams, reading the data once from start to end without ever holding all of it. The Unix stream editor sed and the regular expression searching program grep were my friends here for locating patterns and examining the file. I also used other tools like sort and uniq to summarize aspects of the file, while head and tail allowed me to process parts of the results.
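
To make the streaming idea concrete, here is a minimal sketch in C++ of what a grep-like filter does: it reads its input one line at a time and keeps only a counter, so its memory use depends on the longest line rather than on the size of the file. This is not the code I ran; the function names and the `<title>` pattern in the usage note are assumptions chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Count the lines of `in` that contain `needle`, reading one line at a time.
// Only the current line and a counter are held in memory.
std::size_t count_matching_lines(std::istream& in, const std::string& needle) {
    assert(!needle.empty());  // an empty pattern would match every line
    std::size_t count = 0;
    std::string line;
    while (std::getline(in, line)) {
        if (line.find(needle) != std::string::npos) {
            ++count;
        }
    }
    return count;
}

// Convenience wrapper (hypothetical) for trying the function on a small
// in-memory sample before pointing it at a real stream.
std::size_t count_matching_lines_in(const std::string& text, const std::string& needle) {
    std::istringstream in(text);
    return count_matching_lines(in, needle);
}
```

Called as `count_matching_lines(std::cin, "<title>")` from a program placed at the end of a pipeline that decompresses the dump, a function like this would see the whole file without the uncompressed data ever touching the disk.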

When the time came to process the full file, I resorted to the flex lexical analyzer generator, C++ STL-based code, and bit vectors. I calculate that the program will take about four days to run and require about 7GB of main memory. Any less efficient technology would make my job a lot harder, if not impossible. Java's 4- or 8-byte per-object overhead, multiplied by the billions of objects I want to store in memory, would prove catastrophic. Although I haven't measured it yet, I bet that the overhead of a proper XML parser would also be prohibitive. In short, being able to use the most efficient technologies determined the type of problems I could solve.
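
The arithmetic behind that claim is easy to sketch. The code below is an illustration under stated assumptions, not my actual program: the one-billion object count in the usage note is a round number picked for the example, and `packed_flag_bytes`, `header_overhead_bytes`, and `IdSet` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Bytes needed to store n one-bit flags when packed eight to a byte.
constexpr std::uint64_t packed_flag_bytes(std::uint64_t n) {
    return (n + 7) / 8;
}

// Bytes spent purely on per-object headers for n objects,
// before any payload is stored.
constexpr std::uint64_t header_overhead_bytes(std::uint64_t n,
                                              std::uint64_t header_bytes) {
    return n * header_bytes;
}

// A set of integer ids in the range [0, universe) backed by a bit vector.
// std::vector<bool> is a specialization that mainstream standard libraries
// implement with one bit per element, although the standard does not
// strictly require that layout.
class IdSet {
public:
    explicit IdSet(std::size_t universe) : bits_(universe, false) {}

    void insert(std::size_t id) {
        assert(id < bits_.size());
        bits_[id] = true;
    }

    bool contains(std::size_t id) const {
        return id < bits_.size() && bits_[id];
    }

private:
    std::vector<bool> bits_;
};
```

For one billion objects, `header_overhead_bytes(1000000000, 8)` comes to 8,000,000,000 bytes of headers alone, while `packed_flag_bytes(1000000000)` is 125,000,000 bytes for a membership flag per object. That gap is the difference between a job that fits in main memory and one that does not.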

Don't get me wrong. Every technology has its place. I use Perl every day for tasks that take a few seconds to run, I'd never recommend C++ over Java for implementing an enterprise application, and I appreciate that PHP and Ruby on Rails make web programming painless. Nevertheless, claiming that these technologies are all we need today is myopic. There will always be a need to squeeze all the juice that we can from our machines, and the applications that do that will always be the ones pushing the forefront of what is possible.
