Choosing a Collection: A Discussion with Kent Beck
Recently I reviewed the mansucript of Kent Beck's upcoming book Implementation Patterns. I will certainly put it in the list of books any professional programmer should read. When discussing collections (containers in C++ STL parlance), Kent mentions that his overall strategy for performance coding with collections is to use the simplest possible implementation at first and pick a more specialized collection class when it becomes necessary. My view is that we should choose the most efficient implementation from the start. With prepackaged collections this doesn't have any cost associated with it, and it avoids nasty surprises when a dataset increases beyond the size the programmer envisaged. I added a comment to that effect in my review, and later I sent him an email with a supporting citation, which kindled an interesting exchange. I reproduce our email exchange here, with his permission.
From: Diomidis Spinellis To: Kent Beck Subject: Choosing a containerHi Kent,
In choosing a container I commented to use the simplest must efficient available, rather than just the simplest. Today I chanced on a corresponding reference:
Principle 93. Use Optimal Data StructuresDiomidis
From: Kent Beck To: Diomidis Spinellis Subject: Re: Choosing a containerThanks for the note. It certainly is important that your software perform at the level it needs to perform. Ten years ago I too would have said, "use optimal data structures". However, I recently had the experience of writing a graphics editor. From 20 years of practice I know many tricks to make graphics editors efficient. This time, though, I stuck to the simplest possible data structures and algorithms, figuring I could optimize when the time came. Much to my dismay, the editor was already fast enough, even for large drawings. All those neurons wasted... :-(
I like to do "envelope" calculations at the beginning of projects that may have performance problems, then write little prototypes to demonstrate that the technology can handle the load. Because of the ever-increasing resources available, though, I always prefer saving programmer time over saving excess CPU time. I'd probably say something like, "Use the simplest container that's fast enough."
From: Diomidis Spinellis To: Kent Beck Subject: Re: Choosing a containerYour reasoning is fine, if you can be absolutely sure about the size of your program's data set. In a graphics editor maybe this is limited by what can fit on a screen, but even then, are you sure that it will work with acceptable performance when people begin using it for VLSI designs with half a billion elements? I remember early versions of MS-Word worked fine with one-page letters and simple reports, but were unacceptably slow when used on book-size documents. Suboptimal data structures at work. At that time using an optimal data structure was expensive in terms of programmer effort (the neurons you refer to), but today it is simply a matter of choosing the correct name for the container you use.
I'm currently working on a refactoring browser for C program collections. Its footprint when it processes the Linux kernel (4 million lines, 5 million identifiers) is 4GB. When dealing with such numbers you can't afford to use suboptimal data structures or algorithms. Performance will byte you at the slightest provocation. In August I spent a two days to convert a tail-recursive function into a loop. This changed the processing time for the Linux kernel from 2.5 days to the current 4.6 hours. (8300 lines had a pathological macro expansion that took 10 seconds to expand; I can send you a DDJ article I wrote on this).
I think that choosing the appropriate container is the same as choosing matching colors when you dress. You look nicer, and, once in a while, it can make an important difference.
From: Kent Beck To: Diomidis Spinellis Subject: Re: Choosing a containerDiomidis,
When you need a program to scale, it's great to have it scale.
In evaluating the economics of design-for-performance, my experience is that I need to think about four scenarios: I need it to scale vs. I don't need it to scale and It can scale vs. it can't scale. The scenario when the program can't scale but I don't need it to scale is the cheapest. The scenario when the program can scale and I don't need it to scale means that I have incurred costs with no benefits (except the value of the option of scaling up some day). The scenario when the program can scale and I need it to scale is more expensive but more valuable. The scary scenario is when the program can't scale but I need it to scale. Then I need to improve the program. The big wild card in this analysis is that the additional time it takes to make the program scale also has value. In other words, sometimes some results soon have value in themselves, especially if I am exploring an innovative program.
To evaluate the economically-optimal stance towards design-for-performance I need to think about the costs, benefits, and probability of each quadrant, the cost of moving between quadrants, and factor in the value of early results. Evaluating the MS-Word example, for instance, suggests to me that Microsoft did the right thing. They got their product out and generated enough profits to fund performance improvements. If they had waited to release MS-Word until it could handle a book, they might easily have lost the market.
Evaluating my graphic editor, no one has ever used it to edit a VLSI design with a million elements. If I had used quad-tree sorting and all those other fun tricks to make it fast regardless of the number of elements, it would have taken me months to develop instead of days, months I just didn't have for a hobby project. Making it scale past my current needs wasn't an option.
If the containers really are equivalent in the sense that all I have to do is change one line of one configuration file to switch between them, then it doesn't much matter which I choose today as long as it can handle today's load. I can choose a container based on cost, availability, or provider reputation confident that I can fix tomorrow's performance problems tomorrow.
Engineering is about making tradeoffs, including tradeoffs about the time required for engineering. To be a good engineer I need to know all the tricks for extreme situations, but knowing when not to use them (or not to use them yet) is as important as knowing how to use them when I need them. One of the great things about software engineering is that it's also valuable to know how to use tricks one at a time. In software I can choose to get feedback quickly and improve scalability later.
Your refactoring browser is a good example of the value of knowing all the tricks. I'm sure processing the kernel required many necessary engineering techniques. As you mention, though, even you didn't know a priori all the tricks you needed. The program didn't scale far enough, you measured the bottleneck, and you improved the performance further.
I'm not trying to advocate a completely laissez-faire attitude to performance-related design decisions, even though that is what I choose most of the time. I think the question of how and when to design for performance is complicated and can only be illuminated, not solved.