blog dds

2014.09.25

First, Do No Harm

Let’s face it: not all software developers are superstar programmers (and, trust me, not all luminary developers program in a sane way). This means that when we maintain existing code, we must be very careful to avoid breaking or degrading the system we work on. Why? Because a failure of a running system can affect operations, people, profits, property, and sometimes even lives. Here are the rules.

Continue reading "First, Do No Harm"

2014.07.30

Service Orchestration with Rundeck

Increasingly, software is provided as a service. Managing and controlling the service’s provision is tricky, but tools for service orchestration, such as Rundeck, can make our lives easier. Take software deployment as an example. A well-run IT shop will have automated both the building of its software, with tools like make, Ant, and Maven, and the configuration of the hosts the software runs on, with CFEngine, Chef, or Puppet (see the post “Don’t Install Software by Hand”). Furthermore, version control tools and continuous integration will manage the software and the configuration recipes, handling developer contributions, reviews, traceability, branches, logging, and sophisticated workflows. However, these tools still leave a gap between the software that has been built and is ready to deploy, and the server that has been configured with the appropriate components and libraries and is ready to run it.
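
To make this concrete, here is the kind of per-node deployment step that an orchestration tool such as Rundeck can run, on demand and with an audit trail, across a whole fleet of servers. This is only a hedged sketch: the host, path, and service names are placeholders, not part of any real setup.

scp build/myservice.war deploy@web01:/opt/myservice/   # push the freshly built artifact
ssh deploy@web01 'sudo service myservice restart'       # activate it on that node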

Continue reading "Service Orchestration with Rundeck"

2014.04.24

Developing in the Cloud

Running a top-notch software development organization used to be a capital-intensive endeavor, requiring significant technical and organizational resources, all managed through layers of bureaucracy. Not anymore. First, many of the pricey systems and tools that we developers need to work effectively are usually available for free as open source software. More importantly, cheap, cloud-based offerings do away with the setup, maintenance, and user support costs and complexity associated with running these systems. Here are just a few of the services and providers that any developer group can easily tap into (you can find many more listed here):

  • Bitbucket and GitHub for version control;
  • Asana, Basecamp, FogBugz, GitHub, JIRA, Pivotal Tracker, Redmine, Trello, and YouTrack for issue tracking, project management, and collaboration;
  • BugSense, Crittercism, Exceptional, New Relic, and Sentry for remote application monitoring;
  • Pootle, Transifex, and WebTranslateIt for localization;
  • Google Hangouts, Grove.io, HipChat, Skype, and sqwiggle for real-time communication;
  • Gmail, Outlook.com, and Yahoo! Mail for email;
  • Confluence, Bitbucket, and GitHub for wiki work;
  • Etherpad, Google Docs, Stypi, and Microsoft Office 365 for collaborative editing;
  • Box, DropBox, Google Drive, and SkyDrive for file sharing;
  • Amazon Web Services, Google App Engine, Heroku, Microsoft Azure, Nodejitsu, and Rackspace for deployment servers;
  • Google Play Store, iOS / Mac App Store, and Windows Store for store fronts;
  • Braintree, Chargify, PayPal, and Stripe for payment processing;
  • Desk.com, Helpscout, and Zendesk for IT support;
  • CampaignMonitor, Mailchimp, mailgun, and Sendgrid for bulk email campaigns;
  • Optimizely for A/B testing;
  • Cloud9 and Visual Studio Cloud for programming; and
  • Bamboo, Cloudbees, and Codeship for continuous integration.

In addition, by hosting development servers on the cloud or by adopting the Vagrant virtual development environment configurator, a modern software shop can do away with the headache of setting up programmer workstations. Developers can simply use their own favorite device to connect to the team’s cloud-based or virtualized preconfigured development setup.
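
As a taste of the virtualized route, here is a hedged sketch of the Vagrant workflow; the box name is only an illustrative example, not a recommendation.

vagrant init hashicorp/precise64   # record the base box in a Vagrantfile
vagrant up                         # create and provision the virtual machine
vagrant ssh                        # work inside the preconfigured environment
vagrant destroy                    # throw the machine away when done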

The Meaning of Clouds

Life in the cloud has risks and costs different from those you face on Earth. However, dealing with them is rarely a big deal, especially for smaller organizations, which often face greater risks by juggling resources to provide similar services in-house. The most important issues concern control of the data you store and the services you use. In short, don’t bet the farm on a single obscure service provider. Select popular providers that support open data formats and that make your data available through standard protocols and APIs. This allows you to move elsewhere with minimal disruption if the service faces severe problems or if the cloud-based provider dissolves into thin air, as it were. Recognize vendor lock-in strategies and plan around them. Then you can have the peace of mind to enjoy the cloud’s many benefits.

Continue reading "Developing in the Cloud"

2014.01.15

Bespoke Infrastructures

In the 1920s, the Ford Motor Company embarked on an ill-fated attempt to establish an industrial town in the Amazon rainforest as a way to secure a cultivated rubber supply for its cars’ wheels. At the time, it already owned ore mines, forests, and a steel foundry to produce the raw materials for its cars; today, it buys from external suppliers, even its cars’ electronic control units. How do these two phases of the automotive industry’s history relate to the way we currently develop and adopt infrastructure in our profession?

Continue reading "Bespoke Infrastructures"

2013.12.06

The Frictionless Development Environment Scorecard

The environment we work in as developers can make a tremendous difference to our productivity and well-being. I’ve often seen myself get trapped in an unproductive setup through a combination of inertia, sloth, and entropy. Sometimes I put off investing in new, better tools; at other times I avoid the work required to automate a time-consuming process; and, as time goes by, changes in my environment blunt the edge of my setup. I thus occasionally enter a state where my productivity suffers death by a thousand cuts. I’ve also seen the same situation when working with colleagues: cases where to achieve a simple task they waste considerable time and energy jumping through multiple hoops.

Continue reading "The Frictionless Development Environment Scorecard"

2013.09.10

Differential Debugging

If estimating the time needed for implementing some software is difficult, coming up with a figure for the time required to debug it is nigh on impossible. Bugs can lurk in the most obscure corners of the system, or even in the crevices of third-party libraries and components. Ask some developers for a time estimate, and don’t be surprised if an experienced one snaps back, “I’ve found the bug when I’ve found the bug.” Thankfully, there are some tools that allow methodical debugging, thereby giving you a sense of progress and a visible target. A method I’ve come to appreciate over the past few months is differential debugging. Under it, you compare a known-good system with the buggy one, working toward the problem’s source.

Finding yourself in a situation with both a working and a buggy system is quite common. It might happen after you implement some new functionality, when you upgrade your tools or infrastructure, or when you deploy your system on a new platform. In all these cases, you might find yourself facing a system that should have been working but is behaving erratically for some unknown reason.
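
For instance, when a program misbehaves after an upgrade, you can let the two versions point out the discrepancy themselves by diffing traces of their system calls. The program names and input below are, of course, placeholders; this is a sketch of the approach rather than a recipe.

strace -o good.trace ./myprog-1.2 sample-input   # trace the working version
strace -o bad.trace  ./myprog-1.3 sample-input   # trace the misbehaving one
diff good.trace bad.trace | head                 # start from the first divergence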

Continue reading "Differential Debugging"

2013.07.25

Portability: Goodies vs. the hair shirt

“I don’t know what the language of the year 2000 will look like, but I know it will be called Fortran”

Continue reading "Portability: Goodies vs. the hair shirt"

2013.05.08

Systems Software

Systems software is the low-level infrastructure that applications run on: the operating systems, language runtimes, libraries, databases, application servers, and many other components that churn our bits 24/7. It’s the mother of all code. In contrast to application software, which is constructed to meet specific use cases and business objectives, systems software should be able to serve correctly any reasonable workload. Consequently, it must be extremely reliable and efficient. When it works like that, it’s a mighty tool that lets applications concentrate on meeting their users’ needs. When it doesn’t, the failures are often spectacular. Let’s see how we go about creating such software.

Writing

For an applications programmer, the first rule to consider when writing a vitally required piece of systems software is “don’t.” To paraphrase the unfortunate 1843 remark of the US Patent Office Commissioner Henry Ellsworth, most of the systems software that’s required has already been written. So, discuss your needs with colleagues and mentors, aiming to identify an existing component that fits your needs. The component could be a message queue manager, a data store, an embedded real-time operating system, an application server, a service bus, a distributed cache—the list is endless. The challenge is often simply to pin down the term for the widget you’re looking for.

Continue reading "Systems Software"

2013.03.14

Software Tools Research: SPLASH Panel Discussion

Written by Dennis Mancl and Steven Fraser

At the recent SPLASH (Systems, Programming, Languages and Applications: Software for Humanity) conference, one of us (Steven Fraser) organized an international group of experts to discuss challenges in software tools research. The panelists included Kendra Cooper (University of Texas, Dallas), Jim “Cope” Coplien (Gertrud & Cope), Junilu Lacar (Cisco Systems), Ruth Lennon (Letterkenny Institute of Technology), Diomidis Spinellis (Athens University of Economics and Business), and Giancarlo Succi (Free University of Bolzano-Bozen).

Continue reading "Software Tools Research: SPLASH Panel Discussion"

2013.01.23

The Importance of Being Declarative

A declarative programming style focuses on what you want your program to do rather than how to perform the task. Through diverse programming techniques, libraries, and specialized languages, you end up with code that sidesteps nitty-gritty implementation details, dealing instead with a task’s big picture.
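
A shell one-liner illustrates the shift in mindset. Counting, say, the error lines in a log file can be spelled out imperatively or stated declaratively; the file name is illustrative.

# Imperative: describe the looping and the counting yourself.
n=0
while read -r line; do
  case $line in *ERROR*) n=$((n+1)) ;; esac
done < app.log
echo "$n"

# Declarative: state what you want and let grep work out how.
grep -c ERROR app.log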

Continue reading "The Importance of Being Declarative"

2012.12.19

APIs, Libraries, and Code

Let’s say you want to display a JPEG-compressed image, calculate Pearson’s correlation coefficient, parse an XML file, or create a key-value store. You can often choose between using the functionality of the application’s platform (Java EE or .NET), calling one of several available external libraries, or writing the code on your own. It isn’t an easy choice because you have many factors to consider. Specifically, you must take into account the task’s complexity, as well as the licensing, quality, and support of competing alternatives. See how you can narrow down your choice by eliminating alternatives at the earliest possible decision point.

Where to Start?

There are clear advantages in writing your own code: you control its quality, your code plays well with the rest of the system (and you can even reuse other parts of it), you don’t introduce new dependencies, and you don’t need to make special arrangements to support the code. The main deciding factor here is the task’s complexity. You’re getting paid to deliver end results, not to reinvent the wheel. Unless the task is trivial to implement, professionalism dictates looking at existing solutions. Hand-crafting code to find the biggest number in a sequence is okay if a corresponding function isn’t directly available in your environment. On the other hand, unless you work for a game studio or Pixar, building a 3D rendering engine from scratch is definitely a no-go area.

Continue reading "APIs, Libraries, and Code"

2012.10.11

Virtualize Me

The virtual machine (VM) is the most dazzling comeback in information technology. IBM implemented a VM platform architecture in the late 1960s in its CP/CMS operating system. The company’s goal was to provide the time-sharing capabilities that its batch-oriented System/360 lacked. Thus a simple control program (CP) created a VM environment where multiple instances of the single-user CMS operating system could run in parallel. Thirty years later, virtualization was rediscovered when companies like VMware found ways to virtualize the less accommodating Intel x86 processor architecture. The popularity of Intel’s platform and the huge amount of software running on it made virtualization an attractive proposition, spawning dozens of proprietary and open source virtualization platforms within a decade.

Virtualization has progressed a lot from its primitive beginnings. Today, it lets us run most modern operating systems in a VM that can be hosted on facilities ranging from our laptop to a datacenter in the cloud. Once an operating system runs as a guest in a VM host (also known as a hypervisor), it becomes easy to control via high-level operations. You can save its image into a file, move it from one host to another, launch multiple clones, suspend it until it’s needed, share it with others, rent it as a service, or ship it to a customer. Techniques such as paravirtualization, in which the guest offloads some heavy lifting to the hypervisor, allow the guest systems to run efficiently and share hardware with a minimal waste of resources. Virtualization means never having to beg for a server.
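
Here is a hedged sketch of what such high-level operations look like with VirtualBox’s command-line front end; the machine names are illustrative.

VBoxManage controlvm buildbox savestate                    # suspend the guest
VBoxManage clonevm buildbox --name buildbox-2 --register   # clone it
VBoxManage startvm buildbox-2 --type headless              # run the clone without a console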

Continue reading "Virtualize Me"

2012.09.04

Don't Install Software by Hand

An IT system’s setup and configuration is a serious affair. It increasingly affects us developers mainly due to the proliferation and complexity of internet-facing systems. Fortunately, we can control and conquer this complexity by adopting IT-system configuration management tools.

Veni, Vidi, Vici

Start with the complexity of Internet-facing systems. In the past, most software applications were monolithic contraptions that could be installed in a basic standard environment. For enterprise and end-user applications, the environment would be the operating system; for embedded systems, it would be the underlying hardware. All we developers had to do was test the software’s deployment procedures and manage our software’s configuration with a version control system, so that we could deliver a known baseline to the customer for installation. Internet-facing systems consist of many parts that we can no longer control as a monolithic block of software. Availability and performance requirements drive the adoption of application servers, load-balancing solutions, relational database management systems, and disaster recovery setups, while interoperability requirements and complex standards make us use a multitude of third-party libraries and online services.
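
To appreciate what these tools take off our hands, consider the fragile sequence of steps we would otherwise type by hand on every host; the package, file, and service names below are merely illustrative. A configuration management tool replaces such steps with an idempotent, version-controlled description that can be applied to any number of machines.

apt-get install -y nginx              # install the web server
cp nginx.conf /etc/nginx/nginx.conf   # drop in the site-specific configuration
service nginx restart                 # activate the new configuration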

Continue reading "Don't Install Software by Hand"

2012.05.17

Git

Even given our field’s dizzying rate of progress, I wouldn’t have expected to revisit the subject of version control just six years after I first wrote about it in this column (Version Control Systems. Software, 22(5):108–109, September/October 2005). Yet here we are. The new kid on the block is git, a distributed revision control system available on all mainstream development platforms through a Free Software license. Git, the brainchild of Linus Torvalds, began its life in 2005 as the revision management system used for coordinating the development of the Linux kernel. Over the years its functionality, portability, efficiency, and third-party adoption have evolved by leaps and bounds to make it the leader of its category. (Two other systems with similar characteristics are Mercurial and Bazaar.)
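
If you haven’t tried it yet, everyday distributed work with git looks roughly like the following sketch; the repository URL and branch name are placeholders.

git clone https://example.com/project.git     # obtain a full copy of the history
cd project
git checkout -b fix-parser                    # work on a private branch
git commit -a -m 'Fix parser buffer overflow'
git push origin fix-parser                    # publish the branch for review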

Continue reading "Git"

2012.03.08

Package Management Systems

DLL hell was a condition that often afflicted unfortunate users of old Microsoft Windows versions. Under it, the installation of one program would render others unusable due to incompatibilities between dynamically linked libraries. Suffering users would have to carefully juggle their conflicting DLLs to find a stable configuration. Similar problems distress any administrator manually installing software that depends on incompatible versions of other helper modules.

Thankfully, these problems are now ancient history for most of us, thanks to the success of package management systems, which simplify software installation and maintenance by standardizing the production and consumption of software collections. Many modern operating systems and extensible applications use packages as the default software installation option; Table 1 lists commonly used ones.

Table 1. Commonly used package management systems
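
As a hedged taste of what such a system automates, here are typical Debian-style operations; graphviz is just an example package.

apt-get update               # refresh the package index
apt-get install graphviz     # install the package together with its dependencies
apt-cache depends graphviz   # inspect what it pulled in
apt-get remove graphviz      # uninstall it cleanly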

Continue reading "Package Management Systems"

2012.01.11

Refactoring on the Cheap

The refactorings that a good integrated development environment can perform are impressive. Yet, there are many reasons to master some cheap-and-cheerful alternative approaches. First, there will always be refactorings that your IDE won’t support. Also, although your IDE might offer excellent refactoring support for some programming languages, it could fall short on others. Modern projects increasingly mix and match implementation languages, and switching to a specialized IDE for each language is burdensome and inefficient. Finally, IDE-provided refactorings resemble an intellectual straitjacket. If you only know how to use the ready-made refactorings, you’ll miss out on opportunities for other code improvements.

In this column, I describe how you can harness the sophistication of your editor and the power of command-line tools to perform many simple refactorings on your own. As a bonus, you’ll see that you can write and run most of them in less time than is required to fire up your favorite IDE.
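
As an appetizer, here is a minimal rename refactoring performed with command-line tools. The identifier and file names are illustrative, and the sketch assumes the GNU versions of grep and sed.

# Find the files that use the identifier and rename it in place.
find . -name '*.java' |
xargs grep -lw oldName |
xargs sed -i 's/\boldName\b/newName/g'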

Continue reading "Refactoring on the Cheap"

2011.10.30

Lessons from Space

By Diomidis Spinellis and Henry Spencer

Continue reading "Lessons from Space"

2011.09.11

Faking it

This column is about a tool we no longer have: the continuous rise of the CPU clock frequency. We enjoyed this trend for decades, but in the past few years progress has stalled. CPUs are no longer getting faster because their makers can’t handle the heat of faster-switching transistors. Furthermore, increasing the CPU’s sophistication to execute our instructions more cleverly has hit the law of diminishing returns. Consequently, CPU manufacturers now package the constantly increasing number of transistors they can fit onto a chip into multiple cores—processing elements—and then ask us developers to put the cores to good use.

Continue reading "Faking it"

2011.07.03

Agility Drivers

When the facts change, I change my mind. What do you do, sir?

Continue reading "Agility Drivers"

2011.05.01

Choosing and Using Open Source Components

The developers of the SQLite open source database engine estimate that it’s deployed in roughly half a billion systems around the world (users include Airbus, Google, and Skype). Think of the hundreds of thousands of open source components, just one click away from you. If you know how to choose and use them effectively, your project can benefit mightily.

Choosing

Say you’re looking for a library to evaluate regular expressions, an embeddable scripting language, or an HTML rendering engine, but your search returns a dozen candidates. How do you choose among them? It’s actually quite easy: all you have to do is select the best one based on a few simple criteria associated with the software’s legal status, fitness, and quality.

Continue reading "Choosing and Using Open Source Components"

2011.02.27

elytS edoC

Sure, you can write English right to left. You can also write software code to look like a disc or even a train (see www.ioccc.org/1988/westley.c and 1986/marshall.c). However, you can’t then complain when you have to fight with your magazine’s editor or production staff about accepting your column’s title for publication, or if your colleagues refuse to touch your code with a 10-foot pole. Writing code in a readable and consistent style is difficult, uninteresting, tedious, underappreciated, and extremely important.

Why

Our code’s style encompasses formatting (things like indentation and spacing), commenting, program element order, and identifier names. Although most style choices won’t affect the compiled code or the program’s runtime behavior, style is a key aspect of the code’s maintainability. And because we write code once but read it many times over its life, it pays to keep our code in a style that’s easy to analyze, comprehend, review, test, and change.

Continue reading "elytS edoC"

2010.10.30

Farewell to Disks

A classic web-comic illustrates how idle Wikipedia browsing can lead us from the Tacoma Narrows Bridge to Fatal hilarity (and worse). The comic doesn’t show the path leading from A to B, and finding it is an interesting challenge—think how you would engineer a system that could answer such questions. I believe that this problem and a solution I’ll present demonstrate some programming tools and techniques that will become increasingly important in the years to come.

Continue reading "Farewell to Disks"

2010.08.28

UML, Everywhere

flowchart, n.: The innumerate misleading the illiterate.

— Stan Kelly-Bootle, “The Devil’s DP Dictionary”

A mechanical engineer who sees the symbol ⊥ in a diagram will immediately realize that a feature is specified to be perpendicular to another. In contrast, a software engineer looking at a diagram’s line ending with the symbol ◊ will, at best, wonder whether it denotes aggregation (as in UML), or a “zero or one” cardinality (as in IDEF1X), or something else invented by a creative academic. Worse, many developers will simply scratch their head in bewilderment.

Continue reading "UML, Everywhere"

2010.07.11

Code Documentation

Technical prose is almost immortal.

Continue reading "Code Documentation"

2010.03.04

Software Tracks

A generous car reviewer might praise a vehicle’s handling by writing that it turns as if it’s running on railroad tracks. Indeed, tracks offer guidance and support. When you run on tracks you can carry more weight, you can run faster, and you can’t get lost. That’s why engineers, from early childhood to old age, get hooked on trains. Can we get our software to run on tracks?

There are various tools that can give our software this ability: tools that increase the accuracy and speed of software development by forcing it to glide on a firm foundation, keeping it away from risky, unexplored territory. These tools span the complete spectrum of software building: from better programming abstractions to automating processes.

Types

The main tool for guiding the code’s direction is the language’s type system: a trusted friend who doesn’t allow us to veer in dangerous directions. That’s why programs written in languages with a powerful type system, like Haskell, often work error-free once they pass the compiler’s exacting checks. In contrast, passing around integers to represent anything from Boolean values, to enumerations, to file descriptors, to array indices, as is typically the case in C code, is a potent source of bugs. Similarly, programming by randomly assembling functions and procedures, as is the case in many languages that don’t enforce design abstractions for code, will run us into problems once the program’s size exceeds what can fit in our mind.

Continue reading "Software Tracks"

2009.10.21

Basic Etiquette of Technical Communication

Parents spend years trying to teach their children to be polite, and some of us had to learn at school how to properly address an archbishop. Yet, it seems that advice on courteousness and politeness in technical communication is in short supply; most of us learn these skills through what is euphemistically called “on the job training.” With enough bruises on my back to demonstrate the amount and variety of my experience in this area (though not my skill), here are some of the things I’ve learned.

Talking to Humans

We developers spend most of our time issuing instructions for computers to execute. This type of command-oriented work can easily lead to déformation professionnelle (see also J. Bigler’s alternative interpretation); I can still remember, years ago, a Navy officer who was talking to his son as if he were ordering around a sailor. When we compose a mail message or open a chat window, our keystrokes are directed to another human, not to a shell’s command-line interface. Therefore, we should switch our tone to courteousness, kindness, and consideration. “Please” and “thank you” aren’t part of SQL (or even Cobol; but interestingly “please” is an important part of Intercal), but they should be sprinkled liberally in every discussion between humans. Are you asking a colleague to do something for you at the end of the business day? This isn’t a batch job that a computer will run in the background. Think of how your request may affect your colleague’s family life. Ask him whether he can do it without too much hardship, and at the very least apologize for the urgency of your request.

Starting your exchange with some (sincere) flattery can work wonders. This is especially important if harsh criticism is to follow; it will help you express yourself in a more compassionate way and lift the spirits of the unfortunate soul who will read your words. Imagine the feelings of your email’s recipient by reading your message again through his eyes; according to Human Communication Theory, he will interpret the email more negatively than you intended it. Therefore, aim to encourage rather than complain. If your email is especially harsh, don’t send it immediately. Put it aside and sleep on it or ask other, more experienced, colleagues for advice. Although Google is experimenting with a feature that lets you revoke an email within a very small grace period, in general there’s no way to undo a sent message—you can only regret the damage it caused.

Continue reading "Basic Etiquette of Technical Communication"

2009.09.02

Job Security

My colleague, who works for a major equipment vendor, was discussing how his employer was planning to lay off hundreds of developers over the coming months. “But I’m safe,” he said, “as I’m one of the two people in our group who really understand the code.” It seems that writing code that nobody else can comprehend can be a significant job security booster. Here’s some advice.

Unreadable Code

Start by focusing on your code’s low-level details. Nothing puts off maintainers trying to take over your job more than code that brings tears to their eyes. Be inconsistent in all aspects of your code: naming, spacing, indenting, commenting, style. Every time there are multiple ways to implement something, throw dice and choose at random. Avoid writing similar code in comparable situations. Spend time coming up with coding tricks that nobody has ever used. Why write a = 0 when you can write a ^= a? Apply this advice liberally in the way you format expressions and statements. There’s only one generally accepted way to space between operators and operands; avoid it. Control flow statements are more fun because there are two schools on where to put braces; randomly switch between them to throw people off.

Unfortunately, these tricks won’t get you far, because beautifiers can readily bring your code up to scratch. However, when naming your variables, methods, fields, and classes, your choices can persist for decades; think of the Unix creat (sic) system call. Some languages, such as Java, have well-established naming conventions regarding capitalization and the joining of words. View them as an opportunity; these rules were designed to be broken. In other languages, such as C++, naming conventions are already severely broken or nonexistent. In this case, you can make your mark by using new names for existing concepts. For instance, name the methods for an iterator’s range start and finish, rather than begin and end. Further innovate by making those ranges symmetric rather than following the customary asymmetric style.

Continue reading "Job Security"

2009.04.15

Drawing Tools

1 Word = 1 Millipicture

— /usr/games/fortune

It’s no accident that in all engineering branches, our colleagues often communicate using drawings and diagrams. Given many artifacts’ scale and complexity, a drawing is often the best way to describe them. Uniquely, in software development we can easily derive pictures from code, and sometimes even code from pictures.

Yet we don’t seem to benefit from drawings in the way other engineers do. Have you ever printed a UML diagram on a large-format plotter? Perhaps part of the problem lies in the fleeting nature of software. Whereas a building’s blueprints can serve its engineers for decades, few of us want to spend valuable time drawing a diagram that will be obsolete in a few years, if not days. We can overcome these problems through tools that automate diagram creation, thus saving us time and helping us keep the diagrams up-to-date.
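
Graphviz’s dot is a good example of this approach: feed it a textual description kept next to the code, and it lays out the picture for you. A hedged sketch, with made-up module names:

echo 'digraph deps { main -> parser; parser -> lexer; main -> lexer; }' |
dot -Tsvg -o deps.svg    # lay out the graph and write it as SVG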

Continue reading "Drawing Tools"

2009.02.25

Start With the Most Difficult Part

There’s not a lot you can change in the process of constructing a building. You must lay the foundation before you erect the upper floors, and you can’t paint without having the walls in place. In software, we’re blessed with more freedom.

I recently experienced this when I implemented wpl, a small system that extends arbitrary Web pages with links to Wikipedia entries. (Try it at www.spinellis.gr/wpl.) The system has many parts: tools that convert the Wikipedia index into a longest-prefix search data structure; an HTML parser; code that adds links to phrases matching Wikipedia entries; and a Web front end that fetches the page, adds the links, and returns it back. As I was adding the finishing touches, I reflected on the process I used to construct the system. (I find a postmortem examination deeply satisfying, but this is probably because I’m not a medical doctor.)

What struck me were the different approaches I used to construct each of the system’s main parts. For the search data structure, I worked bottom-up: I first read about and experimented with a couple of alternatives, learning about Bloom filters, tries, and Patricia (Practical Algorithm to Retrieve Information Coded in Alphanumeric; I’m not making this up) trees. Next, I designed the data layout and the low-level bit-twiddling code to add and locate entries. Only then did I write the tools for building the index and an API for searching entries in the data structure.

For the HTML parser and word-linking code, I started somewhere in the middle. I wrote a state-transition engine to parse “tag soup” HTML (HTML that isn’t necessarily well formed), I extended it with code to add links to suitable text, and I added an appropriate interface.

Continue reading "Start With the Most Difficult Part"

2009.01.21

Brian Kernighan on 30 Years of Software Tools

As part of the IEEE Software 25th anniversary, Brian Kernighan graciously agreed to write a Tools of the Trade column. His article, titled Sometimes the Old Ways are Best, is now freely available online through the Computing Now web site.

Continue reading "Brian Kernighan on 30 Years of Software Tools"

2008.06.26

The Way We Program

If the code and the comments disagree, then both are probably wrong.

Continue reading "The Way We Program"

2008.05.02

Software Builders

The tools and processes we use to transform our system’s source code into an application we can deploy or ship were always important, but nowadays they can mean the difference between success and failure. The reasons are simple: larger code bodies; teams that are bigger, more fluid, and more widely distributed; richer interactions with other code; and sophisticated tool chains. All these mean that a slapdash software build process will be an endless drain on productivity and an embarrassing source of bugs, while a high-quality one will give us developers more time and traction to build better software.

Automate

The golden rule of software building is that you should automate all build tasks. The scope of this automation includes setting up the build environment, compiling the software, performing unit and regression testing, typesetting the documentation, stamping a new release, and updating the project’s web page. You can never automate too much. In a project I manage, I’ve arranged for each release distribution to pick up from the issue-management database the bugs that the release fixes and include them in the release notes. Automation serves three purposes: it documents the processes, it speeds up the corresponding tasks, and it eliminates mistakes and forgotten steps. (Did we correctly update the documentation to indicate the software’s current version?)
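
The driver for all this can be trivially simple. Here is a minimal sketch of a one-command build script; the targets and the version number are placeholders for whatever your project uses.

set -e                                  # stop at the first failing step
make clean all                          # compile the software
make test                               # run unit and regression tests
make doc                                # typeset the documentation
git tag -a v1.4.2 -m 'Release 1.4.2'    # stamp the new release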

Continue reading "Software Builders"

2008.03.01

Using and Abusing XML

Words are like leaves; and where they most abound, Much fruit of sense beneath is rarely found.

— Alexander Pope

I was recently gathering GPS coordinates and cell identification data, researching the algorithms hiding behind Google’s “My Location” facility. While working on this task, I witnessed the great interoperability benefits we get from XML. With a simple 140-line script, I converted the data I gathered into a de facto standard, the XML-based GPS-exchange format called GPX. Then, using a GPS-format converter, I converted my data into Google Earth’s XML data format. A few mouse clicks later, I had my journeys and associated cell tower switchovers beautifully superimposed on satellite pictures and maps.

Convenient versatility

XML is an extremely nifty format. Computers can easily parse XML data, yet humans can also understand it. For example, a week ago a UMLGraph user complained that pic2plot clipped elements from the scalable vector graphics (SVG—another XML-based format) file it generated. I was able to suggest a workaround that modified the picture’s bounding box, which was clearly visible as two XML tag attributes at the top of the file.
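
The same scriptability helps in everyday work. Two hedged examples, with journey.gpx standing in for whichever XML file you happen to be poking at:

xmllint --format journey.gpx | head    # pretty-print the XML for inspection
grep -c '<trkpt' journey.gpx           # rough count of the recorded track points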

Continue reading "Using and Abusing XML"

2008.01.13

Rational Metaprogramming

Metaprogramming, using programs to manipulate other programs, is as old as programming. From self-modifying machine code in early computers to expressions involving partially applied functions in modern functional-programming languages, metaprogramming is an essential part of an advanced programmer’s arsenal.

Also known as generative programming, metaprogramming leverages a computer language’s power. Rather than manipulating plain data elements, it manipulates symbols representing various complex operations. However, like all levers, metaprogramming can be a blunt instrument. Small perturbations on the lever’s short end (the language) result in large changes in the final product. In metaprogramming, this leverage can lead to unmaintainable code, insidious bugs, inscrutable error messages, and code-injection-attack vulnerabilities. Even when we take away industrial-strength compilers and interpreters, which are also metaprograms, we find metaprogramming wherever we look.

Continue reading "Rational Metaprogramming"

2007.11.10

On Paper

A box of crayons and a big sheet of paper provides a more expressive medium for kids than computerized paint programs.

— Clifford Stoll

This column came to life as I was trying to devise an algorithm for analyzing initializers for C arrays and structures. At the time I was using the CScout refactoring browser to look for possible differences between closed and open source code. I had already processed the Linux, FreeBSD, and Windows research kernel source code, and only the OpenSolaris kernel remained. Unlike the other three code bases, Sun’s code didn’t appear to use any exotic compiler extensions, so CScout uncomplainingly devoured one file after the next. Then, after approximately six hours of processing and 80 percent along the way, it reported a syntax error.

Most errors I encounter when processing C code with CScout are easy to handle. I add a macro definition to simulate a compiler built-in function, I fix a corner case in my code, or I add a grammar rule to cater for a compiler extension. This time it was different. Horrified, I realized that my implementation for handling C’s initializers was far from what was actually needed. The requirement to fully evaluate compile-time constants, which I had hidden under the carpet until then, was the least of my problems. Two days and 550 lines of code later, I found myself struggling with an algorithm to drill down, move along, and climb up a stack of data type stacks matching initialized elements to their initializers. And I was doing this on a sheet of paper.

Continue reading "On Paper"

2007.09.02

Abstraction and Variation

“Master, a friend told me today that I should never use the editor’s copy-paste functions when programming,” said the young apprentice. “I thought the whole point of programming tools was to make our lives easier,” he continued.

The Master stroked his long grey beard and pressed the busy button on his phone. This was going to be one of those long, important discussions.

“Why do you think copy-pasting is wrong?” asked the Master.

Continue reading "Abstraction and Variation"

2007.06.28

The Tools we Use

It is impossible to sharpen a pencil with a blunt ax. It is equally vain to try to do it with ten blunt axes instead.

— Edsger W. Dijkstra

Continue reading "The Tools we Use"

2007.04.30

Silver Bullets and Other Mysteries

It seemed like a good idea at the time.

—Ken Thompson, on naming the Unix system call to create a file "creat"

When conference participants interrupt a speaker with applause, you know the speaker has struck a chord. This happened when Alan Davis, past editor in chief of IEEE Software, gave a talk on improving the requirements engineering process at the NASSCOM (Indian National Association of Software and Services Companies) Quality Summit in Bangalore in September 2006. He was explaining why a marketing team will often agree with developers on additional features and a compressed delivery schedule that both sides know to be unrealistic. The truth is that this places the two parties in a Machiavellian win-win situation. When the product's delivery is inevitably delayed, the developers will claim that they said from the beginning that they couldn't meet the schedule but that marketing insisted on it. The marketing people also end up with a convenient scapegoat. If the product launch is a flop, they can say they missed a critical marketing time window owing to the product's delay. Where else are we playing such games?

Aging systems

Consider a 15-year-old software system. Its design doesn't match the environment it operates in, its original developers have matured during its lifetime, and hundreds of fixes and improvements have accumulated thick layers of "cruft" (redundant or poorly designed areas) all over its code base. Any sensible software engineer would argue that the system is ready for scrapping and rebuilding from scratch.

However, pointing out this fact is bad for all parties involved. It shows that developers haven't really done a stellar job over the years; they'll have to admit that many of their design decisions turned out to be incorrect. Getting a system's design wrong is natural because, first of all, the environment a system operates in changes the moment the system is installed and, second, because the expectations people have of a system change with time. To get a feeling of changed expectations, try typing a page on a typewriter—once considered to be the perfect tool for writing neat documents. In this context, hindsight is treacherous and unfair because it changes the rules of the game, after the game has finished. Although in the year 2000 Pets.com managed to raise US$82.5 million in an IPO, seven years later those same people who bought Pets.com stock would ask: "A site selling pet food over the Web? What were they thinking?"

Additionally, in our relatively young profession, many aging systems were originally written by rookie programmers who were cutting their teeth on code for the first time. So, the product is likely badly designed, its code lacking in structure and consistency. I often look at code I wrote several years ago, and I can immediately tell which phase of programming immaturity and folly I was in. There was a phase when I thought that "shrt idntfrs wr cool," one where I tried to exploit every trick of the C programming language—because I could—and one where I hadn't yet learned to comment my code (even if the only eyes that would see it were my own). I wonder what I'll think in 10 years of the code I write now.

Even if by chance a system was perfectly designed from the start to match the environment it would be used in 10 years later, its code would still show the signs of time. Successive fixes and improvements typically violate initial assumptions. Developers who fail to understand an aspect of the system's design will add their bit in a different way. Or, even if they understand the design, they might not understand the system's coding conventions and use different ones. This occurs often in systems written in languages such as C++, where incompatible identifier naming conventions coexist, even within the language's own libraries. Worse, other developers will duplicate code, violating the Don't Repeat Yourself (DRY) single-point-of-control principle and increasing the risks of future changes. In sum, the code will accumulate cruft and become unmaintainable.

We're accustomed to aging in the physical world. We know that people, dogs, cars, ships, clothes, and computers have a finite lifetime. Our experience with immaterial creations is mixed. The works of Homer, Shakespeare, Mozart, and, dare I say, the Beatles haven't really deteriorated over the years. On the other hand, a journal article in software engineering passes its prime (it reaches the so-called aggregate cited half-life) in eight years; in the sprightly field of nanotechnology, this figure is just four years. Unfortunately, we haven't yet come to terms with the idea that software ages, often beyond salvation.

On to silver bullets

So what can developers do when faced with an aged software system? They could simply come clean. Claim that the code that they were paid to design, write, and maintain is a pile of excrement and ask for another chance to do it right. Even the most thick-skinned and politically naive developer will, however, realize that this isn't a smart move. Coming clean is also a problem for the developers' managers, because they'll have to explain the mess to their higher-ups.

This is where a silver bullet comes in handy. Imagine a system that cost $300,000 to develop and in which, over the years, its owners invested another $700,000 to maintain and enhance. Starry-eyed managers might think they have a system worth a million dollars on their hands, but we all know that due to its age, the system's real value is a tiny fraction of that. The developers continually find themselves in the uncomfortable position of having to explain why new changes are so costly and time-consuming and why each improvement and fix introduces so many new bugs.

One day, a godsend order for a major enhancement comes in, and the developers estimate its cost at $500,000. Before their client has time to recover, they claim that a revolutionary new technology is available that will let them build, with this sum, both the existing system and the new enhancement from scratch. Moreover, by adopting this new technology, future enhancements will cost only a fraction of what they would cost using the old technology.

The precise nature of this technology claiming to offer dramatic productivity improvements is unimportant. At various times, this silver bullet has been known by names such as structured programming, object-oriented languages, 4GLs (fourth-generation programming languages), CASE (computer-aided software engineering) tools, RDBMSs (relational database management systems), XML, visual programming, n-tier architectures, managed code—the list goes on. What's important is that the move suits everybody perfectly. Developers can abandon their old code without having to explain the awkward truth; they'll also get to update their technical skills and brighten their employment prospects. Managers will be seen as heroes for taking a bold step with the new technology (in management, action is often mistaken for achievement). Conveniently, at this point an army of vendors will also step in to offer to their eager listeners supporting evidence and success stories. And—this is the icing on the cake—by following this route, developers and managers also buy an insurance policy. If the transition plan fails and the bullet's magical productivity increases don't materialize, they can claim that the technology is still immature or had hidden flaws.

What's my opinion of this charade? Software ages and becomes increasingly expensive to maintain. New technologies offer modest but not spectacular improvements in productivity. It's therefore sensible from time to time to rebuild a system from scratch. It might be harmless and politically expedient to claim that we've found a silver bullet, but it's even better to know what we're really doing.

* This piece has been published in the IEEE Software magazine Tools of the Trade column, and should be cited as follows: Diomidis Spinellis. Silver bullets and other mysteries. IEEE Software, 24(3):22–23, May/June 2007. (doi:10.1109/MS.2007.88)

Continue reading "Silver Bullets and Other Mysteries"

2007.04.09

I Spy

Knowledge is power.

—Sir Francis Bacon

The ultimate source of truth regarding a program is its execution. When a program runs, everything comes to light: correctness, CPU and memory utilization, even interactions with buggy libraries, operating systems, and hardware. Yet, this source of truth is also fleeting, rushing into oblivion to the tune of billions of instructions per second. Worse, capturing that truth can be a tricky, tortuous, or downright treacherous affair.

Often, to peek into a program's operation, we need to prepare a special version of it: compile it with specific flags or options, link it with appropriate libraries, or run it with suitable arguments. Often, we can't easily reproduce a problem, so our carefully crafted program version must be shipped to a customer waiting for the problem to appear again. Irritatingly, some of the ways we instrument programs make them too slow for production use or obfuscate and hide the original problem.

A family of tools ...

There's no shortage of techniques for spying on a program. If we care about CPU utilization, we can run our program under a statistical profiler that will interrupt its operation many times every second and note where the program spends most of its time. Alternatively, we can arrange for the compiler or runtime system to plant code setting a time counter at the beginning and end of each function, and examine the time difference between the two points. In extreme cases we can even have the compiler instrument each basic block of our code with a counter. Some tools that use these approaches are gprof and gcov under Unix; EJP and the Eclipse and NetBeans profiler plugins for Java programs; and NProf and the CLR profiler for .NET code. Memory utilization monitors typically modify the runtime system's memory allocator to keep track of our allocations. Valgrind under Unix and the Java SDK JConsole are two players in this category.

When we try to locate a bug, the approach we use involves either inserting logging statements at key locations in our program or running the code under a debugger, which allows us to dynamically insert breakpoint instructions. We discussed both approaches in the May/June 2006 column.

Nowadays, however, most performance problems and quite a number of bugs involve the use of third-party libraries or interactions with the operating system. One way to resolve these issues is to look at the calls from our code to that other component. By examining the timestamp of each call or looking for an abnormally large number of calls, we can pinpoint performance problems. The arguments to a function can also often reveal a bug. Tools in this category include ltrace, strace, ktrace, and truss under dialects of Unix, and APIS32 or TracePlus under Windows. These tools typically work by using special APIs or code patching techniques to hook themselves between our program and its external interfaces.

Finally, there are cases where our program works fine, but the operating system acts up. In these cases we must put a stethoscope on the operating system's chest to see what is going on. Fortunately, modern operating systems zealously monitor their operation and expose those figures through tools like vmstat, netstat, and iostat under Unix or the Event Tracing for Windows framework.

Most of the tools we've examined so far have been around for ages, and can be valuable for solving a problem once we've located its approximate cause. They also have a number of drawbacks: they often require us to take special actions to monitor our code, they can decrease the performance of our system, their interfaces are idiosyncratic and incompatible with each other, each one shows us only a small part of the overall picture, and sometimes important details are simply missing.

... and one tool to rule them all

The gold winner in The Wall Street Journal's 2006 Technology Innovation Awards contest was a tool that addresses all the shortcomings I outlined. DTrace, Sun's dynamic tracing framework, provides uniform mechanisms for spying comprehensively and unobtrusively on the operating system, the application servers, the runtime environments, libraries, and application programs. It is available in open source form under Sun's fairly liberal Common Development and Distribution License. At the time of writing DTrace is part of Sun's Solaris 10, and it is also being ported to Apple's Mac OS X version 10.5 and the FreeBSD operating systems. If you don't have access to DTrace, an easy way to experiment with it is to install a freely downloadable version of Solaris Express on an unused x86 machine. I must warn you however that I've found DTrace to be seriously addictive.

Unsurprisingly, DTrace is not a summer holiday hack. The three Sun engineers behind it worked for a number of years to develop mechanisms for safely instrumenting all operating system kernel functions, any dynamically linked library, any application program function or specific CPU instruction, and the Java virtual machine. They also developed a safe interpreted language that we can use to write sophisticated tracing scripts without damaging the operating system's functioning, and aggregating functions that can summarize traced data in a scalable way without excessive memory overhead. DTrace integrates technologies and wizardry from most existing tools and some notable interpreted languages to provide an all-encompassing platform for program tracing.

We typically use the DTrace framework through the dtrace command-line tool. To this tool we feed scripts we write in a domain-specific language called D; dtrace installs the traces we've specified, executes our program, and prints its results. D programs can be very simple: they consist of pattern/action pairs like those found in the awk and sed Unix tools and many declarative languages. A pattern (called a predicate in DTrace terminology) specifies a probe—an event we want to monitor. DTrace comes with thousands of pre-defined probes (49,979 on my system). In addition, system programs (like application servers and runtime environments) can define their own probes, and we can also set a probe anywhere we want in a program or in a dynamically linked library. For example, the command

dtrace -n syscall:::entry

will install a probe at the entry point of all operating system calls, and the (default) action will be to print the name of each system call executed and the process id of the calling process. We can combine predicates and other variables together using Boolean operators to specify more complex tracing conditions.

The name syscall in the previous invocation specifies a provider—a module providing some probes. Predictably, the syscall provider provides probes for tracing operating system calls; 463 probes on my system. syscall::open:entry is one of these probes—the entry point to the open system call. DTrace contains tens of providers, providing access to statistical profiling, all kernel functions, locks, system calls, device drivers, input and output events, process creation and termination, the network stack's management information bases (MIBs), the scheduler, virtual memory operations, user program functions and arbitrary code locations, synchronization primitives, kernel statistics, and Java virtual machine operations.

Together with each predicate we can define an action. This action specifies what DTrace will do when a predicate's condition is satisfied. For example, the following command

dtrace -n 'syscall::open:entry {trace(copyinstr(arg0));}'

will list the name of each opened file.

Actions can be arbitrarily complex: they can set global or thread-local variables, store data in associative arrays, and aggregate data with functions like count, min, max, avg, and quantize. For instance, the following program will summarize the number of times each process gets executed over the lifetime of the dtrace invocation.

proc:::exec-success { @proc[execname] = count(); }

In typical use, DTrace scripts range from one-liners, like the ones we saw, to scripts tens of lines long containing multiple predicate/action pairs. A particularly impressive example listed on a DTrace blog illustrates the call sequence from a Java method invocation, through the C libraries, an operating system call, down to the operating system device drivers. As software engineers we've spent a lot of effort creating abstractions and building walls around our systems; more impressively, it looks like we've also found ways to examine our isolated blocks holistically as one system.

* This piece has been published in the IEEE Software magazine Tools of the Trade column, and should be cited as follows: Diomidis Spinellis. I Spy. IEEE Software, 24(2):16–17, March/April 2007. (doi:10.1109/MS.2007.43)

Continue reading "I Spy"

2006.12.15

Cracking Software Reuse

[Newton] said, "If I have seen further than others, it is because I've stood on the shoulders of giants." These days we stand on each other's feet!

— Richard Hamming

Sometimes we encounter ideas that inspire us for life. For me, this was a Unix command pipeline I came across in the '80s:

tr -cs A-Za-z '\n' |
tr A-Z a-z |
sort |
uniq |
comm -23 - /usr/dict/words

This will read a text document from its standard input and produce a list of misspelled words. It works by transforming all nonalphabetic characters into new lines, folding uppercase letters to lowercase, sorting the resultant list of words, removing duplicates, and finally printing those words that don't appear in the system dictionary. By fixing the system dictionary's location, which has moved over the years, I successfully tested the pipeline on modern FreeBSD and Linux systems. Impressively, on one of those increasingly common multiprocessor machines, the pipeline used 1.25 of the two available processors—a feat, even by modern standards. However, when I first saw the pipeline, portability and multiprocessor use weren't on my radar screen. What impressed me was how five straightforward commands, running on a relatively simple system and supporting a few powerful abstractions, could achieve so much.

I hoped life in computer science would be a series of such revelations, but I was in for disappointment. Things appeared to be going downhill from then on. I saw systems become increasingly complex, the tools that first impressed me fall into disuse, and the concept of reuse preached more than practiced. A "hello world" program in the then shiny new X Window System or Microsoft Windows was a 100-line affair. I felt our profession had hit a new low when I realized that a particularly successful ERP (enterprise resource planning) system used a proprietary in-house developed database and programming language. It seemed Hamming was right.

An unexpected picture

Yet, progress moves in surprising ways. Nowadays, I'm proud of our achievements and optimistic about our future. Look at figure 1, depicting a position from what became known as the Game of the Century: a chess game played between Donald Byrne and 13-year-old Bobby Fischer on 17 October 1956. Although the game was remarkable, so is the ecosystem behind the picture.

Chessboard rendering
{{Chess diagram small|=
| tright
|
|=

 8 |rd|  |  |qd|  |rd|kd|  |=
 7 |pd|pd|  |  |pd|pd|bd|pd|=
 6 |  |nd|pd|  |  |nd|pd|  |=
 5 |  |  |ql|  |  |  |bl|  |=
 4 |  |  |  |pl|pl|  |bd|  |=
 3 |  |  |nl|  |  |nl|  |  |=
 2 |pl|pl|  |  |  |pl|pl|pl|=
 1 |  |  |  |rl|kl|bl|  |rl|=
    a  b  c  d  e  f  g  h

| The position after 11. Bg5.
}}
Figure 1. A chess board diagram and its layout description.

The picture on the left comes from the Wikipedia article on the game (http://en.wikipedia.org/wiki/The_Game_of_the_Century_(chess), as of 21 October 2006). To create it, one of the article's 65 contributors wrote the layout appearing on the figure's right, using a readable and concise domain-specific mini-language. Despite what you might think, this chessboard description language isn't an inherent part of MediaWiki—Wikipedia's engine. Instead, it's a MediaWiki template: a parameterized, reusable formatting element. About a dozen people wrote this particular template, using MediaWiki's low-level constructs, such as tables and images.

Digging deeper, we'll find that MediaWiki consists of about 175,000 lines of PHP (hypertext preprocessor) code using the MySQL relational database engine. A rough count of the C/C++ source code files in the PHP and MySQL distributions gives us 740,000 and 1.8 million lines, respectively. And underneath, we'll find many base libraries on which PHP depends, the Apache and Squid server software, and a multimillion-line GNU/Linux distribution. In all, we see a tremendously complex system that lets hundreds of thousands of contributors cooperatively edit two million pages—and still manages to serve more than 2000 requests each second.
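
Such rough counts are easy to reproduce. A minimal sketch, assuming an unpacked source distribution under the current directory (the file patterns are only illustrative):

find . -name '*.c' -o -name '*.h' |
xargs wc -l |
tail -1

The final "total" line that wc prints gives the ballpark figure; on very large trees xargs may invoke wc more than once, leaving several totals to add up.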

How we won the war

We must be doing something right. Keeping Wikipedia's components freely available has surely helped, but there's more than that in our recent successes. One important factor is that we've (almost) sorted out the technology for reuse. Huge organized archives, pioneered by the Comprehensive TeX Archive Network (CTAN) and popularized by Perl's Comprehensive Perl Archive Network (CPAN), let us publish and locate useful components. The package management mechanisms of many modern operating systems have simplified the installation and maintenance of disparate components and their intricate dependencies. Programming languages now offer robust namespace management mechanisms to isolate the interactions between components. Shared libraries have matured, providing us with vital savings in memory consumption: on a lightly loaded system, I recently calculated that shared libraries saved 97 percent of the memory space that we'd need without them. Widely used proprietary offerings, such as Microsoft's .NET and the Java Platform Enterprise Edition, have also helped code reuse by integrating into their libraries everything but the kitchen sink. In all, the factors determining our return on investment from the components we reuse have moved in the right direction: modern components, like the chess description language, offer more and demand less.

A second important factor in our successes is the emergence of new types of collaboration. Version-control systems, bug-management databases, mailing lists, and wikis form the glue of modern large development teams. At the same time, code repositories, RSS feeds, automatic software update systems, and more mailing lists bring together component producers and consumers. Claiming that the Internet has revolutionized software development might sound far-fetched, but we've got to remember that 20 years ago, systems of the size we've seen were developed only by NASA and large defense contractors, not volunteers working in their spare time.

Like many of my generation, one of my early sources of inspiration, predating the Unix pipeline, was Star Trek's USS Enterprise. I marveled at its intricate technology but always wondered how it was built and maintained, especially when pieces of it got torn apart in battles. Clearly, the development model that gave us the Doric beauty of Unix and its tools couldn't be extended to cover the Enterprise's baroque complexity. I used to think that I would have to take that particular aspect of the Star Trek offering with a pinch of salt. Now I see that we computing professionals are developing an ecosystem where large, intricate systems can grow organically. And this is another truly inspiring idea.

* This piece has been published in the IEEE Software magazine Tools of the Trade column, and should be cited as follows: Diomidis Spinellis. Cracking Software Reuse. IEEE Software, 24(1):12–13, January/February 2007. (doi:10.1109/MS.2007.9)

Continue reading "Cracking Software Reuse"

2006.09.01

Open Source and Professional Advancement

Doing really first-class work, and knowing it, is as good as wine, women (or men) and song put together.

— Richard Hamming

I recently participated in an online discussion regarding the advantages of the various certification programs. Some voiced skepticism regarding how well one can judge a person's knowledge through answers to narrowly framed multiple choice questions. My personal view is that the way a certification's skills are examined is artificial to the point of uselessness. In practice I often find solutions to problems by looking for answers on the web. Knowing where and how to search for an answer is becoming the most crucial problem-solving skill, yet typical certification exams still test rote learning. Other discussants suggested that certification was a way to enter into a job market where employers increasingly asked for experience in a specific technology. My reaction to that argument was that open source software development efforts offer us professionals a new and very valuable way to obtain significant experience in a wide range of areas. In this column I'll describe how we can advance professionally by contributing to open source projects.

Software development

The most obvious way to gain as a professional from open source software is by fixing and improving existing open source code. We all know that 40–70% of the effort that goes into a software system is spent after the system's first incarnation. Yet coursework and textbook exercises seldom ask us to maintain an existing system. On the other hand, there are many (perhaps too many) open source projects with lists chock-full of exciting additions and obscure bugs eagerly waiting for us developers to get our hands on them. By joining an existing open source software project we can immediately practice the art of maintaining other people's code and sharpen our corresponding skills. Also, once we get our hands dirty, we will see ourselves gradually adopting a code style that is more readable and maintainable.

Joining an open source project is an easy way to rub shoulders and interact with highly respected professionals. From day one we can see what their code looks like, how they address new issues, and how they interact with other developers. Even better, as we begin to contribute code to the project these colleagues may send us feedback that will help us improve. (The first comment I got from my FreeBSD mentor when I submitted code for review was that I had left blank spaces at the ends of lines. Up to that point I had never thought that these could be an issue; from then onward I configured my editor to color them yellow, so I could easily spot them.) I can't guarantee you friction-free interactions with the other developers, but those heated email exchanges can help us become better team players: we'll gradually learn to focus on the technical aspects of an argument and not take attacks on our code personally.
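
Incidentally, such trailing blanks are also easy to hunt down from the command line; a minimal sketch (the *.c file pattern is just an illustration):

grep -n '[[:blank:]]$' *.c

The -n flag reports the offending line numbers, and the POSIX [[:blank:]] class matches both spaces and tabs.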

Through participation in open source projects we can also perfect our non-verbal communication skills. Open source projects, being globally distributed, typically rely on a multitude of collaboration tools ranging from email and instant messaging to issue management databases, wikis, and version control systems. Communicating requirements, design and implementation options, and bug descriptions through these media in a precise and technically objective manner is an important skill in today's global marketplace, and one we will surely sharpen through our exchanges with our fellow open source developers.

A valuable feature of the open source landscape is the breadth of available applications, implementation technologies, and project sizes. By choosing cleverly we can maximize both our professional gain and our personal joy. We can select a project to learn a new technology, or to improve our skills in an existing one. We can thus transfer our skills from, say, ASP to Ajax, or cut our teeth on advanced Java programming through Eclipse's multi-million-line code base. We can also enter a new application domain, like game programming, networking, or kernel hacking. Finally, the wide diversity of open source project sizes gives us an excellent opportunity to inject variety into our professional life. If we work in a small startup we can join a large project to get a taste of a structured development process; if our company is process-heavy, joining a small project allows us to experience once again the joy of coding.

System administration

Increasingly, software systems are not monolithic blocks, but complex, large, heterogeneous ecosystems. In such an environment the software professional is required to be a system administrator: selecting, configuring, connecting, and tuning subsystems into a robust and efficient larger whole. Again, the typical classroom or corporate development setting is often a sterile affair involving preselected, preinstalled, and preconfigured components that just work. With open source software development we can get the larger picture. We'll get a chance to experiment with the components, tools, and our development environment, choosing and configuring a setup that works and allows us to be productive. At different times we wear the hats of a system administrator tinkering with operating system releases, a database administrator configuring a relational database, a security officer implementing our security policy and installing patches, and a network manager making the pieces of second-hand hardware junk that inevitably pile up in the basement of any self-respecting hacker talk to each other. Any of these skills is valuable in today's marketplace, and the cross-disciplinary mixture that we'll acquire from our involvement in open source projects is even more so.

Development process

Consider development practices like issue tracking, version control, unit testing, style guidelines, the daily build, code reviews, release engineering, and traceability. If you work in a small development group or a startup, chances are that you (or, worse, your boss) consider some of these practices obscure and irrelevant. Yet they are anything but. Although a single talented programmer can often get away with developing software by piling code layer upon code layer, this process is not sustainable in the long run. When the software and the team that builds it grow large, failure to adopt a process that includes the practices I named borders on hubris. Joining a large open source development project will give you first-hand experience with many cutting-edge development practices. Thus, apart from polishing your coding skills, you'll also become a better manager by observing how things you may have heard of only in a boring software engineering lecture actually work in large, real-world projects.

Later on, you'll hopefully also contribute. Despite the size and complexity of some large open source development efforts, most projects are still typically too light on process, so it's relatively easy for somebody with time and ideas to make a contribution in this area. Initially, this can be simply a skunk-works subproject you launch on your own: a framework for unit or regression testing, a bug-finding tool, or a more efficient release mechanism. As your idea is proven in the field and you gain the respect of other developers, this pet project of yours can be officially adopted.

Cashing in

Proponents of the concept of psychological egoism maintain that we are always motivated by self-interest, even when we behave altruistically: deep down we seek the better feeling we derive from our acts. This argument has been criticized as circular and unfalsifiable. Fortunately, when working on open source projects we won't have to entangle ourselves in this logic: there's nothing wrong with advancing professionally while helping worthwhile projects. Nobody (yet) has promised eternal life through code churning.

We have already seen how participation in open source projects can make us better programmers, system administrators, and managers. As contributors to open source projects we can also often gain a significant edge in interviews ("I see you're using Firefox as your browser. You know, I've implemented the hyphenation functionality in the text rendering module."). Developers with commit privileges in certain high-profile open source projects often find themselves in a seller's market. Demand for their skills typically outstrips the available supply, and they can therefore command better employment terms. Nevertheless, in the end, the best reward we gain from our participation in open source projects is the joy of contributing to a work that can potentially affect the lives of millions of people.

* This piece has been published in the IEEE Software magazine Tools of the Trade column, and should be cited as follows: Diomidis Spinellis. Open Source and Professional Advancement. IEEE Software, 23(5):70–71, September/October 2006. (doi:10.1109/MS.2006.136)

Continue reading "Open Source and Professional Advancement"

2006.07.01

Choosing a Programming Language

A language that doesn't have everything is actually easier to program in than some that do.

— Dennis M. Ritchie

Computer languages fascinate me. Like a living person, each one has its own history, personality, interests, and quirks. Once you've learned one, you can use it again after years of neglect, and it's like reconnecting with an old friend: you can continue discussions from the point they were broken off years before. For a task I recently faced I adopted a language I hadn't used for 15 years, and felt enlightened.

Let me start by stressing that I don't think there's one language suitable for all tasks, and probably there won't ever be one. In a typical workweek I seldom program in fewer than three different languages. The most difficult question I face when starting a new project is what language to use. Factors I balance when choosing a programming language are programmer productivity, maintainability, efficiency, portability, tool support, and software and hardware interfaces.

Hard choices

Often a single one of these factors is decisive and leaves little room for choice. If you have to squeeze your interrupt-driven code into a microcontroller's 1024 bytes of memory, assembly language or maybe C is the only game in town. If you're going to interface with a Java-based application server, then you'll write in Java. Sometimes tradition plays an important role. Systems code, like operating systems, device drivers, and utility programs, is typically written in C. Following this tradition means that the code will mesh well with its surrounding environment and won't impose onerous requirements for libraries and runtime environments on it.

At other times the choice of programming language is a fine balancing act. I find the power of C++ and its standard template library amazing: the combination provides me with extreme efficiency and expressiveness. At a price. The language is large and complex; after 15 years of programming in C++ I'm still often puzzled by the compiler's error messages, and I routinely program with a couple of reference books by my side. Time I spend looking up an incantation is time not spent programming. Modern object-oriented languages like Java and C# are more orthogonal and hide fewer surprises for the programmer, although the inevitable accumulation of features makes this statement less true with each new version of each language. It looks like Lehman's laws of software evolution ("as a program is evolved its complexity increases") haunt us on every front. On the other hand, sometimes you just can't afford Java's space overhead. I recently wrote a program that manipulated half a billion objects. Its C++ implementation required 3GB of real memory to run; a Java implementation would easily need that amount of memory just for storing the objects' housekeeping data. I could not afford the additional memory space, and I'm sure even our more generously funded colleagues at CERN, facing a one-petabyte-per-second data stream in their Large Hadron Collider experiment, feel the same way.

The situations I described, however, are outliers. In many more cases I find myself choosing a programming language based on its surrounding ecosystem. If I'm targeting a Windows audience, the default availability of the .NET framework on all modern Windows systems makes the platform an attractive choice. Conversely, if the application will ever be required to run on any other system, then using the .NET framework will make porting it a nightmare. Third-party libraries also play an important role here. Nowadays many applications are built by gluing together existing libraries. I recently calculated that each of the 20 thousand applications that have been ported to the FreeBSD system depends on average on 1.7 third-party libraries that are not available in the system's default installation; one application depends on 38 different libraries. Thus, for example, if your application requires support for 3D rendering, Bluetooth communications, the creation of PDF documents, an interface to a particular RDBMS, and public-key cryptography, you may find that these facilities are available only for a particular language.

Soft choices

When efficiency, portability, and library availability don't force a language on me, the next decisive factor is programmer productivity.

Interestingly, I've found that the same language features can promote or reduce productivity, depending on the work's scope. For small tool-type programs I write in the course of my work I prefer a language that sustains programmer abuse without complaint. When I want to put together a program or a one-line command in a hurry, I appreciate that Perl and the Unix shell scripting facilities don't require me to declare types and split my code into functions and modules. Other programmers use Python and Ruby in the same way.

However, when the program is going to grow large, will be maintained by a team, or will be used in a context where errors matter a lot, I want a language that enforces programming discipline. One feature I particularly appreciate is strict static typing. Type errors that the compiler catches are bugs my users won't face. Language support for splitting programs into modules and hiding implementation details is also important. If the language (or the culture of developing in that language) enforces these development traits, so much the better. Thus, although I realize one can write well-structured hundred-thousand-line programs in both Perl and Java, I feel that the discipline required to get this right in Perl is an order of magnitude higher than that required for Java, where even rookie programmers routinely split their code into classes and packages.

A language's supporting environment is also important here. Nowadays, a programmer's productivity in a given language is often coupled with the use of an IDE. Some tasks, like developing a program's GUI layout, are painful without an appropriate IDE, and some colleagues have become attached to a particular IDE in the same way I'm clinically dependent on the vi editor. Thus, choosing a language often involves selecting one that a particular IDE supports.

Declarative choices

There are also cases where a program's application domain will favor the expressive style of a specific language. The three approaches here involve using an existing domain-specific language, building a new one, or adopting a general-purpose declarative language.

If you want to get some figures from a database you might write SQL queries; if you want to convert an XML document into a report you should try out XSLT. Building a special-purpose language may sound daunting, but is actually not that difficult if one takes the appropriate shortcuts. Such an approach can be a tremendous productivity booster. Fifteen years ago I designed a simple line-oriented DSL to specify the parameters of a CAD system's objects. Instead of designing an input window layout for each input group, one simply specifies declaratively what the user should see and manipulate. Thus, the system's initial 150 parameters have effortlessly swelled over the years to 2400, surviving a port to a different GUI platform intact.

When I recently set out to design a way of specifying complex financial instruments, my first attempt was to design a DSL. However, the more I worked on the problem the more I realized that many of the features I wanted, like the manipulation of lists and trees, were already available in declarative languages like Prolog, Lisp, ML, and Haskell. After expressing a small subset of the problem in a number of these languages, I singled out Haskell, a language I had to learn when writing a compiler for it as an undergraduate student. It seemed to offer a concise way to express everything I wanted and a no-frills but remarkably effective development environment.

My biggest surprise came when I started testing the code I wrote. Most programs worked correctly the first time. I can attribute this to three factors. Haskell's strong typing filtered out most errors when I compiled my code. Furthermore, the language's powerful abstractions allowed me to concisely express what I wanted, limiting the scope for errors (research has shown that the errors in a program are roughly proportional to its size). Finally, Haskell, being a pure functional language, doesn't allow expressions to have side effects, and thus forced me to split my program into many simple, easy-to-verify functions. Over the years many friends and books have prompted me to evaluate the use of a functional language for implementing domain-specific functionality; as I continue to add Haskell functions to my program, I can see that the choice of the appropriate programming language can make or break a project.

* This piece has been published in the IEEE Software magazine Tools of the Trade column, and should be cited as follows: Diomidis Spinellis. Choosing a Programming Language. IEEE Software, 23(4):62–63, July/August 2006. (doi:10.1109/MS.2006.97)

Continue reading "Choosing a Programming Language"

2006.05.01

Debuggers and Logging Frameworks

As soon as we started programming, we found to our surprise that it wasn't as easy to get programs right as we had thought. Debugging had to be discovered.

— Maurice Wilkes discovers debugging, 1949

The testing, diagnostic, and repair equipment of many professions is horrendously expensive. Think of logic analyzers, CAT scanners, and dry docks. For us, the cost of debuggers and logging frameworks is minimal; some of them are even free. All we need to do to become productive is invest some time and effort in learning how to use these tools in the most efficient and effective way.

Assuming that the bug-finding systems we discussed in our previous column have given our program's code a clean bill of health, using a debugger or logging instrumentation is the most productive way to pinpoint errors that have managed to creep into our code. With these tools we can often get a starting point for locating a bug, and then also verify our hypotheses about what is going wrong. As one would expect, adopting an appropriate strategy and mastering the corresponding techniques are the important factors for getting the most out of these tools.

Debugging strategies

The most efficient debugging strategy is a bottom-up one: we start from the symptom and look for the cause. The symptom can be a memory access violation (for example, the dereferencing of a NULL pointer), an endless loop, or an uncaught exception. A debugger will typically allow us to get a snapshot of the program at the point where the symptom occurred. From that snapshot we can examine the program's call stack: the sequence of function or method invocations that led to the execution of the problematic code. At the very least we thus obtain an accurate picture of our program's runtime behavior. Even better, we can also examine the values of variables at each level of the stack to really understand what brought our program belly-up.
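
As an illustrative sketch with the GNU debugger, the whole bottom-up sequence can even be driven from the shell; ./prog is a hypothetical program that crashes:

gdb -batch -ex run -ex backtrace -ex 'info locals' ./prog

This runs the program and, once it faults, prints the call sequence and the variables of the innermost frame; in an interactive session the frame command then lets us move to the callers' frames and inspect their variables too.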

Unfortunately, there are times when we can't adopt a bottom-up strategy. This situation crops up when the bug's symptom can't be precisely tied to a debugger event. Our program may cause a problem in another application, or the contents of a variable may be wrong for reasons we can't explain. In such cases top-down is the name of the game. Debuggers allow us to step through the code, stepping over or into functions and methods. When we debug in a top-down fashion we initially step over bodies of code we consider irrelevant, narrowing down our search as we come nearer to the problem's manifestation. This strategy requires patience and persistence. Often we step over a crucial function and find ourselves having to repeat the search, aiming to step into that function the next time around. This process can be tiring, but sooner or later it will produce results.

There are also cases where we may have to debug a program at the level of assembly code: either because we don't trust the compiler, or because we don't have access to the program's source code. What I've found over the years is that assembly code is a lot less intimidating than it appears. Even if we don't know the processor's architecture, a few educated guesses and a bit of luck often allow us to decipher the instructions needed to pinpoint the problem.

Debugging techniques

Code and data breakpoints

Stack frame printouts and stepping commands are the basic and indispensable debugging tools, but there are more powerful commands that can often help us locate a tricky problem. A code breakpoint allows us to stop the program's execution at a specific line. We often use these to expedite a top-down bug search, by placing a breakpoint before the point where we think the problem lies. In such cases we use the breakpoint as a bookmark for the location where we want to look at the program's operation in more detail.

Less known, but no less valuable, are data breakpoints—also known as watchpoints. Many modern processors provide hardware support that will interrupt a program's execution when the code accesses the contents of specified memory locations. Data breakpoints leverage this support, allowing us to specify that the program's execution will stop when its code reads or writes a variable, an array, or an object. Note that debuggers that implement such commands without hardware support slow down the program's execution to a crawl, rendering this facility almost useless (Java tool builders, take note).
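
As a hedged example with GNU gdb, watching a (hypothetical) global variable looks like this; where the processor's debug registers are available, gdb sets a hardware watchpoint automatically:

gdb -ex 'watch error_count' -ex run ./prog

Execution then stops, with the old and new values displayed, every time the variable changes.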

Live, post-mortem, and remote debugging

Although the typical setup involves starting the misbehaving program under a debugger, there are also other debugging options that can often help us escape a tight corner.

Consider non-reproducible bugs, also known as Heisenbugs, so called because they make our program appear to operate under the spell of Heisenberg's uncertainty principle. We can often pinpoint those by debugging a program after it has crashed. On typical Unix systems, crashed programs leave behind an image of their memory: the core dump. By running a debugger on this core dump we get a snapshot of the program's state at the point of the crash. Windows, on the other hand, offers the option of launching a debugger immediately after a program has crashed. In both situations we can then look at the location of the crash and examine the values the variables had at the time. If the program hasn't crashed but is acting weirdly, we can attach a debugger to the running process and examine its operation from that point onward using the debugger's commands.
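
On a typical Unix system the corresponding incantations look roughly like this (the program name and process id are, of course, made up):

ulimit -c unlimited      # allow core dumps in this shell
./prog                   # the program crashes and leaves a core file
gdb ./prog core          # post-mortem examination of the dump
gdb -p 4711              # attach to a process that is still running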

Another class of applications that are difficult to debug comprises those whose interface is incompatible with the debugger's. Embedded systems, operating system kernels, games, and applications with a cranky GUI fall in this category. Here the solution is remote debugging. We run the process under a debugger, but interact with the debugger's interface on another system, connected through the network or a serial interface. This leaves the target system almost undisturbed, but still allows us to issue debugging commands and view their output from our debugging console.
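
A sketch with the GNU toolchain: gdbserver runs next to the target program, while the full debugger runs on our workstation (the host name and port are made up):

gdbserver :2345 ./prog                            # on the target system
gdb -ex 'target remote target-host:2345' ./prog   # on the workstation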

The Logging Controversy

Instructions in the program's code that generate logging and debug messages allow us to inspect a program's behavior without using a debugger. Some believe that logging statements are employed only by those who don't know how to use a debugger. There may be cases where this is true, but it turns out that logging statements offer a number of advantages over a debugger session, and therefore the two approaches are complementary.

First of all, the location and output of a logging statement is program-specific. Therefore, it can be permanently placed at a strategic location, and it will output exactly the data we require. A debugger, as a general-purpose tool, requires us to follow the program's control flow and manually unravel complex data structures.

Furthermore, the work we invest in a debugging session has only ephemeral benefits. Even if we save our setup for printing a complex data structure in a debugger script file, it would still not be visible or easily accessible to other people maintaining the code. I have yet to encounter a project that distributes debugger scripts together with its source code. On the other hand, because logging statements are permanent, we can invest more effort than we could justify in a fleeting debugging session to format their output in a way that will increase our understanding of the program's operation and, therefore, our debugging productivity.

Finally, logging statements are inherently filterable. Many logging environments, such as the Unix syslog library, Java's java.util.logging framework, and the Apache log4j logging services (http://logging.apache.org/), offer facilities for identifying the importance and the domain of a given log message. More impressively, Apple's OS X logging facility stores log results in a database and allows us to run sophisticated queries on them. We can thus filter messages at runtime to see exactly those that interest us. Of course, all these benefits are available to us only when we correctly use an organized logging framework, not simple println statements.
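
The Unix logger command, the shell-level front end to syslog, gives a taste of this tagging: each message carries a facility, a priority, and a tag on which the logging system can later filter (the tag and the message below are made up):

logger -t billing -p user.debug "recalculating invoice totals"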

As you can see, our tool bag is full of useful debugging tools. Being an expert user of a debugger and a logging framework is a sign of professional maturity. So, next time you encounter a bug, select the appropriate tool, go out, and slay it.

* This piece has been published in the IEEE Software magazine Tools of the Trade column, and should be cited as follows: Diomidis Spinellis. Debuggers and Logging Frameworks. IEEE Software, 23(3):98–99, May/June 2006. (doi:10.1109/MS.2006.70)

Continue reading "Debuggers and Logging Frameworks"

2006.03.01

Bug Busters

Although only a few may originate a policy, we are all able to judge it.

— Pericles of Athens

Popular folklore has our profession's use of the word bug originating from a real insect found in an early electromechanical computer. Indeed, on September 9, 1947, the Harvard Mark II operators did find a moth obstructing a relay's contacts. They removed it and dutifully taped it in the machine's logbook. However, engineers had been using the term "bug" many decades before that incident. For example, in an 1878 letter Edison used the term to refer to the faults and difficulties he was facing while moving from an invention's intuition to a commercializable product.

One approach for dealing with bugs is to avoid them entirely. For example, we could hire only the best software engineers and meticulously review every specification, design, or code element before committing it to a computer. However, following such an approach would be wasteful, because we would be underutilizing the many tools and techniques that can catch bugs for us. As Pericles recognized, creating a bug-free artifact is a lot more difficult than locating errors in it. Consequently, although humans and program generators are seldom able to cast large-scale bug-free code from scratch, bug-finding tools are both abundant and successful.

In our field one important paradigm for eliminating bugs is the tightening of the specifications of what we build; in a similar context an industrial engineer would seek to reduce variability by manufacturing to tighter tolerances. At the level of the program code, we can tighten up the specifications of our operations on different data types, of our program's behavior, and of our code's style. Furthermore, we can use many different approaches to verify that our code follows the specifications: the programming language, its compiler, specialized tools, libraries, and embedded tests are our most obvious friends here.

Languages

Modern programming languages do a great job in restricting many risky code constructs and expressions. First of all, structured languages (anything better than assembly language and old-style Fortran) prohibit, or at least impede, many programming tricks that can easily lead to unmaintainable spaghetti code. Even C, which provides us ample self-hanging rope with its support for goto and longjmp, doesn't allow arbitrary jumps across different functions. Once we properly indent our code we are also forced to split it into separate functions or methods: one would be mad to try to write code with more than a handful of indentation levels. This splitting eliminates bugs by promoting attributes like encapsulation and testability.

In addition, languages can often enforce correct behavior on our code. In Java, if a method can throw an exception, methods that call it will have to catch it or declare that they may also throw that exception; in C# we can ensure that resources we acquire will be properly disposed of by means of the using construct.

More importantly, languages with strong typing rules can detect many problems at compile time as data-type errors (adding apples to oranges). Obviously, errors we catch at compile time won't appear when the program runs: this is an effective way to eliminate many bugs. For example, the introduction of generics in Java 1.5 allows us to specify that a list container will only house strings; our program won't compile if we attempt to store a value of a different type in it. In earlier versions of Java, where the list would contain values of type Object—the least common denominator of all Java types—the error would manifest itself at runtime as a bug, when we attempted to cast an element retrieved from the list into a string.

Compiler Tricks

Even when the programming language allows us to write unsafe code, we can often ask the compiler to verify it for us. Many compilers will generate warnings when encountering questionable code constructs; we can save ourselves from embarrassing bugs by actually paying attention to them. However, many of us, when working under a pressing deadline, tend to ignore compiler warnings. We can deal with this problem by using another commonly supported compiler option that treats all warnings as errors: the code won't compile until we deal with all the warnings.
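
With the GNU compiler, for instance, the corresponding sketch looks like this (module.c stands in for your own sources):

gcc -Wall -Wextra -Werror -c module.c

The -Wall and -Wextra flags enable a wide set of warnings, and -Werror turns every one of them into an error that must be fixed before the build proceeds.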

We can also often help the compiler generate better warnings for us. Consider, for example, C's notoriously error-prone printf- and scanf-like functions. These functions require us to match the types specified in a format string with the supplied arguments. If we get this correspondence wrong, our program may crash, print garbage, or, worse, open itself to a stack-smashing attack. Some compilers will verify format arguments for the C library functions, but we often add our own functions with similar behavior, which the compiler can't check. For these cases, the GNU C compiler provides the __attribute__((format())) extension. We tag our own function declarations with the appropriate attribute, and the compiler will check the arguments for us.

Specialized Tools

Another way to eliminate bugs from our code is to pass it through one or more tools that explicitly look for problems in it. The progenitor of this tool family is lint, a tool Stephen Johnson wrote in the 1970s to check C programs for non-portable and error-prone or wasteful constructs. For example, lint will flag the construct if (b = 0) as an error, complaining of an assignment in conditional context; we probably intended to write if (b == 0). Nowadays we can find commercial and open-source lint-like tools for many commonly used languages. Some examples include CheckStyle, ESC/Java2, FindBugs, JLint, Lint4J, and PMD (covering Java), FxCop and devAdvantage (covering C#), and PC-lint (covering C and C++). Other tools specialize in locating security vulnerabilities—a class of bugs that stands out for its potentially devastating consequences. Tools in this category include Flawfinder, ITS4, Splint, and RATS.

Specialized tools can cover a lot more than what we could realistically expect a compiler to warn us about. For example, many tools will report violations of coding style guidelines, such as indentation and naming conventions. Furthermore, some tools are extensible: we can add rules particular to our own project (calls to launchMissile must be preceded by a call to openHatch), and we can precisely specify the rules that our project will follow. Integrating a code-checking tool into our build process, configuring its verification envelope, and extending it for our project becomes an important part of our development process. In some projects, a clean pass from the code-checking tools is a (sometimes enforced) prerequisite for checking code into the version control system.

Code

Finally, we can delegate bug busting to code. Many libraries contain hooks or specialized builds that can catch questionable argument values, resource leaks, and wrong ordering of function calls. As a prime example, consider the C language dynamic memory allocation functions—a potent source both of bugs and of research papers describing versions of the library that can catch them. You can catch many of these bugs by using the valgrind tool, by loading the watchmalloc.so library (under Solaris), or by setting the MALLOC_CHECK_ or MALLOC_OPTIONS environment variables (under GNU/Linux distributions and FreeBSD, respectively).
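
A hedged sketch of the corresponding invocations (./prog again stands in for your own program):

valgrind --leak-check=full ./prog    # run under valgrind's memory checker
MALLOC_CHECK_=2 ./prog               # GNU libc: diagnose heap corruption and abort
MALLOC_OPTIONS=AJ ./prog             # FreeBSD: abort on errors, fill allocations with junk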

In our own code we have even more options at our disposal. We can sprinkle our code with assertions, expressing preconditions, postconditions, and invariants. Any violation of them will trigger a runtime error and help us pin down a possibly difficult-to-locate bug. At a higher level, we can instrument our classes with unit tests, using the JUnit testing framework or the equivalent for our environment. When we're churning out code, unit tests will identify many early bugs in it; later on, when we focus on maintenance activities, unit tests will ring a bell when we introduce new bugs.

* This piece has been published in the IEEE Software magazine Tools of the Trade column, and should be cited as follows: Diomidis Spinellis. Bug Busters. IEEE Software, 23(2):92–93, March/April 2006. (doi:10.1109/MS.2006.40)

Continue reading "Bug Busters"

2006.01.01

Project Asset Portability

It's said that real computer scientists don't program in assembler; they don't write in anything less portable than a number two pencil. Joking aside, at the end of the 1970s, the number of nonstandard languages and APIs left most programs tied to a very specific and narrow combination of software and hardware. Entire organizations were locked in for life to a specific vendor, unable to freely choose the hardware and software where their code and data would reside. Portability and vendor independence appeared to be a faraway, elusive goal.

Fortunately, several standardization efforts pinned down languages like C, C++, and SQL as well as interfaces like Posix and ODBC. So, if you're careful and not too ambitious, you can now implement large systems that will run on several different platforms, such as Windows, Linux, and Mac OS X. The Apache Web server, Perl interpreter, and TeX typesetting system are prominent examples of this reality. Furthermore, Java—with its standardized application binary interface and rich set of class libraries—is now conquering many new portability fronts. I routinely move Java programs, Ant build files, and HSQLDB scripts between different operating systems and processor architectures without changing an iota.

However, these victories, impressive as they are, shouldn't distract us from the fact that we're fighting the last war. Nowadays, a software system's program source code makes up only a small part of its assets stored as bits—taking up a larger percentage are specifications, design diagrams, application server deployment scripts, build rules, version history, documentation, and regression tests. Figure 1 shows the relative size of four asset classes that I was able to easily measure in the FreeBSD operating system. The figure shows that the source code is actually third in size, after the version control system's historical data and the database of outstanding and solved issues. (The different assets shown in the figure are stored in textual format, so their sizes are directly comparable.) Notice that only the source code and the documentation—less than 25 percent of the total assets—are relatively portable between different tools. The version history and the issues are stored in tool-specific formats that hold the project hostage to their corresponding tools.

FreeBSD project assets pie chart
Figure 1. Relative size of different assets in the FreeBSD operating system project.

Where do you stand?

How is the situation in your organization? Can you easily move your design diagrams from one UML-based tool to another, change the repository where your specifications are stored, run your regression tests on a different platform, or switch to a different configuration management system? Unless your organization still uses paper records for these tasks (in which case your problems are in a totally different league), chances are you dread even the thought of changing the tools you use.

Yes, in modern organizations, tool flexibility is becoming increasingly important. We can't afford to have a project's assets locked in proprietary data formats: software projects often change hands through corporate acquisitions and reorganizations, and we're increasingly outsourcing development. If a project's new home lacks the corresponding tools or engineers trained to use them, development will most likely continue on the lowest common denominator—the source code. All effort put into specifications, design, configuration management, and testing will be lost. A similar thing happened in the previous era of nonstandard languages. Organizations sometimes had to run program binaries on top of thick layers of emulation because the new hardware lacked the compilers and tools required for working with the source code. Those organizations faced a rude shock when they had to verify and fix their software for year-2000 compliance.

An open market

The portability of a project's nonsource code assets means more than allowing those assets to move freely between different organizations and developers. Portability also means that a marketplace for tools can evolve without the artificial restrictions of the vendor lock-ins imposed by incompatible data formats and associated switching costs. Such an environment would let different tools compete on their actual technical merits, without the artificially cozy protection of their installed base's captive audience. Furthermore, engineers could experiment with different tools to evaluate their merits, and, hopefully, exploit interoperability, using multiple tools to profit from their complementary strengths. Competition might also lower the cost of the corresponding tools, making them accessible to a wider community of users.

Remember the Unix wars? Many, now defunct, Unix vendors damaged themselves and their entire industry as they fiercely battled to lock in their customers with proprietary extensions. We don't want the software tool industry to suffer a similar ordeal.

Realizing the importance of the portability of a project's non-code assets just gets us out the door. Our first step should then be to inventory those assets to appreciate the problem's extent. Next, we should devise and standardize simple, powerful, and comprehensive formats for moving these assets between different tools. To minimize the effect of vendors' "embrace, extend, and extinguish" tactics, we should organize interoperability exhibitions, where vendors could compete on how well their tools work with each other. We should aim to make moving project data in its portable format as painless as editing the same text file with two different editors. As a counterexample, I understand that the interoperability of UML design tools through XMI (XML Metadata Interchange) is woefully inadequate right now; we must do better than that.

Won't the standardization I'm proposing put an end to tool innovation? By the time you read this column, Marc Rochkind's paper, "The Source Code Control System," will be 30 years old. However, modern configuration management systems don't differ radically from Rochkind's SCCS. Issue management systems are also competing on trivialities. Let's face it: a whole generation of tools has matured. So, it's high time to end gratuitous differences in the project's asset data schema and interchange format and let the tools compete on stability, usability, performance, and data manipulation capabilities. Databases, compilers, and Web browsers have all flourished under a regime of standardized interfaces; it's time to give our tools the same chance.

* This piece has been published in the IEEE Software magazine Tools of the Trade column, and should be cited as follows: Diomidis Spinellis. Project Asset Portability. IEEE Software, 23(1):100–101, January/February 2006. (doi:10.1109/MS.2006.28)

Continue reading "Project Asset Portability"

2005.11.01

Working with Unix Tools

A successful [software] tool is one that was used to do something undreamed of by its author.

— Stephen C. Johnson

Line-oriented textual data streams are the lowest useful common denominator for a lot of the data that passes through our hands. Such streams can represent program source code, web server log data, version control history, file lists, symbol tables, archive contents, error messages, profiling data, and so on. For many routine, everyday tasks we might be tempted to process the data with a Swiss army knife scripting language like Perl, Python, or Ruby. However, to do that we often need to write a small, self-contained program and save it into a file. By that point we've lost interest in the task and end up doing the work manually, if at all. Often, a more effective approach is to combine programs of the Unix toolchest into a short and sweet pipeline that we can run from our shell's command prompt. With the modern shell's command-line editing facilities we can build our command bit by bit, until it molds into exactly the form that suits us. Nowadays, the original Unix tools are available on many different systems, like GNU/Linux, Mac OS X, and Microsoft Windows, so there's no reason why you shouldn't add this approach to your arsenal.

Many one-liners that you'll build around the Unix tools follow a pattern that goes roughly like this: fetching, selection, processing, and summarization. You'll also need to apply some plumbing to join these parts into a whole. Jump in to get a quick tour of the facilities.

Getting the data

Most of the time your data will be text that you can directly feed to the standard input of a tool. If this isn't the case, you'll need to adapt your data. If you're dealing with object files, you'll have to use a command like nm (Unix), dumpbin (Windows), or javap (Java) to dig into them. If you're working with files grouped into an archive, then a command like tar, jar, or ar will list the archive's contents for you. If your data comes from a (potentially large) collection of files, find can locate those that interest you. On the other hand, to get your data over the web, use wget. You can also use dd (and the special file /dev/zero), yes, or jot to generate artificial data, perhaps for running a quick benchmark. Finally, if you want to process a compiler's list of error messages, you'll want to redirect its standard error to its standard output; the incantation 2>&1 will do the trick.
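
A few hedged examples of this fetching step (the file names are made up):

nm libfoo.a                      # symbols defined and referenced in a library
tar tzvf backup.tar.gz           # list an archive's contents
find . -name '*.java' -print     # locate the files that interest us
gcc -c prog.c 2>&1 | more        # error messages sent to the standard output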

There are many other cases I've not covered here: relational databases, version control systems, mail clients, office applications, and so on. Always keep in mind that you're unlikely to be the first one who needs the application's data converted into a textual format; therefore someone has probably already written a tool for that job. For example, my Outwit tool suite (http://www.spinellis.gr/sw/outwit) can convert into a text stream data coming from the Windows clipboard, an ODBC source, the event log, or the registry.

Selection

Given the generality of the textual data format, in most cases you'll have on your hands more data than what you require. You might want to process only some parts of each row, or only a subset of the rows. To select a specific column from a line consisting of elements separated by space or another field delimiter, use awk with a single print $n command. If your fields are of fixed width, then you can separate them using cut. And, if your lines are not neatly separated into fields, you can often write a regular expression for a sed substitute command to isolate the element you want.

The workhorse for obtaining a subset of the rows is grep. Specify a regular expression to get only the rows that match it, and add the -v flag to filter out rows you don't want to process. Use fgrep with the -f flag if the elements you're looking for are fixed and stored in a file (perhaps generated in a previous processing step). If your selection criteria are more complex, you can often express them in an awk pattern expression. Many times you'll find yourself combining a number of these approaches to obtain the result that you want. For example, you might use grep to get the lines that interest you, grep -v to filter out some noise from your sample, and finally awk to select a specific field from each line.
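
As an illustration, the following sketch isolates the fifth field of the error lines in a hypothetical server log, leaving out those that concern timeouts:

grep ERROR server.log |
grep -v timeout |
awk '{print $5}'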

Processing

You'll find that data processing frequently involves sorting your lines on a specific field. The sort command supports tens of options for specifying the sort keys, their type, and the output order. Having sorted your results, you then often want to count how many instances of each element you have. The uniq command with the -c option will do the job here; often you'll post-process the result with another sort, this time with the -n flag specifying a numerical order, to find out which elements appear most frequently. In other cases you might want to compare results between different runs. You can use diff if the two runs generate results that should be the same (perhaps the output of a regression test), or comm if you want to compare two sorted lists. You'll handle more complex tasks using, once again, awk.
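
The frequency-counting idiom looks like this in practice; the sketch counts requests per client host in a web server log in the common log format (access_log is a stand-in name):

awk '{print $1}' access_log |
sort |
uniq -c |
sort -rn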

Summarizing

In many cases the processed data is too voluminous to be of use. For example, you might not care which symbols are defined with the wrong visibility in your program, but you might want to know how many there are. Surprisingly, many problems involve simply counting the output of the processing step using the humble wc (word count) command and its -l flag. If you want to know the top or bottom 10 elements of your result list, then you can pass your list through head or tail. To format a long list of words into a more manageable block that you can then paste into a program, use fmt (perhaps run after a sed substitution command tacks a comma after each element). Also, for debugging purposes you might initially pipe the result of intermediate stages through more or less, to examine it in detail. As usual, use awk when these approaches don't suit you; a typical task involves summing up a specific field with a command like sum += $3.
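
Two final-stage fragments, to be tacked onto the end of whatever pipeline produced the lines (the third field is an arbitrary choice):

wc -l                                # how many lines survived the processing
awk '{sum += $3} END {print sum}'    # total of a numeric field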

Plumbing

All the wonderful building blocks we've described are useless without some way to glue them together. For this you'll use the Bourne shell's facilities. First and foremost comes the pipeline (|), which allows you to send the output of one processing step as input to the next one. In other cases you might want to execute the same command with many different arguments. For this you'll pass the arguments as input to xargs. A typical pattern involves obtaining a list of files using find, and processing them using xargs. So common is this pattern that, in order to handle file names with embedded spaces, both commands support an argument (-print0 and -0, respectively) to have their data terminated with a null character instead of a space. If your processing is more complex, you can always pipe the arguments into a while read loop (amazingly, the Bourne shell allows you to pipe data into and out of all its control structures). When everything else fails, don't shy away from using a couple of intermediate files to juggle your data.
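
Two sketches of this plumbing; the file patterns and the commands applied to each file are placeholders:

find . -name '*.c' -print0 | xargs -0 grep -l strcpy
find . -name '*.log' -print | while read f; do gzip "$f"; done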

Putting it all together

The following command will examine all Java files located in the directory src, and print the ten files with the highest number of occurrences of a method call to substring.

find src -name '*.java' -print |
xargs fgrep -c .substring |
sort -t: -rn -k2 |
head -10

The pipeline sequence will first use find to locate all the Java files, and apply fgrep to them, counting (-c) the occurrences of .substring. Then, sort will order the results in reverse numerical order (-rn) according to the second field (-k2) using : as the separator (-t:), and head will print the top ten.

Appalled? Confused? Disheartened? Don't worry. It took me four iterations and two manual lookups to get the above command exactly right, but it was still a lot faster than counting by hand, or writing a program to do the counting. Every time you concoct a pipeline you become a little better at it, and, before you know it, you'll become the hero of your group: the one who knows the commands that can do magic.

* This piece has been published in the IEEE Software magazine Tools of the Trade column, and should be cited as follows: Diomidis Spinellis. Working with Unix Tools. IEEE Software, 22(6):9–11, November/December 2005. (doi:10.1109/MS.2005.170)

Continue reading "Working with Unix Tools"

2005.09.02

Version Control Talk Demystified

One indication of the importance an endeavor has in our lives is the vocabulary associated with it. If developers employ a tool or a method, inevitably they will come up with words to describe their corresponding work in an accurate and concise way. I recently heard a colleague describe version control systems (also formally known as configuration management tools) as boring. I hope that this dictionary will dispel that myth by documenting a rich technical and social vocabulary. If you don’t work with a VCS, I believe this list will give you plenty of reasons to look at what these systems can do for you and your projects. On the other hand, if you already use a VCS, I hope you will find ideas on how to use it more productively and how to improve your configuration management process. And, no matter which group you belong to, I am sure you’ll find here some new words worth knowing.

annotate: A command used for listing the latest version of each source code line, together with the date and file version in which it was introduced and the person who committed it.

attic: The repository location containing files that no longer belong to the head branch (CVS).

backout: To undo the effects of a commit, often by introducing a new commit that restores things to the state they were in before. Developers who consider a change to be in the wrong direction may ask its committer to back out that change. See also hostile backout.

backout war: See commit war.

Continue reading "Version Control Talk Demystified"

2005.09.01

Version Control Systems

A source code control system [is] a giant UNDO key—a project wide time machine.

— A. Hunt and D. Thomas

Sane programmers don't write production code without the help of an editor and an interpreter or a compiler, yet I've seen many software projects limping along without using a version control system. We can explain this contrast if we think in terms of the increased startup costs and the delayed gratification associated with the adoption of a VCS. We humans typically discount the future, and therefore implementing version control in a project appears to be a fight against human nature. It is true that you can't beat the productivity boost that compilers and editors have provided us, but four decades after punched card programming in assembly language went out of fashion we must look elsewhere to reap our next gains in efficiency. And if you or your project is not using a VCS, adopting one may well be the single most important improvement you can undertake.

Procurement and installation

Acquiring a VCS need not be expensive; depending on the operating system you are using you may in fact find that one (probably CVS or RCS) is already installed and ready to run. If not, you have the luxury of choosing the system to use based on your budget. If you're on a shoestring budget you can safely pick a free, open source system: some of these systems have been used for multi-million line projects for over a decade. If you can shell out some cash, you will find that several commercial systems offer additional features and a more polished interface. Installation of the VCS typically also involves setting up the repository, the location where the definitive version of your source code and its changes will reside. Be sure to include the repository in your scheduled backups.

Life with a VCS

Normal software development with a VCS is only marginally more complicated than without it. Initially you start a new project, or import your existing project into the VCS. From then on, to work, you check out a version of the project into a private directory. Every time you are happy with a change you've made—like a bug fix, or the addition of a new feature—you commit your change to the repository, accompanied by an explanatory message. Also, whenever you feel in the mood for some excitement you synchronize or update your private version of the software with the changes committed by your colleagues. This action will provide you with endless hours of fun as you battle against your colleagues' mistakes, but it also ensures that you're all working on roughly the same source code base. Finally, when you roll out a release, you label or tag all files with the release's name. And this is basically it.
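
To make the cycle concrete, here is roughly what a session looks like with CVS; the module, file, and tag names are made up, and other systems use different, but equivalent, commands.

cvs import -m "Initial import" myproject vendor start   # place an existing project in the repository
cvs checkout myproject                                   # obtain a private working copy
cvs commit -m "Fix off-by-one error in date handling" Calendar.java
cvs update -d                                            # merge in your colleagues' changes
cvs tag RELEASE_1_0                                      # label all files with the release's name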

The goodies

Now that you are convinced that adopting a VCS isn't a Herculean task, let us briefly see some of the benefits you will reap. First of all, if you are working in a team, you will stop stepping on each other's toes by writing over other people's code. If both you and Mary change the same file, the system will either unobtrusively merge your changes or warn you that they conflict. In addition, every time you commit a change, you create a new version of the corresponding files. With the version information that the VCS stores you can access each file's history of changes, and you can see who changed which lines when. Now that you can always go back to a specific version of a file, you don't need to comment out code blocks "in case they are needed in the future": the older version of the code is safely stored in the VCS repository. You can therefore see the differences between versions of the same file, and in many VCS implementations you can get an annotated listing of the file indicating the name and date of each line's most recent change. The repository also acts as the source of truth regarding the files stored in it. Source code distribution simply involves obtaining or updating a private workspace from the VCS repository. Once you label a project's files for a given release you can use the release's name to obtain an exact copy of that historic file set again. Furthermore, you can split development into different branches, with each branch tracking, for example, the fixes associated with a given software release. You can then easily obtain the file versions associated with a given branch, and apply the same fix to multiple branches. Finally, with all the project's history neatly stored in the repository, you can mine the VCS data to see how you're doing: How many lines were changed for version 3.1? Which are the most and least productive days of the week? Which developers work on the same files?
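
As a small taste of such mining, a throwaway pipeline can count commits per developer. The one below assumes CVS-style log output, where each revision record contains a line of the form "date: ...; author: name; ..."; other systems use different formats, so treat it as a sketch.

# count commits per developer from the repository's log
cvs log |
sed -n 's/^date:.*author: *\([^;]*\);.*/\1/p' |
sort |
uniq -c |
sort -rn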

Best practices

Even if you have already been using a VCS for some time, you may be able to squeeze more juice out of it. Here are some ideas.

Put everything under version control. Version control is not only for the source code; use it for your build scripts, help files, design notes, documentation, translated messages, GUI elements, everything that comprises your project.

Use VCS on your personal projects. You don't have to work on a team to adopt a VCS. Consider using a VCS for your personal files, like your hobby projects, your web page, or your phone book. Some developers even use a VCS to synchronize their home directories among different hosts.

Think carefully about file names and organization. Some VCSs get confused when a file name changes: you have the unattractive choice between losing either the file's revision history or the ability to retrieve older versions of the software with the correct file name. It therefore makes sense to adopt, from the beginning of the project, file names and a directory organization that will remain relatively stable through the project's life.

Perform one separate commit for every change. Do not lump multiple changes into a single commit. Separating changes allows you to see precisely which lines were affected by the change, and apply the change selectively to other branches. This rule is especially important if a change involves global stylistic changes, which will affect thousands of code lines.

Label all releases. Whenever you release the software (even to the testing group next door), label it. This provides everyone with a concrete name to associate with bug reports and their fixes.

Establish and follow policies and procedures. VCS actions can affect all developers. You will therefore benefit from clear policies covering developer etiquette or the content of commit messages, and procedures covering heavy operations, such as branching and releases.


* This piece has been published in the IEEE Software magazine Tools of the Trade column, and should be cited as follows: Diomidis Spinellis. Version Control Systems. IEEE Software, 22(5):108–109, September/October 2005. (doi:10.1109/MS.2005.140)

Continue reading "Version Control Systems"

2005.07.01

Tool Writing: A Forgotten Art?

Merely adding features does not make it easier for users to do things—it just makes the manual thicker. The right solution in the right place is always more effective than haphazard hacking.

— Brian W. Kernighan and Rob Pike

In 1994 Chidamber and Kemerer defined a set of six simple metrics for object-oriented programs. Although the number of object-oriented metrics swelled to above 300 in the years that followed, I had a case where I preferred to use the original classic metric set for clarity, consistency, and simplicity. Surprisingly, none of the six open-source tools I found and tried to use fitted the bill. Most tools calculated only a subset of the six metrics, some required tweaking to make them compile, others had very specific dependencies on other projects (for example Eclipse), while others were horrendously inefficient. Although none of the tools I surveyed managed to calculate correctly the six classic Chidamber and Kemerer metrics in a straightforward way, most of them included numerous bells and whistles, such as graphical interfaces, XML output, and bindings to tools like ant and Eclipse.

As an experiment, I decided to implement a tool to fit my needs from scratch to see how difficult this task would be. In the process I discovered something more important than what I was bargaining for: writing standalone tools that can be efficiently combined with others to handle more demanding tasks appears to be becoming a forgotten art.

Going the Unix way

My design ideal for the tool I set out to implement was the filter interface provided by the majority of Unix-based tools. These are designed around a set of simple principles (see Kernighan and Pike's The Unix Programming Environment, Prentice-Hall, 1984, and Raymond's The Art of Unix Programming, Addison-Wesley, 2003):

  • Each tool is responsible for doing a single job well.
  • Tools generate textual output that can be used by other tools. In particular this means that the program output will not contain decorative headers and trailing information.
  • Tools can accept input generated by other tools.
  • The tools are capable of stand-alone execution, without user intervention.
  • Functionality should be placed where it will do the most good.

Apart from the temptations I will describe later on, these principles are very easy to adopt. The 1979 7th Edition Unix version of the cat command is 62 lines long; the corresponding echo command is 22 lines long (double the size of the 1975 6th Edition version).[1] Nevertheless, tools designed following the principles I outlined easily become perennial classics, and can be combined with others in remarkably powerful ways. As an example, the 9-line, 30-year-old 6th Edition Unix version of the echo command can be used directly today, as a drop-in replacement, in 5705 places in the current version of the FreeBSD operating system source code; we would need the 26-year-old and slightly more powerful 7th Edition version in another 249 instances.[2] Nowadays tools following the conventions we described are also widely available in open source implementations for systems such as Linux, Windows, *BSD, and Mac OS X.

Following the principles I described, the ckjm metric tool I implemented will operate on a list of compiled Java classes (or pairs of archive names followed by a Java class) specified as arguments or read from its standard input. It will print on its standard output a single line for each class, containing the class name and the values of the six metrics. This design allows us to use pipelines and external tools to select the classes to process, or to format the output; refer to the tool's web site for specific examples.[3] Given ckjm's simplicity and paucity of features, I was not surprised to find it was both more stable and more efficient than the tools I was trying to use: by ignoring irrelevant interfacing requirements I was able to concentrate my efforts on the tool's essential elements.
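
For instance, a pipeline along the following lines would list the ten most method-heavy classes; it assumes the tool is packaged as ckjm.jar and that the second output field holds the weighted-methods-per-class value, so adjust it to match your own installation.

find build -name '*.class' -print |
java -jar ckjm.jar |
sort -k2 -rn |
head -10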

Temptation calls

A month after I put the tool's source on the web I received an email from a brilliant young Dutch programmer colleague. He had enhanced the tool I wrote, integrating it with the ant Java-based build tool and adding an option for XML output. He also supplied me with a couple of XSL scripts that transformed the XML output into nicely typeset HTML. Although the code was well written and the new facilities appeared alluring, I am afraid my initial reply was not exactly welcoming.

The perils of tool-specific integration

Allowing the tool to be used from within ant sounds like a good idea, until we consider the kind of dependency this type of integration creates. With the proposed enhancements the ckjm tool's source code imports six different ant classes, and therefore the enhancements create a dependency between one general-purpose tool and another. Consider now what would happen if we also integrated ckjm with Eclipse and a third graphics drawing software package. Through these dependencies our ckjm tool would become highly unstable: any interface change in any of the three tools would require us to adjust ckjm correspondingly. The functionality provided by the imported ant classes is certainly useful: it provides us with a generalized and portable way to specify sets of files. However, providing this functionality within one tool (ant) violates the principle of adding functionality in the place where it would do the most good. Many other tools would benefit from this facility; therefore the DirectoryScanner class provided by ant should instead be part of a more general tool or facility.

In general, the ant interfaces provide services for performing tasks that are already supported reasonably well as general-purpose abstractions in most modern operating systems, including Windows and Unix. These abstractions include the execution of a process, the specification of its arguments, and the redirection of its output. Creating a different, incompatible interface for these facilities is not only gratuitous, it relegates venerable tools developed over the last 30 years to second-class citizens. This approach simply does not scale. We cannot require each tool to support the peculiar interfaces of every other tool, especially when there are existing conventions and interfaces that have withstood the test of time. We have a lot to gain if the tools we implement, whether we implement them in C, Java, C#, or Perl, follow the conventions and principles I outlined in the beginning.

The problems of XML output

Adapting a tool for XML output is less troublesome, because XML data solves some real problems. The typeless textual output of Unix tools can become a source of errors. If the output format of a Unix-type tool changes, tools further down a processing pipeline will continue to happily accept and process their input assuming it follows the earlier format. We will only realize that something is amiss if and when we see that the final results don't match our expectations. In addition, there are limits to what can be represented using space-separated fields with newline-separated records. XML allows us to represent more complex data structures in a generalized and portable way. Finally, XML allows us to use some powerful general-purpose verification, data query, and data manipulation tools.

On the other hand, because XML intermixes data with metadata and abandons the simple textual line-oriented format, it shuts out most of the tools that belong to a Unix programmer's tool bench. XSL transformations may be powerful, but because they are implemented within monolithic all-encompassing tools, any operation not supported becomes exceedingly difficult to implement. Under the Unix open-ended specialized tool paradigm, if we want to perform a topological sort on our data to order a list of dependencies, there is a tool, tsort, to do exactly that; if we want to spell-check a tool's output, again we can easily do it by adding the appropriate commands to our pipeline.
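
For instance, tsort reads pairs of names, where each pair states that the first item must precede the second; the module names below are invented.

# prints lex, parse, codegen, link (one name per line)
echo 'lex parse
parse codegen
codegen link' | tsort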

Another problem with XML-based operations is that their implementation appears to be orders of magnitude more verbose than the corresponding Unix command incantations. As a thoroughly unscientific experiment I asked a colleague to rewrite an awk one-liner I used for finding Java packages with a low abstractness and instability value into XSL. The 13-line code snippet he wrote was certainly less cryptic and more robust than my one-liner. However, within the context of tools we use to simplify our everyday tasks I consider the XSL approach unsuitable. We can casually write a one-liner as a prototype, and then gradually enhance it in an explorative, incremental way, if the initial version does not directly fit our needs (according to Pareto's principle, 90% of the time it will). Writing 13 lines of XSL is not a similarly lightweight task. As a result we have fewer opportunities to use our tools and to become proficient in exploiting them.

Finally, although adding XML output to a tool may sound enticing, it appears to be a first step down a slippery slope. If we add direct XML output (ckjm's documentation already included an example of how to transform its output into XML using a 13-line sed script), why not allow the tool to write its results into a relational database via JDBC—surely the end result would be more robust and efficient than combining some existing tools. Then comes the database configuration interface, chart output, a GUI environment, a report designer, a scripting language, and, who knows, maybe the ability to share the results over a peer-to-peer network.
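
Layering such a transformation outside the tool needs very little code anyway. The following awk sketch is not the sed script from ckjm's documentation; the element and attribute names are invented, and metrics.txt stands for saved ckjm output.

# wrap each "class metric metric ..." line in an XML element
awk 'BEGIN { print "<ckjm>" }
     { printf "  <class name=\"%s\" wmc=\"%s\" dit=\"%s\"/>\n", $1, $2, $3 }
     END { print "</ckjm>" }' metrics.txt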

Realpolitik

The ant integration and XML output will be part of ckjm by the time you read these lines, probably as optional components. Emerson famously wrote that "A foolish consistency is the hobgoblin of little minds." Spreading ideology by alienating users and restricting a tool's appeal sounds counterproductive to me. Nevertheless, the next time you ask or pay for tighter integration or a richer input or output format for a given tool, please consider whether what you are asking for can (already) be accomplished in a more general fashion, and what the cost of this new feature will be in terms of the stability, interoperability, and orthogonality of your environment.

[1] http://minnie.tuhs.org/UnixTree/
[2] The 7th Edition version supports an option for omitting the trailing newline character. I derived both numbers in less than a minute by combining together seven Unix tools.
[3] http://www.spinellis.gr/sw/ckjm

* This piece has been published in the IEEE Software magazine Tools of the Trade column, and should be cited as follows: Diomidis Spinellis. Tool Writing: A Forgotten Art? IEEE Software, 22(4):9–11, July/August 2005. (doi:10.1109/MS.2005.111)

Continue reading "Tool Writing: A Forgotten Art?"

2005.05.01

Java Makes Scripting Languages Irrelevant?

Simplicity does not precede complexity, but follows it.

— Alan J. Perlis

In computing we often solve a complex problem by adding another level of indirection. As an example, on Unix file systems an index node, or inode, data structure allows files to be allocated concurrently and sparsely, and yet still provide an efficient random access capability. When we want to customize large and complex systems or express fluid and rapidly changing requirements, a common approach is to add a scripting layer on top of the corresponding system. An early instance of this approach was Dan Murphy's TECO editor, developed on the DEC PDP-1 computer in 1962–63: its command language also doubled as an arcane (to put it politely) macro language.

About 20 years ago adding a scripting language interface to existing applications, which were at the time typically written in C, was all the rage. Lotus 1-2-3 supported macro commands, Framework had the FRED language, and AutoCAD and Emacs could be programmed in a form of Lisp. On the Unix platform system administrators wrote sophisticated sendmail configuration files to bridge the—at the time disparate and mutually incompatible—email networks. This was also the time when John Ousterhout developed Tcl/Tk as a general-purpose scripting language, to be integrated with any system that could benefit from such a capability. A few years later Microsoft came up with Application Basic as its general-purpose scripting language for all its office productivity applications. All those early developments acquainted programmers with the notion of customizing applications through scripting, and paved the way for powerful general-purpose scripting languages such as Perl, Python, and Ruby (see John K. Ousterhout. Scripting: Higher-level programming for the 21st century. Computer, 31(3):23–30, March 1998). My impression is that with the evolution of Java and Microsoft's .NET offerings (I'll use the term Java from now on as a stand-in for both alternatives) the niche occupied by scripting languages is rapidly shrinking; we are approaching the end of an era.

The application scripting languages I described serve an important purpose. Glued onto an application, they can greatly ease its configuration and customization and can allow end-user programming by offering a safe and friendly development environment. Those programming an application through its scripting interface do not have to bother with the intricacies of C's memory management, the mechanism used for managing character strings in the specific application, or the complexity of the application's internal data structures. Instead, the scripting language typically offers (among other things) automatic memory management, a powerful built-in string data type, sophisticated data structures, a rich repertoire of operations, and an intuitive API for manipulating the application's data and state. In addition, the application, by interpreting the scripting language, can isolate itself from undesirable effects of the scripting code, such as crashes and corruption of its data.

Rumors of the death of scripting languages ...

Notice how most of the nice features applications obtain through the use of scripting languages are now offered by Java:

  • automatic memory management through garbage collection,
  • a standard string data type,
  • collection interfaces implementing most useful data structures, and
  • a very rich language library.

In addition, in applications written in Java what can be considered an API already comes for free as part of their object-oriented design. Once an application can dynamically load user-specified classes, exposes its API by providing access to some of its objects, and limits its exposure through the security manager and exception handlers, the need for a separate scripting language vanishes.

In fact, many modern Java applications that support beans, plugins, and other extension mechanisms follow exactly this strategy. Eclipse, Maven, Ant, Javadoc, ArgoUML, and Tomcat are some notable examples. Even on resource-constrained embedded devices, such as mobile phones, which are still programmed in a system programming language, configuration and customization is currently moving in the Java direction.

... are greatly exaggerated

Does the trend of customizing applications through a Java interface make scripting languages irrelevant? Yes and no. As an application configuration and extension mechanism, Java is probably the way to go. The cost of marshaling and unmarshaling data objects and types between the application's code written in Java and the conventions expected by a different scripting language is too high for the limited incremental benefits that the scripting language would offer. On the other hand, scripting languages still have an edge in a number of areas, offering several distinct advantages.

A more flexible or imaginative syntax. Think of Perl's numerous quoting mechanisms and its regular expression extension syntax, or Python's use of indentation for grouping statements. These make some program elements a lot easier to read. As an example, variable substitution within Perl's or the Unix shell's double quoted strings is by far the most readable way to represent a program's output.
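
Compare, in the shell's case, the string with the message it produces; the variable names are of course made up.

count=42
file=access.log
echo "Processed $count records from $file"
# prints: Processed 42 records from access.log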

Less fuss about types. Most scripting languages are typeless and therefore easier to write programs in. For example, Perl makes writing a client or server for an XML-based web service a breeze, whereas in Java we have to go through a number of contortions to implement the same functionality. Of course, the robustness and maintainability of code written in a typeless language is a different question, as many of us who maintain production code written in a scripting language later discover.

A more aggressive use of reflection. Consider here Perl's eval construct and Python's object emulation features. These allow the programmer to construct and execute code on the fly, or dynamically change a class's fields. In many cases these features simplify the construction of flexible and adaptable software systems.

Tighter integration with command-line tools. Although Java 1.5 comes with an API containing over 3000 classes—with thousands more being available through both open source and proprietary packages—many operations can still be performed in a more reliable and efficient manner by interfacing with venerable command-line tools. The Unix scripting languages provide many facilities for combining these tools, such as the creation of pipelines, and the processing of data through sophisticated control constructs.

Viability as a command language. Many scripting languages, such as those of the operating system shells, can also double as a command language. Command-line interfaces often offer a considerably more expressive working medium than GUI environments (we'll expand on that in another column). Coupling a command-line interface with a scripting language means that commonly executed command sequences can easily be promoted into automated scripts; a boon to developers. This coupling also encourages an exploratory programming style, which many of us find very productive. I often code complex pipelines step by step, examining the output of each step, before tacking another processing element onto the pipeline's end.
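
A typical session might look like the following; the log file name and field positions are hypothetical. Each command is the previous one with one more stage tacked on, its output checked before going further.

grep ERROR app.log | head                                        # does the pattern match?
grep ERROR app.log | cut -d' ' -f1 | head                        # keep only the date field
grep ERROR app.log | cut -d' ' -f1 | sort | uniq -c | sort -rn   # count errors per day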

A shorter build cycle. Although for many systems a build cycle that provided time for an elaborate lunch is now sadly history, the tight feedback loop offered by the lack of a compilation step in scripting languages allows for rapid prototyping and exploratory changes, often hand-in-hand with the end-user. This is a feature that those using agile development methodologies can surely appreciate.

So, where do we stand now? The gap between system programming languages and scripting languages is slowly closing. For example, some scripting languages are capitalizing on Java's infrastructure by having their code compile into JVM bytecode. However, there is still a lot of ground in the middle that is up for grabs. New system programming language designs can offer more of the advantages now available only through scripting, while scripting languages are constantly benefiting from hardware performance advances that make their (real or perceived) efficiency drawbacks less relevant every day. The issue of the result's quality remains an open question on both fronts.

We developers, as avid tool users, enjoy watching the battle from above, reaping the benefits.

* This piece has been published in the IEEE Software magazine Tools of the Trade column, and should be cited as follows: Diomidis Spinellis. Java Makes Scripting Languages Irrelevant? IEEE Software, 22(3):70–71, May/June 2005. (doi:10.1109/MS.2005.67)

Continue reading "Java Makes Scripting Languages Irrelevant?"

2005.03.01

Dear Editor

Machines should work. People should think.

— Richard Hamming

Dear Editor,
I know that you are nowadays often taken for granted, and that many programmers consider you a relic of an older age. Yet programmers continue to spend an inordinate amount of time with you, and often listen to your advice. As you have no doubt observed, I am often mistreated; in this letter I have written down my most common grievances, hoping you can convince those programmers to behave better toward me in the future.

I know that in the past your position was a lot more important. Developers used to fight over which of your two cousins, vi and emacs, was the more versatile. Creating an editor was no mean feat; the (now famous) programmers who brought your cousins to life, Bill Joy and Richard Stallman, had to overcome the limitations of a small address space, slow terminal lines and CPUs, as well as the idiosyncrasies of numerous mutually incompatible terminals. Today, the pervasiveness of GUI libraries, fast processors, and the abundance of memory space make the development of an editor a weekend project.

Yet you are no less important than you were 20 years ago. One of the best things you can do for programmers (and, incidentally, us programs) is to convince them to use their head, taking advantage of your advanced facilities, instead of their fingers. This switch will first of all reduce their risk of suffering from repetitive stress injury (RSI). If they can accomplish the effect of 100 keystrokes by giving you a 20-character search-and-replace command, they have saved their fingers from the impact force of 80 keystrokes. Furthermore, by devising complex commands instead of repetitively typing, they remain attentive rather than bored, and, frankly, I trust you a lot more than the programmers in performing repetitive actions. Finally, and most importantly, each time programmers think of a way to automate a complex editing task by giving you an appropriate command, they sharpen their mental skills. In contrast to their finite typing capacity, their mental skills appear to be infinitely expandable; over the years I have encountered some programmers who appeared to perform magic with their editor.

My dear editor, let me give you some examples of how expert programmers let their brain work instead of their fingers. If one of my methods contains variables x1, y1, x2, y2, a deft programmer using Eclipse will look for any of them using the regular expression \b[xy][12]\b (match a word boundary, one of x or y, followed by 1 or 2, followed by another word boundary). Changing the variable names by adding an underscore between the letter and the digit is a bit more complex, but this is exactly what I mean by thinking instead of typing. Programmers using vi would type :s/\<\([xy]\)\([12]\)\>/\1_\2/g (substitute a word consisting of x or y followed by 1 or 2 with the first matched pattern—\1—followed by an underscore, followed by the second matched pattern—\2). In cases where a regular expression-based substitution command becomes too complex, a useful pattern I have seen involves searching for the code element with a regular expression search, but performing the replacement using the editor's "repeat last command" feature.

Of course, we both know that an editor is not always the perfect tool for modifying programs. External tools can also come in handy, and for this I really appreciate editors that can pipe a range of my body through an external filter. For example, if the order of two elements is reversed in a structure, the initialization data can also be reversed by piping it through the awk {print $2, $1} one-liner.
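
In vi, for instance, one can mark the first and last lines of the initializer with ma and mb and then replace the range with the filter's output; the marks, and the assumption of whitespace-separated fields, are just for the sake of the example.

:'a,'b!awk '{ print $2, $1 }'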

I do not, however, consider all automation beneficial. Many programmers use your auto-formatting facilities to beat us programs into shape. This is, however, highly inconvenient for us. First of all, your auto-formatting facilities are not a substitute for good taste. Often, by judiciously adding some space in one of my code blocks I can become easier to understand and maintain. The blind application of auto-formatting can destroy carefully laid out code elements. In addition, every time I pass through a version control system with a new format, thousands of unimportant changes are logged, confusing the programmers that will maintain me in the future. Finally, auto-formatting introduces another problem into the development process; a risk analyst would call it overcompensation. Programmers, confident that you, the editor, will handle all the formatting tasks for them, completely neglect formatting us programs, leaving us in a state worse than the one we would be in without your help. Scientists have observed this phenomenon in the real world: after the introduction of safety caps on medicine bottles, parents neglected locking medicine cabinets and accidental child poisonings actually increased. About a week ago I overheard with horror a programmer commenting that he didn't precisely know Java's formatting guidelines, because his editor was handling formatting for him.

I hear you say: "You are asking me to stop doing my work. I simply cannot sit idle watching the programmer hack you programs to death!" My dear editor, don't worry, there is plenty of useful work you can do. Syntax coloring is one important function. Through syntax coloring programmers can easily identify keywords, variables, constants, and comments. They can often also spot silly syntax errors (such as a missing quote), avoiding the distraction of an unproductive compile-edit cycle. This brings me to another useful service you should be providing: error highlighting. Identifying a missing operator or semicolon will also help in the same way, as long as you do it in a correct and unobtrusive manner. Don't distract programmers with false alarms while they write a statement, and never highlight errors that aren't. False error reports will make programmers simply switch this useful feature of yours off. A deep understanding of the language we're written in will also help you better serve programmers. Most modern languages follow a block structure identified by indentation; in some languages (such as Python, Occam, and Haskell) indentation is even semantically significant. An editor that will not allow programmers to easily increase and decrease the indentation level of our code blocks is simply not suitable for programming. Another language-specific service you can offer is the marking of matching delimiters (brackets, braces, square and angle brackets, and, dare I suggest it, XML tags). Of course, I know that some of your kind go even further and provide complete refactoring support: changes of variable names and method signatures, field encapsulation, the extraction of local variables and constants, and movement of my code elements. I am all for that; things that can be done automatically in a reliable way allow programmers to spend more quality time with me.

You, as an editor, should also help programmers navigate within their increasingly complicated environment. By providing on-line help for API elements and convenient facilities for browsing my code's structure, you make me less fearful of growing fat and ugly at the hands of programmers who reinvent the wheel instead of using an API feature or one of my existing classes or functions.

I feel I've been babbling for far too long, so I will close this letter with a few words on a facility I am sure you are really proud of: editor macros. I am sorry to tell you, but from my experience, these macros are often signs of a hidden design deficiency. If a programmer has to use a macro to change a program's configuration setting, then the program is not flexible enough. I've observed that in most cases where programmers use macros to save repetitive typing, they are programming at the wrong abstraction level.

So, dear editor, please try to impart to our programmer friends the following advice: don't type what you can automate in the editor environment, and don't use the editor facilities for what you can code.

Sincerely,
A Program

* This piece has been published in the IEEE Software magazine Tools of the Trade column, and should be cited as follows: Diomidis Spinellis. Dear Editor. IEEE Software, 22(2):14–15, March/April 2005. (doi:10.1109/MS.2005.36)

Continue reading "Dear Editor"

2005.01.01

The Tools at Hand

The tools we use have a profound (and devious!) influence on our thinking habits, and, therefore, on our thinking abilities.

— Edsger W. Dijkstra

With a shovel excavator a single operator can effortlessly move 720 tons of earth in a single movement; a VLSI fabrication plant allows a designer to create elaborate sub-micron structures. Without tools the thousands employed in a car factory are nothing; with tools they can assemble a car in 18 effort-hours. Sometimes, tools can even subsume the importance of their operators. The violinist Ivry Gitlis, considered one of the most talented musicians of his generation, said of his Stradivarius: "I have a violin that was born in 1713. I don't consider it my violin. Rather, I am its violinist; I am passing through its life." Tools are clearly an important and defining element of any profession and activity: tools help us move boulders and atoms, tools help us reach the Moon and our soul.

The objective of this new IEEE Software column is to explore the interplay between software development and the tools we apply to the problem. Skilled craftsmen set themselves apart from amateurs by the tools they use and the way they employ them. As a professional I feel I am getting a tremendous boost in my productivity by appropriately applying tools to the software construction problems I face every day. I also often find myself developing new tools, both for my personal use and for wider distribution. Column installments will discuss specific software construction activities from the standpoint of the tools we can employ, the tools of our trade. Specific topics I am planning to address include: editing, compiling, documentation, debugging, testing, configuration management, issue tracking, the development environment, tool building, and domain-specific tools. Of course, this is your column as much as it is mine, so I am always open to suggestions for different topics or viewpoints.

Under-spending on Development Tools

So, how do our tools of the trade measure up? Pathetically, by many measures. Although the software industry is large and, dare I say it, mature, the software tool industry is still in its infancy. This becomes readily apparent if we consider the cost of the tools we use. A 720-ton-rated shovel excavator is so expensive that the company selling it also provides financing. The cost of VLSI fabrication plants effectively dictates the product cycles of the manufactured chips. In comparison, the tools we use for software development cost at most a few thousand dollars per seat. Economists track capital expenditures as a way to judge the economic future of a country or sector. On the radar screen of these statistics the cost of software development tools wouldn't amount to a single blip.

Table 1. Capital expenditures in different industries
Industry                              Revenue ($M)   Capital expenditure ($M)   CE / R (%)
Semiconductors                             430,360                     99,577         23.1
Motor vehicles                           1,094,157                     90,042          8.2
Heavy (non-building) construction          143,957                      6,187          4.3
Prepackaged software                       105,356                      3,402          3.2
Programming services                        18,216                        438          2.4

To substantiate the claim of capital under-spending in our industry I used the COMPUSTAT global database to compare the capital expenditures of some industries we software engineers often admire and look to as role models against our own. Look at the numbers appearing in Table 1. The capital expenditures of the semiconductor industry amount to 23% of its revenue; this is how it has succeeded in following Moore's law for more than 30 consecutive years. The robotic factories of the car industry, envisioned by the proponents of software assembly plants, soak up 8% of its revenues. Even the nomadic heavy construction industry—our perennial favorite when we compare software engineering to bridge building—spends on capital equipment nearly double the percentage of revenue spent by our own custom software construction (programming services) firms.

I hear you say that the economics of software are different: software can be duplicated at zero marginal cost, so the low cost of tools reflects the realities of their distribution rather than their intrinsic value. I only wish this were true: that we were all buying expensively developed tools at rock-bottom prices. I can vouch from personal experience that the effort our industry puts into developing software development tools is apparently minuscule. A couple of years ago I developed UMLGraph, a prototype of a declarative UML diagramming tool, and made it available over my web site. I wrote the first version of this tool over a single weekend, yet I regularly receive email from enthusiastic users. This fact definitely does not reflect on my programming brilliance, but it says a lot about the state of the art in diagramming software and the amount of cash employers are willing to spend on purchasing diagramming (and conceivably other software development) tools.

What would happen if an established tool vendor with deep pockets decided to build a software development tool by investing the kind of money associated with a chip plant? (Mind you, I recognize the difference between chip production and software design; my argument concerns capital expenditures over the entire product life cycle.) According to Intel financier Arthur Rock, the cost of the capital equipment needed to build semiconductors doubles every four years. Currently a chip plant under construction costs over $2.5 billion. To put this number in perspective, consider that it represents about 13,000 software development effort-years. This is almost three times the effort invested in the development of OS/360 (5000 effort-years) and, according to my calculations, almost the same as the effort invested in the development of the Windows NT line, up to and including Windows 2000. Investing this kind of money in a design tool could buy us round-trip model-based software development that actually worked under realistic conditions. Investing this kind of money in a compiler could buy us type checking integrated across the presentation, application logic, and database layers, or the ability to generate provably correct and efficient code. We could also have at our hands debuggers able to execute a program forwards and backwards, editors that would let us navigate between diagrams and source code, effortlessly performing sophisticated refactoring operations, and infrastructure to test an application's GUI delivered as part of our IDEs. To get a picture of the lag between what is theoretically possible and what tools provide in practice, scan the conference proceedings of the last five PLDI and ICSE conferences, and see how few of the results reported there are now available to developers for everyday use.

Tools Underused

As if our under-spending on software development tools was not worrisome enough, a related problem in our profession is our failure to use the most appropriate tools for a given task. Here is my list of the Ten Software Tool Sins.

10. Maintaining the source code's API documentation separately from the source code.
9. Failing to integrate automated unit testing in the development process.
8. Using paper forms, email folders, and Post-it® notes to track pending issues.
7. Painstakingly analyzing the effects of a source code change in cases where the compiler and the language's type system can more reliably do the job.
6. Refusing to learn how existing tools can be made to work together through scripting or a shell interface.
5. Ignoring or (worse) silencing compiler warning messages.
4. Maintaining isolated copies of the source code base for each developer, and performing software configuration management using email attachments or the trendy new technology of USB dongles.
3. Locating definitions of program entities through a mixture of guesswork and sequential scanning through the source code.
2. Adding temporary print statements in the source code instead of using a debugger.
1. Performing mechanical repetitive editing operations by hand.

I often spot mature developer colleagues committing the number one offense in the list by the sound of their keyboard: a repetitive click-clack-clack, click-clack-clack, click-clack-clack typing pattern gives the game away. This sin is inexcusable, as (free) editors with sophisticated text processing capabilities have been available for over 30 years. Other sins, such as number two in the list, are admittedly a mixture of tool immaturity and developer laziness. The Linux 2.4 kernel contains 65000 printf/printk statements, the FreeBSD kernel another 17000. Many of these statements can be explained by the poor support of most debuggers for embedded and system software development; a shortcoming that is becoming increasingly important as more and more software is developed for embedded devices. In my experience, many other sinful habits can be traced back to our university days. In academia the dirty mechanics of software development are often regarded as a less than respectable activity. Software tools get in the way when teaching the "Introductory Programming" course, and would take valuable time away from discussing lofty theories when teaching "Software Engineering". Students are therefore left on their own, many graduating while still writing their software in Windows Notepad.

The Silver Lining

There is really no need to end this column in a sullen mood. Software is a great lever. The little our industry has invested in tool development has provided us with numerous admirable and sophisticated tools. The many volunteers working on free and open source software projects are further increasing our choices for mature development environments and tools. It is up to us to make the best of what is available, and, why not, contribute back to the community.

* This piece has been published in the IEEE Software magazine Tools of the Trade column, and should be cited as follows: Diomidis Spinellis. The tools at hand. IEEE Software, 22(1):10–13, January/February 2005. (doi:10.1109/MS.2005.23)

Continue reading "The Tools at Hand"

