blog dds: 2012-05-17

Even by our field’s dizzying rate of progress I wouldn’t expect to revisit the subject of version control just six years after I first wrote about it in this column (Version Control Systems. Software, 22(5):108–109, September/October 2005). Yet here we are. The new kid on the block is git, a distributed revision control system available on all mainstream development platforms through a Free Software license. Git, a brainchild of Linus Torvalds, began its life in 2005 as the revision management system used for coordinating the development of the Linux kernel. Over the years its functionality, portability, efficiency, and third-party adoption have evolved by leaps and bounds to make it its category’s leader. (Two other systems with similar characteristics are Mercurial and Bazaar.)

Revisions, not versions

Traditional version control systems derive their requirements from software configuration management practices. The focus of these practices is to identify, control, and disseminate the software’s configuration and changes. A system, like CVS or Subversion, that can retrieve the files corresponding to a specific software version, list the changes that led to it, and keep developers from trampling on each other’s feet satisfies these requirements and can be a boon over exchanging files through a shared folder or email. However, configuration management mainly prevents bad things from happening during software development; it provides (valuable) control, but few tools that genuinely aid a developer’s everyday life.

Developers don’t really care whether they work on version 8.2.72.6 of branch RELENG_8_2, but care deeply about software revisions: changes they made to fix a specific bug, infrastructure changes that were needed to support that fix, another set of changes that didn’t work out, and some work in progress that was interrupted to work on that urgent bug fix. Fittingly for a tool written by a programmer to scratch his own itch, git supports these needs with gusto. It gives developers a complete copy of the software repository, allowing them to create their own private branches corresponding to their individual needs. Each branch can correspond to a distinct task, like the development of a new feature or a bug fix. Developers can quickly create and delete branches, switch from one working branch to another, make small incomplete incremental commits, cherry-pick commits from other branches, or even stash away some changes to revisit them later. When a feature is mature enough for wider distribution they can package their changes as a complete well-integrated change set that others can merge into their work.

An important difference between git and its predecessors is that it elevates the software’s revisions to first-class citizens. By managing revisions git allows a developer to select precisely which ones will comprise an integrated change, down to partial changes within a single file. More importantly, git keeps as a graph a complete history of what changes have been merged into which branches, thus allowing developers to think in terms of revisions they have integrated rather than low-level file differences between diverging branch snapshots. This switch to a higher level of abstraction is no less dramatic than the one from assembly language, which dealt with CPU registers and memory addresses, to high-level programming languages, which provide entities like objects, containers, and threads. As one would expect, a higher level of abstraction provides opportunities for changing the way we think and work.
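To make the idea of history as a graph concrete, here is a minimal Python sketch. It is a toy model, not git's actual data structures: the commit names, the branch names, and the helper functions are all hypothetical and chosen only for illustration. Each commit records its parents, and the question "has this revision been merged into that branch?" becomes a reachability query.

```python
# Toy model of a commit graph: each commit maps to the list of its parents.
# All commit and branch names are hypothetical, chosen for illustration.
parents = {
    "a": [],            # initial commit
    "b": ["a"],         # work on master
    "f1": ["a"],        # first commit on a feature branch
    "f2": ["f1"],       # second commit on the feature branch
    "m": ["b", "f2"],   # merge commit: master absorbs the feature branch
    "x": ["f2"],        # later feature work, not yet merged
}

# A branch is just a name pointing at its most recent commit.
branches = {"master": "m", "feature": "x"}


def reachable(tip):
    """Return the set of commits reachable from tip by following parent links."""
    seen = set()
    stack = [tip]
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents[commit])
    return seen


def is_merged(commit, branch):
    """True if commit is part of the history of the named branch."""
    return commit in reachable(branches[branch])


def unmerged(source, target):
    """Commits in the history of source that target has not yet integrated."""
    return reachable(branches[source]) - reachable(branches[target])


print(is_merged("f2", "master"))               # feature work already merged
print(sorted(unmerged("feature", "master")))   # feature work still outstanding
```

In git itself, questions of this kind are answered by commands such as "git branch --merged" and "git log master..feature", which work on the real commit graph rather than on a dictionary like the one above.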

Decentralized revision control

By managing revisions git makes it natural and easy to push a revision to a remote repository (remember, each developer has a separate complete repository copy) or to pull some revisions from a remote repository to the local one. This in turn allows developers and their managers to build a variety of interesting workflows, most of which are impossible to run on a traditional centralized version control system. For instance, an integration manager can selectively pull changes from the developers’ public repositories and integrate them into a master repository that contains the project’s definitive picture. If the workload on the integration manager becomes excessive a series of “lieutenants” can take over the integration of specific project parts. The lieutenants integrate the developer changes in their public repositories and a higher level manager can then take those larger change sets and integrate them into the master repository. (This is the Linux kernel development model.) Or two developers can coordinate and share their work in a peer-to-peer fashion by pulling from each other’s repository.

The importance of being local

Unfortunately for this column’s focus, there’s more to git than its superb management of revisions and decentralized repositories. First, by keeping locally a complete version of a repository, git allows you to work and commit individual changes without requiring internet connectivity. This local copy also makes it possible for you to edit, reorder, and squash together your past commits (rebase in git’s parlance) to present a coherent story to the outside world. When you’re back online you can push your changes to a remote repository. The project’s past history is also always available to you. Want to see who fixed a specific bug while travelling at thirty thousand feet? Go through the project’s commit history; it’s there. Want to examine how the bug was fixed? The corresponding changes are also one command away. Furthermore, the local repository (and, no doubt, some highly skilled programming) makes all operations blindingly fast. This is a blessing for your personal productivity, but it’s also an enabler for performing more complex operations. For instance, building on the rapid repository access, git’s bisect command allows you to perform a binary search between two points in time to find the commit that broke your software. Finally, local repositories make it trivial to put even the smallest personal project under version control. Just enter “git init” at the directory where your project resides and you’re ready to go. When later on you want to share the project with others, you can easily associate it with a public remote repository and push all your changes there. This (plus git’s ability to import history from other version control systems) has allowed me to share work that precedes git’s inception.
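The binary search behind bisect can be sketched in a few lines of Python. This is a simplified illustration, not git's implementation: it assumes a linear history (git works on a commit graph), and the function names, the 1,000-commit history, and the position of the faulty commit are all hypothetical.

```python
def first_bad(commits, is_bad):
    """Return the first commit for which is_bad is true.

    Assumes commits is ordered oldest to newest, commits[0] is known good,
    commits[-1] is known bad, and every commit after the first bad one
    is also bad.
    """
    good, bad = 0, len(commits) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(commits[mid]):
            bad = mid      # the breakage happened at mid or earlier
        else:
            good = mid     # the breakage happened after mid
    return commits[bad]


# Hypothetical linear history of 1,000 commits; commit 637 introduced the bug.
history = list(range(1000))
tested = []


def is_bad(commit):
    tested.append(commit)  # stands in for one build-and-test cycle
    return commit >= 637


culprit = first_bad(history, is_bad)
print(culprit, len(tested))
```

Under these assumptions the search needs at most ten tests to single out one commit among a thousand, which is why fast local access to every revision matters. With git itself the same process is driven by "git bisect start", "git bisect good", and "git bisect bad", or automated with "git bisect run".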

The GitHub factor

If the idea of setting up a public repository, maintaining its servers and connectivity, keeping it secure and up-to-date, setting up user accounts, and supporting your users isn’t appealing, then you can delegate all this to a third-party provider. GitHub is the best known, but there are at least eight others that offer similar functions. GitHub simplifies many repository management tasks through a web-based user interface. In addition, it promotes cooperation in open-source projects, which are hosted for free, by making it easy for developers to clone existing projects and submit their contributions as a pull request. If you decide to pay in order to host a proprietary project on GitHub, then you’ll value the ability to set up teams with varying access rights across the project’s repositories. GitHub also provides an issue tracking system, a file download area, and Gollum, a git-based wiki. Through Gollum you can edit a page on the web and record the change as a git commit, but you can also perform manual or automated changes on the files of a local wiki clone and then push them onto an upstream repository. This gives you wiki-style effortless collaboration with git’s workflow sophistication; what more could one want?

^* This piece has been published in the IEEE Software magazine Tools of the Trade column, and should be cited as follows: Diomidis Spinellis. Git. IEEE Software, 29(3):100–101, May/June 2012. (doi:10.1109/MS.2012.61)
