Zombie Zen

How Structure Affects Git's UX

By Ross Light

It’s always interesting to me to compare different approaches to solving the same problem. Git and Mercurial are two version control systems that came out around the same time and tried to address very similar requirements. Git came from a very low-level systems perspective, whereas Mercurial spent a lot of effort on its user experience. Despite what you might think, their data models are remarkably similar. It’s from this observation that I started my side project, gg. I found myself greatly missing the experience of Mercurial, but I’ve resigned myself to the fact that Git is here to stay.

I came across a rather interesting challenge today while working on gg. I am trying to replicate the behavior of hg pull, and even though I’ve worked on gg for over a year now, I still haven’t reached a behavior that I’m satisfied with. I’ve finally realized why, and it boils down to a very subtle difference in the data models of the two systems.

A diagram showing commits contained inside a box titled "All", each leaf having a ref attached. One commit with no ref is outside this box, but still in the Git repository.

The Git model. In order to be part of git log --all, a commit must be reachable from a ref. However, the repository can also contain ephemeral commits that are not reachable from any ref.

At its core, Git has two concepts: commits and refs. A repository stores commits as an unstructured graph. Git considers any commit not reachable from a ref to be unimportant, and will delete such commits periodically. Such commits are typically the side effect of rebases or deleted local branches. Refs act like garbage collection roots, and in fact, Git calls this cleanup garbage collection. (Not the sort of judgement you want from a system safeguarding your code!)
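To make the reachability rule concrete, here is a minimal Python sketch of the idea. It is a toy model, not how Git is implemented: the commit IDs, the ref name, and the functions reachable and collect_garbage are all made up for illustration, and real Git also consults reflogs and a grace period before pruning anything.

```python
# Toy model: commits map an ID to its parent IDs; refs map a name to a commit ID.
# All IDs and names here are hypothetical.
commits = {
    "a": [],       # root commit
    "b": ["a"],
    "c": ["b"],    # tip of main
    "d": ["b"],    # left behind by a rebase; no ref points here
}
refs = {"refs/heads/main": "c"}


def reachable(commits, refs):
    """Return the set of commit IDs reachable from any ref."""
    seen = set()
    stack = list(refs.values())
    while stack:
        commit_id = stack.pop()
        if commit_id in seen:
            continue
        seen.add(commit_id)
        stack.extend(commits[commit_id])  # walk to the parents
    return seen


def collect_garbage(commits, refs):
    """Return a copy of the commit store with unreachable commits dropped."""
    live = reachable(commits, refs)
    return {cid: parents for cid, parents in commits.items() if cid in live}


live = reachable(commits, refs)
after_gc = collect_garbage(commits, refs)
```

In this sketch, commit "d" is still in the store before collection but is invisible to anything that starts from the refs, which is the situation the diagram above describes.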

A diagram showing all commits contained inside the Mercurial Revlog.

The Mercurial model. By virtue of being in the revlog, a revision is considered part of the repository’s history.

Mercurial, on the other hand, makes no such distinction. All commits in the revlog are part of the observable history. Mercurial gives the commits a fairly arbitrary default ordering, which is the order in which they were appended to the revlog. Mercurial does have a concept similar to refs called bookmarks, but this concept was introduced later and thus isn’t fundamental. In Mercurial, one must explicitly remove commits from the revlog using an extension command like hg strip, or mark them as obsolete with the new changeset evolution features. This is conceptually very simple: once you see a commit, it is now part of the history.
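For contrast, the append-only model can be sketched in a few lines of Python. ToyRevlog is a hypothetical stand-in I am using for illustration, not Mercurial's actual storage format: it only captures the two properties discussed here, namely that every stored revision is part of history and that the default order is the append order.

```python
class ToyRevlog:
    """Append-only list of revisions; a hypothetical stand-in for a revlog."""

    def __init__(self):
        self._revisions = []  # index in this list is the local revision number

    def append(self, node_id, parents):
        """Add a revision and return its local revision number."""
        self._revisions.append((node_id, tuple(parents)))
        return len(self._revisions) - 1

    def history(self):
        """Every stored revision is part of history, in append order."""
        return [node_id for node_id, _ in self._revisions]


log = ToyRevlog()
log.append("a", [])
log.append("b", ["a"])
rev_c = log.append("c", ["b"])
rev_d = log.append("d", ["b"])  # a second head; nothing extra is needed to keep it
```

There is no reachability walk and no root set in this sketch: asking for the history simply returns everything that was appended.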

This difference in data model causes complexity higher up in Git. When fetching commits from another repository, it is very difficult to do so without configuring what Git calls “remotes”. Fundamentally, the problem is that when git fetch adds commits to the repository, it needs to attach refs to those new graph leaves lest they be garbage collected. Remotes specify patterns for Git to use to create refs, thus preserving the new commits. Mercurial does not have this problem! In Mercurial, hg pull simply appends the commits to the revlog, and there’s nothing more to do.
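The patterns a remote specifies are refspecs. For a remote named origin, Git's default fetch refspec is +refs/heads/*:refs/remotes/origin/*, which maps each branch on the other side to a local remote-tracking ref. The Python sketch below imitates that mapping under simplifying assumptions: it handles a single * wildcard only, ignores the leading + (force) flag, and the helper names map_ref and toy_fetch and the commit IDs are made up for illustration.

```python
def map_ref(remote_ref, src_pattern, dst_pattern):
    """Map a remote ref name to a local one using a single-'*' pattern pair.

    Returns None when the remote ref does not match the source pattern.
    """
    src_prefix, src_suffix = src_pattern.split("*", 1)
    if not (remote_ref.startswith(src_prefix) and remote_ref.endswith(src_suffix)):
        return None
    if len(remote_ref) < len(src_prefix) + len(src_suffix):
        return None
    end = len(remote_ref) - len(src_suffix)
    wildcard = remote_ref[len(src_prefix):end]
    return dst_pattern.replace("*", wildcard, 1)


def toy_fetch(local_refs, remote_refs, src_pattern, dst_pattern):
    """Record a local ref for each matching remote ref so its commit stays reachable."""
    updated = dict(local_refs)
    for name, commit_id in remote_refs.items():
        local_name = map_ref(name, src_pattern, dst_pattern)
        if local_name is not None:
            updated[local_name] = commit_id
    return updated


remote_refs = {
    "refs/heads/main": "c",
    "refs/heads/feature": "e",
    "refs/tags/v1": "a",
}
local_refs = toy_fetch({}, remote_refs, "refs/heads/*", "refs/remotes/origin/*")
```

The point of the sketch is that the fetch has to be told where to put refs before it can safely store anything, which is exactly the configuration step that the append-only model avoids.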

Git’s choice to distinguish between reachable and unreachable commits resulted in more complexity for the end user in the form of configuration. Mercurial made no such distinction, and thus no configuration is required. The lesson I take away from this is that it is important to constantly reevaluate your data model as you build software. While this is certainly not a new observation (Rob Pike famously wrote “Data structures, not algorithms, are central to programming.”), it is one that cannot be stated often enough.