This post is the first in a series that outlines my views on the problem of versioning in word processors. I’ll update this post with links to subsequent posts as they’re published.
When multiple people are collaborating on a document, it’s a common requirement for a given person to be able to see what changes others have made, and to approve, reject, and comment on those changes. There’s two main approaches to doing this.
The first approach is change tracking. This involves storing all the content of the original and modified versions in a combined form within a single file, with markers present to indicate where insertions and deletions have occurred, and by whom. The files are often sent around via email, which is simple to work with and understand. Change tracking is well-supported by Microsoft Word.
The problem with change tracking is that you can easily end up with multiple versions of a file laying around on different people’s hard drives, email inboxes, and network shares, and it’s not always obvious which is the “latest” version. Furthermore, if the author of a document makes further changes after sending a previous version out for review, then the modified versions they get back from those they have sent it to are often not easy to integrate into the author’s own updated version. In practice, any time you have multiple people working on a document concurrently, this problem can arise.
The second approach is version control. This involves storing multiple versions (or revisions) of the file as a whole, without any information about changes included within the document versions themselves. Rather, this information is stored separately in a repository – a database which stores all individual versions of a document, and the derivation relationships between them. Popular version control systems include Git and Subversion.
Determining what changes have been made between different versions can be done automatically by software. Tools such as ‘diff’ take two documents as input, and output a description of the changes that have occurred, in a format equivalent to what one might see in a document containing change tracking information. Where there are two or more versions derived from the same original, a process called a three-way merge can be used to incorporate all the changes, potentially marking some parts of the resulting document as “in conflict” because two mutually-exclusive changes have been made to the same part of the file.
While change tracking is the conventional approach for word processing, version control is the conventional approach for software development. The latter is significantly more powerful, due to the needs of developers — it’s not uncommon for there to be tens or hundreds of people involved with a project containing many thousands of files. This requires sophisticated tools and a higher level of skill to work with, but is inherently necessary for the task at hand (try to find a software company or open source project that doesn’t use version control in some form). The same principles can — and I argue, should — be applied to word processing documents. Version control provides a solution to the problems with change tracking mentioned above.
In part 2, I’ll discuss the two types of version control — linear and non-linear — and how they correspond to the workflows people typically use when working with word processing documents.