This post is the first in a series that outlines my views on the problem of versioning in word processors. I’ll update this post with links to subsequent posts as they’re published.
When multiple people are collaborating on a document, it’s a common requirement for a given person to be able to see what changes others have made, and to approve, reject, and comment on those changes. There are two main approaches to doing this.
The first approach is change tracking. This involves storing all the content of the original and modified versions in a combined form within a single file, with markers present to indicate where insertions and deletions have occurred, and by whom. The files are often sent around via email, which is simple to work with and understand. Change tracking is well-supported by Microsoft Word.
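To make the "combined form" concrete, here is a minimal sketch of a change-tracked document as a list of text runs, each marked as unchanged, inserted, or deleted, along with the author. The `Run` class, the `accept_all` and `reject_all` helpers, and the sample sentence are my own illustrative assumptions; this is not how Word or any real file format stores tracked changes.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """A span of text plus a marker saying whether it is a tracked change."""
    text: str
    kind: str = "unchanged"  # "unchanged", "inserted", or "deleted"
    author: str = ""         # who made the change (empty for unchanged text)


def accept_all(runs):
    """Return the modified version: keep insertions, drop deletions."""
    return "".join(r.text for r in runs if r.kind != "deleted")


def reject_all(runs):
    """Return the original version: keep deletions, drop insertions."""
    return "".join(r.text for r in runs if r.kind != "inserted")


# One file holds both versions at once (hypothetical example content).
doc = [
    Run("The meeting is on "),
    Run("Monday", "deleted", "alice"),
    Run("Tuesday", "inserted", "alice"),
    Run("."),
]

print(reject_all(doc))  # the original text
print(accept_all(doc))  # the text with every change accepted
```

Both the original and the modified text can be recovered from the same list, which is what makes this form easy to email around as a single file.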
The problem with change tracking is that you can easily end up with multiple versions of a file lying around on different people’s hard drives, email inboxes, and network shares, and it’s not always obvious which is the “latest” version. Furthermore, if the author of a document makes further changes after sending a previous version out for review, then the modified versions they get back from reviewers are often not easy to integrate into the author’s own updated version. In practice, any time you have multiple people working on a document concurrently, this problem can arise.
The second approach is version control. This involves storing multiple versions (or revisions) of the file as a whole, without any information about changes included within the document versions themselves. Rather, this information is stored separately in a repository – a database which stores all individual versions of a document, and the derivation relationships between them. Popular version control systems include Git and Subversion.
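As a rough illustration of what such a repository holds, the sketch below stores each revision's complete text together with the revisions it was derived from. The `Repository` class, its method names, and the revision ids are hypothetical and chosen only for this example; real systems such as Git and Subversion use far more elaborate storage.

```python
from dataclasses import dataclass, field


@dataclass
class Repository:
    """Toy repository: whole-document versions plus derivation links."""
    contents: dict = field(default_factory=dict)  # revision id -> full text
    parents: dict = field(default_factory=dict)   # revision id -> parent ids

    def commit(self, rev_id, text, parents=()):
        """Store a complete version and record what it was derived from."""
        self.contents[rev_id] = text
        self.parents[rev_id] = list(parents)

    def ancestors(self, rev_id):
        """Return rev_id and every revision it was derived from."""
        seen, stack = set(), [rev_id]
        while stack:
            rev = stack.pop()
            if rev not in seen:
                seen.add(rev)
                stack.extend(self.parents[rev])
        return seen

    def common_ancestors(self, a, b):
        """Revisions that both a and b were derived from."""
        return self.ancestors(a) & self.ancestors(b)


repo = Repository()
repo.commit("v1", "First draft.")
repo.commit("v2-alice", "First draft, edited by Alice.", parents=["v1"])
repo.commit("v2-bob", "First draft, edited by Bob.", parents=["v1"])

print(repo.common_ancestors("v2-alice", "v2-bob"))
```

Note that the stored documents contain no change markers at all; everything about who derived what from what lives in the `parents` mapping.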
Determining what changes have been made between different versions can be done automatically by software. Tools such as ‘diff’ take two documents as input, and output a description of the changes that have occurred, in a format equivalent to what one might see in a document containing change tracking information. Where there are two or more versions derived from the same original, a process called a three-way merge can be used to incorporate all the changes, potentially marking some parts of the resulting document as “in conflict” because two mutually-exclusive changes have been made to the same part of the file.
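The two operations described above can be sketched with Python's standard `difflib` module. What follows is a simplified illustration under my own assumptions: `word_diff` and `merge3` are hypothetical helper names, the conflict markers merely imitate the style Git uses, and the merge is a naive line-based version of the idea rather than the algorithm any particular tool implements.

```python
from difflib import SequenceMatcher


def word_diff(old, new):
    """Describe how `old` became `new` as (kind, text) pairs, word by word."""
    a, b = old.split(), new.split()
    changes = []
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            changes.append(("equal", " ".join(a[i1:i2])))
        if tag in ("delete", "replace"):
            changes.append(("delete", " ".join(a[i1:i2])))
        if tag in ("insert", "replace"):
            changes.append(("insert", " ".join(b[j1:j2])))
    return changes


def _unchanged(base, other):
    """Map index in `base` -> index in `other` for lines left unchanged."""
    mapping = {}
    matcher = SequenceMatcher(None, base, other, autojunk=False)
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            mapping[block.a + k] = block.b + k
    return mapping


def merge3(base, ours, theirs):
    """Naive three-way merge of lists of lines sharing the ancestor `base`."""
    in_ours, in_theirs = _unchanged(base, ours), _unchanged(base, theirs)
    # Base lines that survive in both versions act as synchronisation points.
    sync = [k for k in range(len(base)) if k in in_ours and k in in_theirs]
    merged = []
    i = io = it = 0  # current positions in base, ours, theirs
    for k in sync + [None]:
        if k is None:  # final chunk after the last synchronisation point
            end, eo, et = len(base), len(ours), len(theirs)
        else:
            end, eo, et = k, in_ours[k], in_theirs[k]
        b, o, t = base[i:end], ours[io:eo], theirs[it:et]
        if o == t:        # both sides agree (or neither changed anything)
            merged.extend(o)
        elif o == b:      # only "theirs" changed this chunk
            merged.extend(t)
        elif t == b:      # only "ours" changed this chunk
            merged.extend(o)
        else:             # both changed it differently: mark a conflict
            merged += ["<<<<<<< ours", *o, "=======", *t, ">>>>>>> theirs"]
        if k is not None:
            merged.append(base[k])
            i, io, it = k + 1, eo + 1, et + 1
    return merged


base = ["Agenda", "The meeting is on Monday.", "Bring a laptop."]
ours = ["Agenda", "The meeting is on Tuesday.", "Bring a laptop."]
theirs = ["Agenda", "The meeting is on Monday.", "Bring a laptop.", "Lunch is provided."]

print(word_diff(base[1], ours[1]))
print(merge3(base, ours, theirs))
```

Here the two edits touch different parts of the document, so the merge can combine them automatically. If both versions had rewritten the second line in different ways, that part of the result would be wrapped in conflict markers for a person to resolve.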
While change tracking is the conventional approach for word processing, version control is the conventional approach for software development. The latter is significantly more powerful, due to the needs of developers — it’s not uncommon for there to be tens or hundreds of people involved with a project containing many thousands of files. This requires sophisticated tools and a higher level of skill to work with, but is inherently necessary for the task at hand (try to find a software company or open source project that doesn’t use version control in some form). The same principles can — and I argue, should — be applied to word processing documents. Version control provides a solution to the problems with change tracking mentioned above.
In part 2, I’ll discuss the two types of version control — linear and non-linear — and how they correspond to the workflows people typically use when working with word processing documents.
Yes, I agree versioning should be everywhere! http://www.cregox.com/blog/2014/12/8/time-to-focus-on-versioning
And definitely applying it to wysiwyg is quite a challenge. But I still find it odd even today there’s not a single word processor capable of doing it. Maybe it’s just too much to do! :O
PS: I think there is an important aspect of version-control that would be valuable baked into change-tracking, and that is preservation of change history. It also seems important with regard to collision resolution in collaboratively-edited documents. I’m a bit intrigued by how this works with Git, and wonder if that provides a metaphor that translates to multi-part XML-based document representations such as those using ODF and OOXML packages.
Yes, I think that history is an extremely important aspect, particularly when one wants to audit all changes to a document and get a guarantee that what is presented as being changed is *actually* what has changed (it’s very easy to hide extra changes in a change-tracked document by manipulating the XML). This is particularly relevant for legal contracts and policy documents.
One issue I plan to cover in a future post is the way in which one can convert between the two paradigms. For example, one could take a Word document with multiple sets of changes from different people or editing sessions, and mechanically transform it into a Git repository. And one could also go the other way by taking a Git repository, linearising the history graph, and producing a Word document that includes a representation of all changes that have ever been made (or have been made between two specified revisions).
More thoughts to come on this soon 🙂
Hi Peter. Interesting approach. I have been looking at the change-tracking end of this in the context of ODF (http://nfoworks.org/rct/). I think diffing is harder for WYSIWYG formats.
I encourage you to check out DChanges 2014 and the call for position papers and work-in-progress summaries at that workshop, to be held in Colorado in mid-September. Here’s one of the requests that was sent around: https://lists.oasis-open.org/archives/office-comment/201406/msg00001.html
Thanks Dennis,
I was made aware of this recently and it definitely sounds like an interesting conference. Unfortunately I won’t be able to make it this year, but I will certainly be looking at the proceedings with great interest.
My thoughts on this matter are still forming, and I’ve had some interesting discussions with others recently about it. I hope to attend the workshop next year and present there.