Update on Microsoft Word Support, Part 3

Yes, more delays.

Rather than another long-winded post, I figured that this time I’d just give an example of how HTML approaches things compared to Word. Hopefully this will illustrate why I’m such a strong proponent of the former.

Specifying an image in HTML

<img src="filename.png" width="66%">

Specifying an image in docx (method 1)

<w:drawing>
  <wp:inline>
    <wp:extent cx="5270500" cy="3513667"/>
    <wp:docPr id="1" name="Picture 1"/>
    <wp:cNvGraphicFramePr>
      <a:graphicFrameLocks noChangeAspect="1"/>
    </wp:cNvGraphicFramePr>
    <a:graphic>
      <a:graphicData uri="http://schemas.openxmlformats.org/drawingml/2006/picture">
        <dpct:pic>
          <dpct:nvPicPr>
            <dpct:cNvPr id="0" name="image"/>
            <dpct:cNvPicPr/>
          </dpct:nvPicPr>
          <dpct:blipFill>
            <a:blip r:embed="rId1"/>
            <a:stretch>
              <a:fillRect/>
            </a:stretch>
          </dpct:blipFill>
          <dpct:spPr>
            <a:xfrm>
              <a:off x="0" y="0"/>
              <a:ext cx="5270500" cy="3513667"/>
            </a:xfrm>
            <a:prstGeom prst="rect"/>
          </dpct:spPr>
        </dpct:pic>
      </a:graphicData>
    </a:graphic>
  </wp:inline>
</w:drawing>

Specifying an image in docx (method 2)

<w:pict>
  <v:shapetype id="_x0000_t75" coordsize="21600,21600" o:spt="75" o:preferrelative="t" path="m@4@5l@4@11@9@11@9@5xe" filled="f" stroked="f">
    <v:stroke joinstyle="miter"/>
    <v:formulas>
      <v:f eqn="if lineDrawn pixelLineWidth 0"/>
      <v:f eqn="sum @0 1 0"/>
      <v:f eqn="sum 0 0 @1"/>
      <v:f eqn="prod @2 1 2"/>
      <v:f eqn="prod @3 21600 pixelWidth"/>
      <v:f eqn="prod @3 21600 pixelHeight"/>
      <v:f eqn="sum @0 0 1"/>
      <v:f eqn="prod @6 1 2"/>
      <v:f eqn="prod @7 21600 pixelWidth"/>
      <v:f eqn="sum @8 21600 0"/>
      <v:f eqn="prod @7 21600 pixelHeight"/>
      <v:f eqn="sum @10 21600 0"/>
    </v:formulas>
    <v:path o:extrusionok="f" gradientshapeok="t" o:connecttype="rect"/>
    <o:lock v:ext="edit" aspectratio="t"/>
  </v:shapetype>
  <v:shape id="_x0000_i1025" style="width:96pt;height:64pt" type="#_x0000_t75">
    <v:imagedata r:id="rId4" o:title="min"/>
  </v:shape>
</w:pict>

A Tale of Two Standards

TL;DR: UX Write 1.1 will support both .docx and .odt files. You’ll be able to  edit documents created in Microsoft Word or OpenOffice. (Clarification: .odt support has now been postponed till a later release due to the amount of time it’s taken to implement  .docx)

Now for the full story:

One of the most important things to consider when choosing a word processor is the ability for it to work with documents in a format that is compatible with other programs. There’s a lot of reasons for this:

  • You want other people, who may not have the same word processor as you, to be able to read and edit documents you create
  • You may want to work on your document on different platforms, such as your iPad and Windows PC
  • You probably want to ensure that anything important you write today will still be readable far into the future, even after today’s software empires have faded into history

Historically, most word processors used their own proprietary format, which only they could read and write. If you wanted to share documents with other people, they had to be using the same piece of software. And if you later wanted to change to a different word processor, you couldn’t use it to work with your existing documents, unless it included an import filter. Moving from AppleWriter to WordStar to WordPerfect to Microsoft Word was a pain, because each time, you’d have to start using a different file format. You’d also have to either convert all your old documents — or, more commonly, just leave them as-is, hoping that each new word processor would provide import filters for all file formats that had ever existed previously.

A friend of mine recently told me a story about attempting to recover a copy of her honours thesis that she had written in the early 90s using a program called WriteNow, running under NeXTStep. The application was developed entirely in 68k assembly, so even though she had an Intel copy of NeXTStep that could run under VMWare, it wouldn’t work with WriteNow. As it turns out, my friend is probably the only person in my city who owns a still-working NeXTStation (I got to see it running once — an experience I considered as fascinating as seeing a fully restored steam locomotive in operation). She was able to fire up the 20 year-old machine, open the .wn document, and export it to an RTF file, which could then be read on her Macbook Pro — though none of the formatting information survived the transition.

Even if you stick with the same piece of software, over a long enough period of time, the file formats can change. Throughout its history, Microsoft Word has had a total of five different native file formats (with a sixth on its way). Each time there’s a new release of Word in which the file format changes, there’s a period of several years during which people run into problems when they try to send documents to each other, because the sender has upgraded to a new version but the receiver hasn’t.

Sometimes this can even occur with the latest release of the same program on two different platforms. For example, Pages for the iPad stores its documents in a completely different file format to Pages for the Mac, and a behind-the-scenes conversion takes place whenever you exchange documents between the two. Apple have managed to mostly paper over this with iCloud in Mountain Lion, but on a Windows or Linux machine the only way to access your documents is to manually convert each of them one-by-one using either the iCloud.com web interface, or Pages on the iPad itself.

It’s 2012. We shouldn’t have to waste time messing around with stuff like this any more.

Enter open standards

In the early 2000s, a bunch of industry experts got together in an effort to solve this problem once and for all. The idea: define a vendor-neutral format for word processing documents, spreadsheets, and presentations — and make this format open and available for any software developer to implement. The result of this effort was Open Document Format (ODF), which has been ratified as an international standard by the ISO, and is supported by a wide range of office suites — including Microsoft Office, OpenOffice, LibreOffice, Calligra, and Google Docs.

ODF has been widely adopted in the government sector, where there is particular importance placed on storing data in a format suitable for long-term archiving, and avoiding dependence on a single vendor. Using a proprietary format risks the possibility that 30 years from now, documents that you create today won’t be readable anymore, because none of us have any idea what kind of hardware or software we’ll be using then. By using a publicly available, well-documented file format, programmers in the future can easily develop software to read documents in this format — something that is not at all easy to do with proprietary binary formats.

The only reason you’re able to browse the web and receive email on your iPhone is because the Internet is built on open standards. If Tim Berners Lee, who invented the world wide web (also on a NeXTStation), had used a proprietary, undocumented binary file format that could only be read by a program written in 68k assembly for NeXTStep — well, let’s just say that everyone would be spending an awful lot more time outside, because there would be no Facebook or Reddit or even Google.

“The nice thing about standards is that you have so many to choose from.”

– Andrew S. Tanenbaum

Just as with railway gauges, electrical outlets, and television signals, some people felt that one office document standard wasn’t enough, and what the world really needed was a second one. As the development of ODF progressed, Microsoft decided to create their own competing standard – Office Open XML (OOXML) – which served basically the same purpose as ODF but was closely based on the existing file formats already used by Microsoft Office, and thus easier for them to implement.

Like ODF, OOXML documents are zip files containing a collection of XML files and other content such as images. The two standards are quite similar at a conceptual level, though a lot of the details differ, and developing a piece of software that supports both involves significantly more work than just one.

Microsoft submitted OOXML to the ECMA, and subsequently the ISO, to develop it into an international standard. To their credit, this was a major step forward from their earlier efforts to prevent competitors from being able to work with their file formats, and the use of XML made the format immensely easier to deal with in comparison to the binary .doc, .xls, and .ppt formats used previously. However, the submission and process surrounding it met with a great deal of criticism from many on the standards committee and the industry in general, and was initially rejected. Only after the specification was significantly modified to remove the dependencies on legacy and poorly-defined features (such as the autoSpaceLikeWord95 formatting option), was it finally accepted.

The result of this however was not just one standard, but two — “OOXML Strict” and “OOXML Transitional”. The strict version is what finally convinced a sufficient number of committee members to vote for its acceptance, while the transitional version (essentially the original submission that was rejected) got tacked onto the end on the condition that it would only be used for representing documents converted from older formats, not for newly-created documents.

Currently-shipping versions of Microsoft Office only fully support the transitional version of the standard, not the strict version. Office 2010 can read strict documents, but not write them. The upcoming Office 2013 finally adds the ability to both read and write strict documents.

What about HTML — you’re using that already, right?

Yes.

UX Write currently, and always will, use HTML as its native file format. The main reason for this is that it uses WebKit as its layout engine. Writing a layout engine like this is an insanely huge task, and WebKit is an extremely mature engine that been continuously developed for more than twelve years. I’m not sure how many people have been involved in that time, but I would estimate it’s probably had between 100 and 200 man years put into it. Without WebKit, it would not have been feasible for me to develop a program with the functionality of UX Write in one year by myself. In fact, most of the credit for UX Write really belongs to the WebKit developers, since writing the layout engine is most definitely the hardest part of building a word processor.

HTML has two other significant benefits:

  • Its syntax is really simple. You can edit HTML files in a text editor, and it’s very easy to learn. Compared to ODF and OOXML, which both have very complex XML structures, HTML files are small and easy to work directly with by hand.
  • It’s the “lingua franca” of the web. Every web page you’ve ever visited is in HTML format (except for a few built in the recently-deceased and little-mourned Flash), and if you’re publishing online there is no conversion step necessary if you’ve originally created your document as a HTML file. The EPUB standard for e-books is also based on HTML, so it’s ideal for publishing on e-readers as well.

Despite these benefits, HTML is rarely seen as a “word processing” file format — and for good reason. HTML lacks many of the features you typically need for word processing tasks, like headers, footers, footnotes, citations, automatic numbering, and automatically generated tables of contents. All of these can be “faked” using manual editing or javascript (which is what UX Write does in the last two cases, and will do for others in the future), but there’s no way to represent these features in a way that will be recognised by other HTML editors as anything other than regular text. So if you need these features, you’re better off using ODF or OOXML.

Choosing sides — or not

From the above description, you’ve probably got a pretty clear picture of where I stand on file formats. However, let me address one remaining factor: pragmatism.

The reality is that despite the best efforts of many in the industry to promote ODF as “the” standard for office documents, OOXML is by far the most widely used of the two. Even though Microsoft have fully supported ODF in the Windows versions of MS Word since Office 2007 SP2 (though sadly not on the Mac), the default save format is .docx, so that’s what almost everyone saves in (and therefore expects to be able to access on their iPad). It’s been clear to me from the beginning that developing a word processor that can’t open .docx files is a pretty good way to ensure failure, so including this support has always been part of my plan. This was reinforced when I initially released the app and sent a pitch to Walt Mossberg – one of the tech industry’s most respected journalists – and received a very brief reply in which he cited the lack of Word support as a “very, very serious downside”.

Also, in looking into the technical details of both file formats, my opinion (so far) is that OOXML isn’t actually that bad. Sure it has its flaws, and there are some bad design decisions here and there, but both ODF and HTML have their share too. I estimate that the amount of work required to support OOXML is about the same as for ODF, and while there’s more complexity than strictly necessary, it actually has a pretty sensible design (note: I’m referring only to the word processing portion of the standard; spreadsheets are another story).

One clarification I should make is which version of OOXML I’ll be supporting:

“Anyone who claims that their product supports `OOXML’ in an unqualified sense, without stating which conformance target or conformance classes they are supporting, is not stating anything of substance. It is like trying to buy an electrical plug adapter by just saying `I need electricity’”.

Rob Weir

UX Write will only support the transitional version of .docx. There is zero point in using the strict version, because the only product that supports it is Office 2013 (which isn’t even out yet), and due to the several-year lag times involved with people upgrading to new versions of Office, I anticipate that transitional will remain in use for quite a long time. Furthermore, Office 2013 saves as transitional by default (at least in the current preview version), so that’s what almost everyone will use. And since OOXML Strict is a different file format to OOXML Transitional, if you are going to switch formats, you might as well go to ODF, which has much broader support.

So if .docx (transitional) has become the de-facto standard, and is technically relatively ok, why not just support that, and forget about ODF? I’ve been given strong opinions on both sides of this question, but I decided a while back that supporting ODF made sense, for several reasons:

  • First and foremost, everything I’ve said above about open standards. OOXML is “open” on paper — it has been formally certified as an ISO standard — but if you read into the politics behind its development you’ll find that the process fell far short of the openness and level of cooperation typically found in ISO standardisation efforts.
  • ODF is supported by a wide range of applications, and used by a lot of people. Many state and national governments have adopted the format, and OpenOffice has reportedly had over 100 million downloads. That’s a lot of potential customers for UX Write, and I don’t want to ignore that part of the market.
  • Due to the technical similarities between the two, I’m able to use the same implementation techniques for both formats. Doing both formats instead of one is maybe 30-50% extra work, not 100%.

So I’ve said my bit about open standards, but from here on I’m mostly going to be focusing on the technical aspects of the formats and making sure they’re both supported equally well in UX Write. Which of these two formats you prefer is a decision for you to make, and whatever you choose, UX Write will work fine with it. As I continue to add features like footnotes, bibliographic citations, and equation editing, I’ll be doing them in lock-step with both formats, so both will work just as well.

Work on both formats is progressing well, and all the foundations are in place, along with basic editing capabilities. It’s going to be a while yet before I have a release out which supports these two formats, but I anticipate having a version available to beta testers within the next few weeks. When I get time (which is not very often, due to how busy I am with development), I plan to post some more details on what I’ve learnt from implementing the two formats – in particular comparing how each of them represents various features.

LaTeX users, I haven’t forgotten about you

This post is already long enough, but I want to make it clear that I still plan on adding LaTeX support. As someone who’s spent many years in an academic research environment (in the School of Computer Science at Adelaide University), I’m well aware that LaTeX is the de-facto standard in many scientific and engineering fields. I wrote my honours and PhD theses, as well as all of my research papers and course notes, in LyX (a graphical front-end to LaTeX). If it wasn’t for the fact I’m developing UX Write, I’d likely never consider using anything else.

However, I consider this a less-urgent task than ODF/OOXML, particularly since UX Write is based on a semi-WYSIWYG model, and many LaTeX users prefer to write code by hand. After the above two formats are done I’ll be working on other features like find & replace, spell checking, and citations, but LaTeX is still very much on the roadmap. In the meantime, you may be interested in checking out Texpad, the developers of which have recently gotten the first successful port of LaTeX to iOS working.

For the three of you who are still reading this

One of my favourites from XKCD:

XKCD: Standards