Working with Office 2007 documents under Mac OS X: Extracting text from .docx files, more

CNET staff
3 min read

You may have a few Windows-using friends or co-workers who have already made the leap to Office 2007, and may know more who are planning to upgrade in the coming weeks. With its new file formats, Office 2007 creates documents that can't be opened directly in any current version of Office for Mac OS X (v.X or 2004).

Microsoft has promised beta file conversion utilities in Spring 2007 that will allow you to open these files (dubbed "Open XML") in Office 2004 (and possibly Office v.X), but users seeking interoperability are largely left in the lurch until then. There are, however, a few promising means for exchanging documents with Office 2007 users, or at least extracting pertinent data from said documents.

Files created with the new Open XML format used by Office 2007 are actually ZIP packages that contain various XML files as well as images and other data. Since they are archived folders, the "meat" of any document is stored in a directory (or directories) within the document package. For instance, in a .docx document (created by Word 2007), there is a directory labeled word that contains various XML documents with the actual text. For Excel 2007 workbooks (.xlsx), the equivalent directory is named xl; for PowerPoint 2007 presentations (.pptx), it is named ppt.
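To see this package layout without unpacking anything, the archive can be inspected with a few lines of Python. This is only a minimal sketch using the standard zipfile module; the function names and the sample.docx filename are made up for illustration.

```python
import zipfile


def list_package_parts(path):
    """Return the names of all files stored inside an Open XML package."""
    with zipfile.ZipFile(path) as package:
        return package.namelist()


def parts_in_directory(path, directory):
    """Return only the parts stored under one top-level directory, e.g. 'word'."""
    prefix = directory.rstrip("/") + "/"
    return [name for name in list_package_parts(path) if name.startswith(prefix)]


# Example (hypothetical filename):
#   for name in parts_in_directory("sample.docx", "word"):
#       print(name)
```

Because the check is a simple prefix match, the same helper can be pointed at whichever directory a given package uses.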

Manual expansion/stripping

One brute-force alternative for extracting data under Mac OS X is to change the .docx extension of a received Office 2007 document to .zip (e.g. file.docx to file.zip), then double-click the file to expand it. You can then peer inside the expanded folder. Again, the items you want are located in the word directory (or the corresponding directory for other Office applications).
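The rename-and-expand step can also be scripted, since the file is a ZIP archive regardless of its extension. The following is a rough sketch using Python's standard library; expand_package and the file and folder names are hypothetical, and it assumes the document comes from a source you trust.

```python
import zipfile
from pathlib import Path


def expand_package(docx_path, dest_dir):
    """Unpack every part of an Open XML package into dest_dir.

    No renaming to .zip is needed: zipfile looks at the file's contents,
    not its extension. Returns the extracted files as relative paths.
    """
    with zipfile.ZipFile(docx_path) as package:
        package.extractall(dest_dir)
    return sorted(
        p.relative_to(dest_dir).as_posix()
        for p in Path(dest_dir).rglob("*")
        if p.is_file()
    )


# Example (hypothetical names):
#   for name in expand_package("file.docx", "file_expanded"):
#       print(name)
```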

For Word documents, once you've opened the word directory, you'll probably see a series of .xml files with names such as:

  • document.xml
  • endnotes.xml
  • header1.xml

The names are self-explanatory: you'll generally find the body text within the document.xml file.

Once you've found the appropriate file(s), you can either open them in any text editor and manually strip the XML, or you can use a tool like downCast to convert them (without great accuracy) to an RTF (rich-text format) document that can be opened in Word v.X or 2004.

BBEdit, for instance, has a function that will strip most XML tags from documents, leaving plain text. Open the .xml file in BBEdit, select the appropriate text, then go to the "Markup" menu, and select "Utilities" then "Remove Markup." Some other text editors have similar functionality.
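If you'd rather not strip tags by hand, the same result can be approximated in a short script. The sketch below reads word/document.xml straight out of the package and keeps only the text runs, one line per paragraph. It assumes the standard WordprocessingML namespace; docx_to_text is a made-up name, and tabs, line breaks, footnotes, and headers are ignored.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements such as <w:p> and <w:t>.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_to_text(path, part="word/document.xml"):
    """Return the body text of a .docx file, one line per paragraph."""
    with zipfile.ZipFile(path) as package:
        root = ET.fromstring(package.read(part))

    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is split across one or more <w:t> runs.
        runs = [node.text or "" for node in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)


# Example (hypothetical filename):
#   print(docx_to_text("file.docx"))
```

All formatting is discarded, so this is closer to the "Remove Markup" approach than to a real conversion.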

You can practice with some sample files available from OpenXMLDeveloper.org.

docx-converter.com

A Web site dubbed docx-converter can translate a Microsoft Word 2007 .docx file into a simple HTML file. According to its creators, the tool "strips out some of the formatting, but now supports bold, italic, and underlined text. Left, right, center, and justified alignment. Unicode characters, and more!" This is a great interim solution that has the key advantage of retaining some formatting, but the site might buckle under heavy load.

Windows-side saving

Though it's inconvenient and impractical in many real-world cases, one obvious solution to this problem (and the one suggested by Microsoft's Mac BU) is to ask Office 2007 users to save their documents in "Word/Excel/PowerPoint 97-2003" format (.doc, .xls, .ppt). Some document elements might be lost in the process, but this will ensure interoperability with Mac versions of Office.

Wait until January for a new OpenOffice.org release

Novell has stated that it is working on and supporting an open-source project to bring support for opening Office 2007 (Open XML) documents to a coming release of OpenOffice.org, the rival productivity suite that runs under Mac OS X as an X11 application. A CNET article says "By January, Novell said, users of the OpenOffice word processor will be able to read documents saved in the Office Open XML format, the default setting for Microsoft's recently released Office 2007 suite."
