
Editing tips from the NSA

Advice comes after embarrassing incidents in which sensitive data was unintentionally stored in electronic documents.

Joris Evers Staff Writer, CNET News.com
Joris Evers covers security.
Hiding confidential information with black marks works on printed copy, but not with electronic documents, the National Security Agency has warned government officials.

The agency makes the point in a guidance paper on editing documents for release, published last month following several embarrassing incidents in which sensitive data was unintentionally included in computer documents and exposed. The 13-page paper is titled "Redacting with confidence: How to safely publish sanitized reports converted from Word to PDF."

Instead of covering up digital text with black boxes, it is better to delete any information you don't want to share, the NSA suggested.

"The key concept for understanding the issues that lead to...inadvertent exposure is that information hidden or covered in a computer document can almost always be recovered," the NSA wrote in the Information Assurance Division paper, dated Dec. 13 but only recently posted to the Web. "The way to avoid exposure is to ensure that sensitive information is not just visually hidden or made illegible, but is actually removed."

Three common mistakes

There are a number of pitfalls for people trying to amend a sensitive Word document for public release as a PDF. Here is the NSA's advice on typical traps.

Redaction of text and diagrams
Covering text, charts, tables or diagrams with black rectangles, or highlighting text in black...is not effective, in general, for computer documents distributed across computer networks (i.e. in "softcopy" format). The most common mistake is covering text with black.

Redaction of images
Covering up parts of an image with separate graphics such as black rectangles, or making images "unreadable" by reducing their size, has also been used for redaction of hardcopy printed materials. It is generally not effective for computer documents distributed in softcopy form.

Metadata and document properties
In addition to the visible content of a document, most office tools, such as (Microsoft) Word, contain substantial hidden information about the document. This information is often as sensitive as the original document, and its presence in downgraded or sanitized documents has historically led to compromise.

Source: NSA Information Assurance Division report

The unintended disclosure of metadata, resulting in high-profile leaks of secrets, has led to red faces at businesses and government bodies in the past. In March 2004, a gaffe by the SCO Group revealed which companies it had considered targeting in its legal campaign against Linux users.

More recently, pharmaceutical giant Merck was put in the hot seat because of changes made to a document regarding the painkiller Vioxx. There have also been document data leaks at the White House, the Pentagon, the United Nations and others, according to research compiled by Workshare, a maker of software that strips tell-tale hidden data out of files.

There have been so many stumbles that the NSA document should be welcome help, said Pete Lindstrom, an analyst with Spire Security in Malvern, Pa.

"It ends up being a really big exercise in public humility because it is an embarrassing issue," he said. "It affects governments more than anyone else."

Cleaning up
Government analysts make three main missteps that will jeopardize confidentiality when sanitizing documents, according to the NSA report. "The most common mistake is covering text with black," the agency said. While this works for printed material, "it is not effective, in general, for computer documents."

The second top goof is similar: In this case, workers cover up graphics and other images with new graphics, such as a black rectangle. As with blacked-out text, a recipient of the document can often delete the coverings and see the information that is intended to be hidden. The third gaffe is failure to remove information about the document, such as change history, author name and creation dates, known as metadata.
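The difference between covering and removing can be illustrated with a toy model. The Python sketch below is not how Word or PDF actually store content; the `Document` class, its fields and the sample strings are hypothetical, invented only to show why a cover-up leaves the underlying data recoverable while removal does not.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Toy stand-in for a softcopy document (hypothetical, not a real file format)."""
    text_runs: list = field(default_factory=list)  # underlying text content
    overlays: list = field(default_factory=list)   # graphics drawn on top of the text
    metadata: dict = field(default_factory=dict)   # author, change history, dates


def cover_with_black(doc, index):
    """The common mistake: draw a black box over a run, leaving the text in place."""
    doc.overlays.append({"shape": "black_rectangle", "covers": index})


def remove_and_replace(doc, index, filler="[REDACTED]"):
    """Actually remove the sensitive run and substitute harmless filler."""
    doc.text_runs[index] = filler


def strip_metadata(doc):
    """Drop hidden document properties before release."""
    doc.metadata.clear()


def extract_text(doc):
    """What a recipient can recover by ignoring or deleting the overlays."""
    return " ".join(doc.text_runs)


def make_sample():
    """Build a made-up document with one sensitive run and one metadata field."""
    return Document(
        text_runs=["The source is", "AGENT X", "in the field."],
        metadata={"author": "analyst1"},
    )


covered = make_sample()
cover_with_black(covered, 1)

sanitized = make_sample()
remove_and_replace(sanitized, 1)
strip_metadata(sanitized)

print(extract_text(covered))    # the covered name is still present in the text
print(extract_text(sanitized))  # the name is gone, replaced by filler
```

In this model the black rectangle only adds an object on top of the text, so anyone who reads the text runs directly still sees the name. Only the version that overwrites the run and clears the metadata is safe to release.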

To avoid such blunders, the NSA paper gives step-by-step instructions on how to strip a Microsoft Word document of confidential information and then convert it to an Adobe Systems PDF file. The advice deals with text passages and images in the document, as well as with metadata.

Both the Word and Adobe PDF formats can contain many kinds of information--such as text, graphics, tables, images and metadata--all mixed together. "The complexity makes them potential vehicles for exposing information unintentionally, especially when downgrading or sanitizing classified materials," the NSA said.

Microsoft Word is used throughout the Department of Defense and the intelligence community, while Adobe PDF is used "very extensively" by all parts of the U.S. government and military services, the agency said. It noted that government bodies often distribute cleaned-up documents in PDF format, and cautioned: "As numerous people have learned to their chagrin, merely converting an MS Word document to PDF does not remove all metadata automatically."

Metadata methodology
Metadata could become an increasing problem in the future, Gartner analysts warned recently. Vista, the next version of the Microsoft Windows operating system, will let people tag files with metadata to improve search capabilities, Microsoft has said. But those tags could lead to unwanted disclosure of information, Gartner analysts said.

Microsoft provides some tools to remove metadata in its Office applications, and it built a feature into Word 2003 to remove personal information. However, these do not remove sensitive data from the main document, nor do they remove all metadata of possible concern, the NSA said.

Adobe supports the agency's guidance for proper editing techniques and is developing additional documentation for other customers, John Landwehr, director of security solutions and strategy for the San Jose, Calif., technology company, said in a statement via e-mail.

"As the NSA points out, it's very important to actually remove the redacted content from an electronic document--not just leave the data in a document and attempt to graphically cover it," he said.

Following the guidelines will effectively clean a document, said Joe Fantuzzi, chief executive of San Francisco-based Workshare, but could be challenging for the less tech-savvy.

"They are way too complicated. It is going to take too long for people to do the right thing, and people are going to continue to make mistakes," he said.

Meanwhile, the NSA paper itself contains a bit of metadata. According to its cover, the paper was created on Dec. 13, 2005. The properties of the Adobe PDF file, however, state that the document was created on Jan. 10, 2006.
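A mismatch of this kind can be spotted by reading a file's own properties. The Python sketch below searches raw PDF bytes for a `/CreationDate` entry of the sort stored in a PDF's document information dictionary. It is a rough illustration under stated assumptions: the function name `find_creation_dates` and the sample bytes are made up for this example, and a plain byte search will miss entries that are compressed, encrypted or kept only in XMP metadata.

```python
import re
from datetime import date

# Matches entries such as: /CreationDate (D:20060110093000-05'00')
CREATION_DATE = re.compile(rb"/CreationDate\s*\(D:(\d{4})(\d{2})(\d{2})")


def find_creation_dates(pdf_bytes):
    """Return every creation date found in uncompressed PDF bytes."""
    return [
        date(int(y), int(m), int(d))
        for y, m, d in CREATION_DATE.findall(pdf_bytes)
    ]


# Hypothetical fragment of an information dictionary, not taken from the NSA file.
sample = b"<< /Producer (Example) /CreationDate (D:20060110093000-05'00') >>"
cover_date = date(2005, 12, 13)  # the date printed on the paper's cover

for found in find_creation_dates(sample):
    if found != cover_date:
        print(f"Cover says {cover_date}, file properties say {found}")
```

To check a real file, the same function could be given the bytes read from disk in binary mode. A dedicated PDF library would be more reliable than this byte search for anything beyond a quick look.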