According to a statement from the company, scientists from Xerox Research Center Europe will announce new software Thursday that can examine the contents of an electronic document and then classify it by subject.
The software, which Xerox intends to license to other technology companies, could be used to automatically route documents into a content management system. Content management is a fast-growing category of business applications that store and catalog corporate text, ranging from e-mail messages to regulatory filings.
Xerox's categorizing software could improve the efficiency of such systems by automating the storage of documents and making it easier for workers to find the documents they need. The system uses a hierarchical method that recognizes relationships between one category and another.
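Xerox has not published the details of its hierarchical method, so the following is only a toy illustration of the general idea: categories are arranged in a tree, and a document is placed by descending from broad categories to narrower ones. The `Category` class, the keyword-overlap scoring and the sample taxonomy are all assumptions invented for this sketch, not part of Xerox's software.

```python
from dataclasses import dataclass, field


@dataclass
class Category:
    """A node in a hypothetical category tree."""
    name: str
    keywords: set
    children: list = field(default_factory=list)


def classify(text, root):
    """Return the path of category names from the top level down."""
    words = set(text.lower().split())
    path = []
    node = root
    while node.children:
        # Score each child by how many of its keywords appear in the text.
        best = max(node.children, key=lambda c: len(c.keywords & words))
        if not best.keywords & words:
            break  # no child matches, so stop at the current level
        path.append(best.name)
        node = best
    return path


# Hypothetical taxonomy, invented for illustration.
taxonomy = Category("root", set(), [
    Category("legal", {"contract", "filing", "regulatory"}, [
        Category("regulatory filings", {"regulatory", "filing"}),
        Category("contracts", {"contract", "agreement"}),
    ]),
    Category("finance", {"invoice", "budget", "payment"}),
])

print(classify("Quarterly regulatory filing for review", taxonomy))
# ['legal', 'regulatory filings']
```

Because each decision is made only among the children of the current node, a document that clearly belongs under "legal" is never compared against unrelated finance subcategories, which is one way a hierarchy can exploit the relationships between categories.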
"A misshelved book in a library might as well be lost," Xerox researcher Eric Gaussier said in the statement. "It's the same with documents that haven't been properly categorized; the document itself may have to be re-created ... Our new software ... will ensure that documents are properly classified for future retrieval and that the right information gets into the right hands as quickly as possible."
The technology could also be used to automatically route e-mail messages to the correct person in an organization, Xerox said.
The software uses machine-learning techniques to minimize setup and to recognize new categories of documents as they emerge, Xerox said. The Java-based code can parse documents in more than 20 languages and work with systems based on Unix, Linux and Windows.
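The statement does not say which machine-learning techniques are involved. As a rough sketch of how a classifier can learn categories from labeled examples instead of hand-written rules, the code below uses a multinomial naive Bayes model with add-one smoothing, a common textbook technique swapped in here purely for illustration. The function names and the training messages are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Count words per label from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts


def predict(text, word_counts, label_counts):
    """Return the label with the highest naive Bayes log-probability."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(label_counts.values())
    scores = {}
    for label, n_docs in label_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # prior for this label
        for word in text.lower().split():
            # Add-one smoothing so unseen words do not zero out a label.
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical training data, invented for illustration.
examples = [
    ("invoice payment overdue", "billing"),
    ("refund for duplicate payment", "billing"),
    ("password reset not working", "support"),
    ("cannot log in to account", "support"),
]
model = train(examples)
print(predict("payment refund request", *model))
# billing
```

The same predicted label could serve as a routing key, for example to forward an incoming e-mail message to the billing or support team, and retraining on newly labeled messages is how such a model would pick up additional categories over time.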