Search specialist stakes its claim on names

Language Analysis Systems plays the terrorism card to pitch tools that make sense of "Smith," "Smythe," "Smits" and "Schmidt."

Michael Kanellos Staff Writer, CNET News.com
Michael Kanellos is editor at large at CNET News.com, where he covers hardware, research and development, start-ups and the tech industry overseas.
Why would a company or government agency need to adopt specialized search technology just for finding references to a person's name?

Among other reasons: Because Raouf Gadi, Elraouf Djeddi and Abdulrauf Aljadai might be the same person--and a regular search engine wouldn't reflect that.

Language Analysis Systems, or LAS, of Herndon, Va., has devised a series of tools for solving one of the thornier, but often overlooked, problems in search: finding data on a particular individual in a multicultural, error-prone world. The company's software takes into account alternative spellings, cultural nuances and other linguistic issues as part of an attempt to return the most relevant information for a search query, rather than a laundry list of close matches.

The company's tools are mostly sold to law enforcement, intelligence and border agencies, but financial institutions and other businesses--hoping to ferret out fraud or merely improve their customer databases--have begun to adopt the technology too, said Jack Hermansen, chief executive and co-founder of LAS.

Since the terror attacks of Sept. 11, 2001, security and intelligence agencies have been hunting vigorously for technology that will help them gather information on potential terrorists. Pixlogic, for instance, has developed software meant to spot anomalies or suspicious individuals in videotape from security cameras. Language Weaver, meanwhile, has come up with an Arabic-English real-time translation tool.

"The penalties for missing a name are enormous," Hermansen said. "Someone could die."

Contrary to what one might think, names aren't great search terms. Handles like "Bob Johnson" or "Ted Smith" are broad and pull up thousands of false positives. Even if the searcher types the name correctly, a typo in a document being sought or the use of a nickname could mean that the results omit needed information, or that a crucial link won't pop up on the first few screens of search results.

Cultural and linguistic differences compound the problem, Hermansen said. Someone called "Paul Ho" in the United States could easily be known as "Ho Wan Lee" in Hong Kong, and different documents may show his name appearing variously in Roman and Chinese characters.

Often, U.S. companies and agencies also mangle their data. One of the most common mistakes comes from assuming that the second element in a multipart name, such as "Maria Sanchez de Rodriguez," is a middle name.

"There seems to be an ethnocentric naivete in the U.S.," Hermansen said. "Trying to put a six-part Arabic name into a first-middle-last-name construction is going to raise havoc."


The subjects of these searches, of course, often try to avoid detection. Many years ago, although he was on a watch list, Mir Aimal Kansi got through customs and entered the United States using a common variation of his Urdu name--Kasi. In 1993, he killed two CIA operatives. (He was subsequently captured, convicted and executed.)

Some of the efforts by officials to correct these problems are comical, Hermansen said. One suggestion from the 9/11 Commission was that the U.S. government standardize spellings of names such as "Mohammed."

"You're going to tell naturalized citizens that they have to spell 'Mohammed' the same way," said Hermansen, who has lobbied against the directive. "Everyone would like it to be simple, but it isn't going to go away."

The software packages offered by LAS vary by function, and many customers deploy a combination of applications. The software is complemented by a database of 850 million names compiled and analyzed by LAS during the last 20 years. The company buys lists of names from information clearinghouses; to protect privacy, no personal information is collected, and first names are separated from last names before being delivered to LAS.

One tool, NameVariationGenerator, generates a list of common variations of a name--"Akbar," for instance, can be represented 20 different ways--and then searches for these variants.
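The generate-then-search idea can be sketched in a few lines of Python. This is only an illustration of the general approach, not LAS's method: the substitution table, the function names `variants` and `find_mentions`, and the sample documents are all hypothetical, and a real system would draw on linguistic rules rather than blind letter swaps.

```python
from itertools import product

# Hypothetical substitution table, invented for illustration only.
# Each letter maps to the spellings it might be swapped with.
ALTERNATIVES = {
    "k": ["k", "c", "q"],
    "a": ["a", "e"],
}


def variants(name, table=ALTERNATIVES):
    """Return every spelling produced by swapping letters per the table."""
    options = [table.get(ch, [ch]) for ch in name.lower()]
    return sorted({"".join(combo).capitalize() for combo in product(*options)})


def find_mentions(name, documents):
    """Return the documents that contain any generated variant as a word."""
    wanted = {v.lower() for v in variants(name)}
    # Splitting on whitespace is a simplification; punctuation is not handled.
    return [doc for doc in documents if wanted & set(doc.lower().split())]


if __name__ == "__main__":
    print(variants("Akbar"))
    print(find_mentions("Akbar", ["met Aqbar yesterday", "no match here"]))
```

With this toy table, "Akbar" expands to 12 spellings (two choices for each "a," three for the "k"), which also shows how quickly the variant list grows as rules are added.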

MetaMatch picks up phonetically similar names, such as "Leighton," "Leyton" and "Leaton." And NameHunter searches for names while paying attention to letter variations and cultural variations.

An application coming soon will transliterate names in Arabic, Chinese, Korean and Thai characters into Roman characters and then apply the search and classification techniques. Thai characters are important, Hermansen said, in part because Thailand is "one of the major centers of drug activity."

Additionally, LAS has begun to license its software to other corporate application and database developers, who then add name search as a module into their products.

In the past two years, demand has increased rapidly. Revenue at privately held LAS grew by 80 percent from 2003 to 2004 and will grow at an even faster rate this year, Hermansen said. He declined to provide details.

A life of names
Hermansen's introduction to the linguistics of names started early. He grew up after World War II in Greece, where his father helped administer the Marshall Plan. Senators and congressmen would visit regularly and give speeches, and Hermansen's father coached them on how to say "thank you" in Greek: "efharisto."

"He'd tell them, 'It's easy. Just think of the name 'F. Harry Stowe,'" Hermansen recalled. "At the end of the speech, they'd always say 'Harry F. Stowe.' Everyone in the audience would just hoot and applaud."

Later, Hermansen obtained a doctorate in computational linguistics from Georgetown University. His dissertation was on the flaws of the Soundex system, a name classification system developed to analyze data from the 1890 census that is still in use.

In the Soundex system, a number represents a group of similar-sounding consonants. For instance, a "5" can be an "M" or an "N." Encoding a name involves keeping its first letter, dropping the vowels and replacing the remaining consonants with numbers.

The limitations are many. "'Kanellos' is the same as 'Kiematteg,'" Hermansen said. "And if 'Kanellos' is ever spelled with a 'C,' you will never find it...It is a very crude device."
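For readers who want to see the scheme in action, here is a minimal Python sketch of the commonly documented American Soundex rules: keep the first letter, code the remaining consonants, collapse adjacent duplicates and pad to four characters. The function name `soundex` and the handling of "H" and "W" follow the usual published description; it is not taken from LAS or from Hermansen's dissertation.

```python
# Standard Soundex consonant groups; vowels, H, W and Y carry no code.
GROUPS = {
    "1": "BFPV",
    "2": "CGJKQSXZ",
    "3": "DT",
    "4": "L",
    "5": "MN",
    "6": "R",
}
CODES = {ch: digit for digit, letters in GROUPS.items() for ch in letters}


def soundex(name):
    """Return the four-character Soundex code for a name."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = [letters[0]]                # the first letter is kept as-is
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        if code and code != prev:        # skip repeats of the same code
            result.append(code)
        if ch not in "HW":               # H and W do not separate duplicates
            prev = code
    return ("".join(result) + "000")[:4]  # pad or truncate to four characters


if __name__ == "__main__":
    for name in ("Smith", "Smythe", "Schmidt", "Smits", "Kanellos", "Canellos"):
        print(name, soundex(name))
```

Under these rules "Smith," "Smythe" and "Schmidt" all come out as S530, while "Smits" becomes S532. "Kanellos" is coded K542 but "Canellos" is C542, so a search keyed on one spelling never reaches the other, which is the first-letter weakness Hermansen describes.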

While he was working on the dissertation, the U.S. State Department put out a request for proposal for a names database. Hermansen won the project, which morphed into LAS. Although the company initially garnered its revenue almost exclusively from consulting contracts with government agencies, it started offering commercial products in 2002.

"It was an eight-month contract that turned into a 20-year career," Hermansen said.