New software takes aim at messy data

Companies are tapping analytical software to streamline reams of botched files, one of the most basic problems of doing business in the digital age.

Ralph Nordstrom was facing the biggest challenge of his career.

Earlier this year, as the new data manager for Parsippany, N.J.-based National Realty Trust (NRT), one of the nation's largest real estate companies, he had to merge files from 750 offices for a company with $106 billion in annual sales. At the same time, NRT was absorbing smaller companies after an acquisition binge. Nordstrom also had to standardize data from 38,000 agents who had calculated commissions inconsistently, causing headaches for the payroll department.

Nordstrom turned to a new generation of analytical software to streamline reams of conflicting, duplicated and otherwise botched files--an unseemly plague that software engineers dub "dirty data" or "bit rot." When computers cannot interpret data, software engineers may lose or delete important files, companies may lose track of customer lists, and executives may be forced to abandon entire business strategies.

"Basically, you have so much data that's wrong or unreadable, and you need that data to support business decisions," said Nordstrom, who works out of NRT's information technology branch in Mission Viejo, Calif. "That situation prevents you from making business decisions with confidence. You make any number of bad decisions because of it."

Dirty data remains one of the most basic and ubiquitous problems of conducting business in the digital age--though few data managers are eager to talk about it. The problem is perceived as too degrading to discuss.

But 75 percent of the information technology directors polled recently by PricewaterhouseCoopers said they experienced problems related to faulty data. Only one-third of IT managers at large corporations said they felt "very confident" about their company's data quality.

"Of course we deal with it," an IT manager in Sunnyvale, Calif., said. "We may not talk about it, but yeah, we deal with it. If you've ever sent two confirmation e-mails to the same customer, you've dealt with it."

Consequences of dirty data range from missing a monthly payment to bungling multimillion-dollar marketing campaigns. It causes retailers to send multiple catalogs or other promotional mailings to the same customer, and it has caused an unfortunate few to send multiple payments to creditors--who may be happy to cash a second or third check.

The problem is especially nagging for e-commerce companies, which pride themselves on their automated business practices yet still struggle with inconsistent or unreadable customer lists. Online retailers say the problem can foil even the most expertly conceived marketing campaign.

The $64,000 question
It is impossible to quantify how much money companies lose because of dirty data, but experts say it is one of the costliest--and most ignored--problems facing businesses. Smaller retailers that cannot afford data-cleanup software--which can cost up to $1 million--are particularly hard-pressed to clean up their data.

How to clean data is "the $64,000 question," said Gary Hennerberg, head of marketing consultancy The Hennerberg Group in Grapevine, Texas. "To some extent, any problem with data is a big problem because it leads to even bigger problems. Cleaning up data is a grand and wonderful notion, however difficult."

The root causes of dirty data are clear in only a small fraction of cases, such as when hard drives are exposed to strong magnetic fields. The vast majority of the causes remain nebulous, although they usually involve human error.

For example, a data entry administrator could type a letter into a field for telephone numbers, botching the entire file. Or a telephone agent taking a catalog order could list a customer as "Ann Marie Smith," while the same customer later orders an item online as "A.M. Smith." The computer interprets the two entries as separate individuals.
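Both slips can be caught with simple checks. The Python sketch below is illustrative only and does not describe any vendor's product: the function names are hypothetical, the phone check assumes 10-digit U.S.-style numbers, and matching on initials is a rough heuristic that can also pair up people who really are different.

```python
import re


def is_valid_phone(value: str) -> bool:
    """Reject stray letters; assume a 10-digit U.S.-style number."""
    # Anything other than digits, spaces, parentheses, dots and hyphens is an error.
    if re.search(r"[^\d\s().-]", value):
        return False
    return len(re.sub(r"\D", "", value)) == 10


def name_key(name: str) -> tuple:
    """Reduce a name to (initials of given names, lowercase surname)."""
    parts = [p for p in re.split(r"[.\s]+", name.strip()) if p]
    if not parts:
        return ("", "")
    *given, surname = parts
    initials = "".join(p[0].lower() for p in given)
    return (initials, surname.lower())


def probably_same_person(a: str, b: str) -> bool:
    """Heuristic only: equal keys suggest, but do not prove, a duplicate."""
    return name_key(a) == name_key(b)
```

Under these assumptions, `probably_same_person("Ann Marie Smith", "A.M. Smith")` returns `True`, and a phone entry containing the letter "O" in place of a zero is rejected before it reaches the file.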

Dirty data is also a likely byproduct of mergers and acquisitions, when systems engineers try to purge legacy systems and merge their data into a standardized, updated system. Almost inevitably, the new system cannot interpret some files from the older one.

One of the most nefarious aspects of dirty data is its capacity to lurk, unchecked, for years. Software engineers rarely spot dirty data until a new project requires an esoteric bit of data, which then turns out to be unreadable or flawed.

That is what happened to the U.S. Bureau of Land Management (BLM), which had to wade through more than 200 years of land-use records so its Lands and Minerals agency could become Year 2000 compliant.

Much of the data was entered by hand during the last two centuries, including oil and gas leases, mining claims throughout the West, and permits for lands withdrawn from government restrictions. When the bureau first installed computers in the 1960s and 1970s, data entry errors were common. Most recently, field offices have had trouble updating files, often putting numbers such as "999999" in the date field.
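A placeholder such as "999999" is easy to flag once the date field is actually parsed rather than stored as raw digits. The sketch below is a minimal illustration, not the bureau's method: the six-digit MMDDYY layout and the function name are assumptions made for the example, since the article does not say how the field was formatted.

```python
from datetime import date, datetime
from typing import Optional


def parse_date_field(raw: str) -> Optional[date]:
    """Return a date for a six-digit MMDDYY value, or None if it is not a real date.

    The MMDDYY layout is an assumption for this example.
    """
    raw = raw.strip()
    if len(raw) != 6 or not raw.isdigit():
        return None
    try:
        # Two-digit years are ambiguous: Python maps 69-99 to 1969-1999 and
        # 00-68 to 2000-2068, which would be wrong for 19th-century records.
        return datetime.strptime(raw, "%m%d%y").date()
    except ValueError:
        # "999999" ends up here: there is no month 99.
        return None
```

Records for which the function returns `None` can be set aside for manual review instead of being loaded into the new system.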

"In the old systems that were hierarchically based, there weren't a lot of edits or constraints you could put on the data. A lot of people put a lot of stuff on the data that wasn't useful," said Leslie Cone, a manager for the Land and Resources Projects office in Denver, which manages the BLM's software projects. "We learned our lessons from that."

The BLM, like many other organizations, turned to an emerging genre of data-cleaning software. The largest suppliers are San Francisco-based Evoke Software and Durham, N.C.-based Metagenix.

Data-cleansing systems alert administrators to errors--ranging from a driver's license number entered in the Social Security number field to different customers with the same last name and address (a likely indication that the family only needs one catalog, for example). The software can also identify when a retailer's separate divisions share customers.
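The two checks described here can be sketched in a few lines of Python. This is a simplified illustration and not how Evoke's or Metagenix's products work: the record layout (`customer_id`, `last_name`, `address`) and the function names are hypothetical, and the address comparison only normalizes case and spacing.

```python
import re
from collections import defaultdict

# Nine digits, with or without the usual hyphens.
SSN_PATTERN = re.compile(r"\d{3}-?\d{2}-?\d{4}")


def looks_like_ssn(value: str) -> bool:
    """True if the value has the shape of a Social Security number."""
    return SSN_PATTERN.fullmatch(value.strip()) is not None


def household_groups(records):
    """Group customer IDs that share a last name and address."""
    groups = defaultdict(list)
    for rec in records:
        # Normalize case and collapse runs of whitespace before comparing.
        address = " ".join(rec["address"].lower().split())
        groups[(rec["last_name"].lower(), address)].append(rec["customer_id"])
    # Keep only the groups with more than one customer.
    return {key: ids for key, ids in groups.items() if len(ids) > 1}
```

A value such as a driver's license number with a leading letter fails `looks_like_ssn`, and any group returned by `household_groups` is a candidate for receiving a single catalog rather than several.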

A relative bargain?
An Evoke suite costs roughly $1 million for license, concurrent session usage, training and consulting. Evoke's 75 customers range from kitchen goods retailer Williams-Sonoma to computer giant IBM. The hefty price tag puts it far out of reach for most small businesses, but Evoke executives insist it is a relative bargain and say they are trying to make inroads among small companies.

"It's not inexpensive, but it's an enterprise-wide solution," said Rick Cortese, chief operating officer of Evoke. "We can reduce time and labor expense in the analysis phase by 35 percent--in most data migration projects...that's multiple millions of dollars."

Cleaning data has become a top priority at companies that have invested heavily in CRM (customer relationship management) software. CRM--which hinges on sophisticated databases to profile customers, boost their satisfaction, and increase revenue--has become one of the hottest trends from Wall Street to Silicon Valley.

Despite their popularity, CRM projects are often foiled by error-laden, inconsistent and otherwise dirty data.

In fact, fewer than a quarter of companies trying to execute CRM projects produce an error-free customer profile, according to a July survey by Gartner. More than 80 percent of systems engineers underestimate the time and resources needed to clean data, and many companies go over budget by 200 percent to 300 percent, according to the Gartner report.

Nordstrom and his employer, NRT, are also Evoke customers and are planning to unveil a new computer system in mid-October. Nordstrom says that once the software is installed and reams of data are cleaned up, NRT will be able to dramatically expand its number-crunching abilities.

For example, NRT will be able to chart profit and loss statements for individual branch offices over time--something that was typically done in static reports without historical context. NRT will also chart profitability and how it relates to demographic information of neighborhoods where it sells homes.

"We haven't had the capability of doing any interesting, detailed analysis on demographics, profitability and that stuff, until now," Nordstrom said. "Once we really get into the analytics, I think we'll wonder what we did in the old days."