
Researchers work to eradicate broken hyperlinks

Computer science researchers at the University of California at Berkeley say they have come a step closer to solving a frustrating problem familiar to most Web surfers.

Evan Hansen Staff Writer, CNET News.com
Researchers at the University of California at Berkeley say they have come a step closer to solving a frustrating problem familiar to most Web surfers: the broken hyperlink.

In a recent academic paper, computer scientists Thomas A. Phelps and Robert Wilensky outlined a way to create links among Web pages that will work even if documents are moved elsewhere. Although researchers have tried to tackle the issue before, Internet search experts said the paper describes a potentially elegant solution to a widespread and long-recognized puzzle.

"It's a pretty clever way of dealing with a very difficult problem," said Ron Daniel, who once worked on an alternative solution that has been submitted to the Internet Engineering Task Force, an online standards body.

A key feature of the Web is its ability to take readers instantly to related documents through hyperlinks. Some consider it the soul of the medium. But as many as one in five Web links that are more than a year old may be out of date, according to Andrei Broder, vice president of research at search engine AltaVista. When surfers click on such links, they get a "404 error" message.

"The rate of change on the Web is very fast," he said. "And the more active a Web site is, the quicker it changes."

In their paper, Phelps and Wilensky say the preliminary results of their research indicate that the vast majority of documents on the Web can be uniquely identified by a small set of words that no other document shares. This set of words can be used to augment the standard URL (Uniform Resource Locator), or Web address, and turn up the page if it goes missing.
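To picture how such an augmented address might look, consider the sketch below, which carries the signature in the URL's fragment, a part of the address that is never sent to the server, so ordinary links keep working unchanged. The fragment key and most of the example words are illustrative assumptions, not necessarily the paper's exact encoding.

```python
# A minimal sketch of a "robust" link: a plain URL augmented with a
# five-word lexical signature. Carrying the signature in the fragment
# is an assumption for illustration (fragments are not sent to the
# server, so existing browsers and servers are unaffected).
from urllib.parse import quote

def robust_url(url: str, signature_words: list[str]) -> str:
    """Append the signature words to the URL as a fragment."""
    sig = "+".join(quote(w) for w in signature_words)
    return f"{url}#lexical-signature={sig}"

# "peroperties" is the one term the article actually attributes to the
# paper; the other words here are made up for the example.
print(robust_url("http://www.example.edu/papers/robust.html",
                 ["peroperties", "hyperlink", "signature", "lexical", "locator"]))
```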

One of the things that makes the research interesting, Wilensky said, is the low number of terms required.

"It takes about five words to uniquely identify a page if you pick the words cleverly and the page is still out there somewhere," he said.

If a document's URL changes, a search engine could be employed to automatically locate the missing page based on the five terms.
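A resolver built on that idea might behave roughly like the following sketch: fetch the URL, and only if it returns a "404" fall back to a search on the signature words. The search step here is a hypothetical stand-in; a real implementation would call whatever search-engine API is available.

```python
# Sketch of resolving a robust link: try the URL first, and only on a
# 404 fall back to searching for the page by its signature words.
import urllib.error
import urllib.request

def search_top_hit(query: str) -> str | None:
    """Hypothetical stand-in for a real search-engine API call.

    A working resolver would submit `query` to a search engine and
    return the URL of the top result.
    """
    return None  # placeholder

def resolve(url: str, signature_words: list[str]) -> str | None:
    try:
        with urllib.request.urlopen(url):
            return url  # the link still works
    except urllib.error.HTTPError as err:
        if err.code != 404:
            raise
    # The page has moved or vanished: hand the signature words to a
    # search engine and take the top hit.
    return search_top_hit(" ".join(signature_words))
```

Because the fallback fires only on a 404, working links behave exactly as before, which is part of what makes the scheme compatible with the Web as it already exists.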

"What makes this possible is that you already have a search engine infrastructure," said Wilensky, a professor of computer science at UC Berkeley, who gave most of the credit for the work to Phelps, a postdoctoral student. "You're 'bootstrapping' onto something that's already been built."

Wilensky also noted that the system would rely primarily on Web publishers rather than on a third-party administrator, a dependency that had become a hurdle for some earlier proposals.

AltaVista's Broder concurred that the results of the research were promising, reflecting similar research he has conducted on "strong queries"--or complex searches--in which he found that any document can be uniquely identified using eight carefully selected terms.

"The trick is to find the right formula of rare words that are also important to the meaning of the document," he said.

But Broder warned that the approach carries a risk: a selected word may later be edited out of the document, rendering the identifier useless. For example, he said, one of the identifying terms for the Phelps and Wilensky paper itself is a misspelling, "peroperties," which would disappear if the typo were ever corrected.

He said the most promising element of the work was the fact that it is compatible with existing systems.

"There is a chicken-and-egg problem involved," he said. "None of the big players will adopt (this kind of system) until a lot of people start using it."

Daniel, who said he has given up active research on the problem in part because of a lack of commercial interest in his work, said Phelps and Wilensky may have hit on a way to solve two parts of a three-part problem: determining an identifier and establishing how the identifier will be linked to a document over the long haul.

But Daniel said they haven't figured out what to do with pages that are deleted from the Web altogether.

"Storage is an interesting issue," he said, adding that intellectual property concerns and rights management could become an issue down the road. "At some point perhaps libraries will evolve into taking an active role in indexing pages. But that will depend on publishers giving out the necessary licensing."