Porn sneaks past search filters

Search companies are increasingly turning to censorware to court G-rated customers such as corporations, schools and parents, but they're still showing too much skin.

Paul Festa Staff Writer, CNET News.com

The shortcomings of porn filters were on display last week when Google launched a test version of a search engine for images with an optional filter for what it terms "inappropriate adult content." Even with the filter turned on, Google is serving a healthy dose of pornographic images, often for keywords with primarily nonsexual meanings.

"The filter removes many adult images, but it can't guarantee that all such content will be filtered out," Google acknowledges on its Web site. "There is no way to ensure with 100 percent accuracy that all adult content will be removed from image search results using filters."

Google is hardly alone in the uphill battle to filter pornographic and other sensitive images. Technology companies devoted to image recognition acknowledge that the state of the art is still crude, yielding inexact results at the cost of computing power.

While technologists struggle to improve their tools, the market for image filtering is the subject of dispute. Google cites the need to protect its "sensitive" users, while search destination AltaVista touts its own filter as indispensable.

"A picture says a thousand words, so we want to make sure that the image search is filtered by default," said AltaVista spokeswoman Kristi Kaspar. "We find that quite a few people are using the image search database for school. And what a huge turnoff if we're in an education market with a great product and we couldn't figure out how to provide a family filter."

In another demonstration of potential demand for better image-filtering technology, Lycos deemed the available technology so inadequate that the site's parental controls disable multimedia search altogether.

Some in the image-recognition business see a burgeoning corporate need to identify what kind of images their employees are downloading, while others extend the technology to e-commerce applications that can recognize a product such as an article of clothing and find similar examples for sale elsewhere.

But according to at least one image search provider, actual use has not lived up to perceived demand.

"Image filtering is something where we're investing a lot of (research and development) because we think it's going to be an essential feature," said Tom Wilde, vice president of marketing at Fast Search & Transfer, an Oslo, Norway-based company that is the search technology provider for Lycos.com and other Web portals. "But there's a difference between the perception of growing market demand and what's actually happening. At our All The Web portal, 98.6 percent of our visitors are using the image search without the content filter on."

Testing barriers
Regardless of demand for filtered image searching, several companies are struggling to get a handle on the problem.

Google noted that its image filter is still in beta and said engineers are working to improve the product. But company representatives acknowledged that they face a daunting task.

"It's a real challenge to do this effectively for a lot of different reasons," said Susan Wojcicki, product manager for Google search. "There is a lot of pornography out there on the Web. If all the porn were in one place, we could cut it out. But it's everywhere. Also, the definition of porn is not very clear."

Even with consensus on a pornography definition, technologists have their work cut out for them. Current techniques fall into three categories. The first attempts to filter images by analyzing the text that names and surrounds them on a Web page.

This method runs into several problems. For example, many words that belong to the pornographer's lexicon also appear in birders' dictionaries, guides to animal husbandry and hardware catalogs. As a result, text-based analysis turns up a high proportion of both false positives and false negatives, screening out wren tits and wood screws while admitting more salacious content.
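To see why keyword matching misfires in both directions, consider a minimal Python sketch. The word list and the function name `is_blocked` are hypothetical, invented here for illustration; they are not drawn from Google's or any other company's filter.

```python
import re

# Hypothetical blocklist, assumed for illustration only.
BLOCKED_WORDS = {"tits", "screws", "xxx"}


def is_blocked(filename: str, surrounding_text: str) -> bool:
    """Return True if any blocked word appears in the image's file name
    or in the text that surrounds it on the page."""
    text = f"{filename} {surrounding_text}".lower()
    # Split into alphabetic tokens so punctuation and digits are ignored.
    tokens = re.findall(r"[a-z]+", text)
    return any(token in BLOCKED_WORDS for token in tokens)


# False positives: innocent pages that happen to share vocabulary.
print(is_blocked("wrentit.jpg", "Photo of wren tits at the feeder"))  # True
print(is_blocked("hardware.jpg", "Box of wood screws"))               # True

# False negative: the filter never looks at the picture, so an explicit
# image with a bland name and caption would pass untouched.
print(is_blocked("img_0042.jpg", "Gallery update"))                   # False
```

The check sees only words, so an explicit picture with a neutral file name and caption passes, while a bird photo does not.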

More problems with the text-based approach accompany foreign-language pornography. For now, the Google filter works only on English-language pages.

After text filtering, the second avenue of attack screens out images gleaned from blacklisted Web addresses where pornography is deemed likely to turn up.

But pornography has proved too fast-moving a target for such lists to catch.
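A blacklist check of this kind can be sketched in a few lines of Python. The host names below are placeholders and the function `is_blacklisted` is an assumption made for illustration; no vendor's actual list or code is shown.

```python
from urllib.parse import urlparse

# Hypothetical blacklist; ".example" names are reserved placeholders.
BLACKLISTED_HOSTS = {"adult-site.example"}


def is_blacklisted(image_url: str) -> bool:
    """Return True if the image is served from a blacklisted host
    or from any subdomain of one."""
    host = urlparse(image_url).hostname or ""
    return any(
        host == blocked or host.endswith("." + blocked)
        for blocked in BLACKLISTED_HOSTS
    )


# A listed address, including its subdomains, is caught.
print(is_blacklisted("http://www.adult-site.example/a.jpg"))   # True

# The same content moved to a new address passes until the list is updated.
print(is_blacklisted("http://adult-site-2.example/a.jpg"))     # False
```

The lookup itself is cheap. The weakness is the list: once a site changes its address, the check passes it until someone adds the new one.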

"Most of the firewalls have lists of URLs, but porno sites change their URLs regularly," said Bill Armitage, chief executive of Bulldozer Software, a Clinton, Mass.-based image-indexing and search technology provider that operates the Diggit search engine. "Those lists are always out of date. At any given time they're only 60 to 80 percent accurate. The remaining 40 to 20 percent of the time, you need another filtering mechanism to keep those things from coming in."

For that extra layer of protection, many search engines are pinning their hopes on the third and most complex method, which analyzes the image itself for "flesh" tones and body shapes. But this method returns its own share of false negatives--letting pornography in--and false positives, blocking more innocuous images.
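A rough sense of how "flesh" tone analysis works, and why it misjudges, comes from a minimal Python sketch. It is a crude rule of thumb, not the method used by any company named here: the RGB thresholds, the 40 percent cutoff and the function names are all illustrative assumptions, and body-shape analysis is left out entirely.

```python
from typing import Iterable, Tuple

Pixel = Tuple[int, int, int]  # (red, green, blue), each 0-255


def looks_like_skin(pixel: Pixel) -> bool:
    """Crude RGB test for a skin-like color. Thresholds are assumptions."""
    r, g, b = pixel
    return (
        r > 95 and g > 40 and b > 20
        and max(pixel) - min(pixel) > 15
        and abs(r - g) > 15
        and r > g and r > b
    )


def skin_fraction(pixels: Iterable[Pixel]) -> float:
    """Share of pixels that pass the skin test; 0.0 for an empty image."""
    pixels = list(pixels)
    if not pixels:
        return 0.0
    return sum(looks_like_skin(p) for p in pixels) / len(pixels)


def flag_image(pixels: Iterable[Pixel], threshold: float = 0.4) -> bool:
    """Flag the image if at least `threshold` of its pixels look like skin."""
    return skin_fraction(pixels) >= threshold


light_skin = (224, 172, 105)
blue_sky = (30, 60, 200)
dark_brown = (80, 50, 40)

# An image that is 60 percent skin-colored is flagged, whatever it depicts.
print(flag_image([light_skin] * 6 + [blue_sky] * 4))  # True

# The fixed thresholds reject this darker sample outright.
print(looks_like_skin(dark_brown))                     # False
```

Because the test measures only how much of the frame is skin-colored, an innocent close-up is flagged as readily as an explicit one. Every pixel of every image must also be examined, so the cost grows with the size of the index.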

"I'll tell you what slips through--baby pictures slip through," said J.J. Wallia, head of sales and business development for LookThatUp, a Paris-based company with offices in Burlingame, Calif. "That's a false positive. Babies tend to be showing a lot of skin. This is something the industry has just not been able to get around."

Perhaps more damning than the occasional excluded infant is the toll that image analysis exacts on central processing units (CPUs).

"The state of the art on image searching is such that there is no surefire pornography detection available," said Fast Search & Transfer's Wilde. "The big search engines have not yet done that because it's not scalable enough to keep up with the growth of the Internet. It's incredibly CPU-intensive to do image processing. We have 70 million images in our index. The image detection software that's available now gets absolutely crushed by that."

Wilde estimates that the image recognition industry is between six and 12 months away from providing an adequate product.

Even then, he warns, problems will remain.

"If you do some sort of flesh detector, what color is flesh?" Wilde asked rhetorically. "It's really that complex. And then what's pornographic? You have different sensitivities, especially internationally. Then there's hate, weapons and violence. It's a really, really difficult problem to solve."