Words to the wise on the Web

Language expert and PARC veteran Geoffrey Nunberg explains why machines are still struggling to make sense of the way people communicate--and how the Internet has people writing more now than ever before.

Sept. 10, 2002 1:01 p.m. PT

Supreme Court Justice Potter Stewart once quipped that though he couldn't define pornography, he knew it when he saw it. Will filtering software ever have it that easy?

Not anytime soon, and not without a lot of human intervention, according to language expert Geoffrey Nunberg. The Internet is too vast and diverse, and the applications too indiscriminate in their quest for the obscene and the pornographic, he says.

But Nunberg, a professor of linguistics at Stanford University--and until last year, a principal scientist at Xerox's legendary Palo Alto Research Center--wouldn't want to do away with the software, so long as people recognize its limitations. Indeed, he's given a portion of his wordsmithing career to developing software that makes sense of written material.

That experience led him to provide expert testimony in a court challenge to the Children's Internet Protection Act. In that case, a federal court ruled in May that Web-filtering technology blocked both too much and too little offensive material.

He's also weighed in on electronic books and other issues of information access, but he's probably best known for his commentaries on language on National Public Radio's program "Fresh Air." Some of those radio essays and other pieces have been collected in the book "The Way We Talk Now," including pieces on chess-playing computers, hackers, emoticons and why this era is getting the software it deserves.

CNET News.com caught up with Nunberg recently to talk about how machines struggle to make sense of the way people write and speak, and how the Internet has people writing more now than ever before.

What was a language specialist doing on staff at PARC?
PARC has a number of people working on natural language technologies--search and retrieval, natural language understanding systems, automatic translations--so it's quite natural that I would be there.

Did you have your hand in any kind of invention that might be well known?
When I was at PARC, I did work on natural language classification systems--systems that could select text according to their genre in addition to their topic, so that they could, for example, tell a newspaper editorial from a news story.

Machines don't seem to be very good at understanding natural language.
It's very hard. People don't appreciate just how difficult it is to understand natural language and what an extraordinary accomplishment ordinary people are performing when they simply have a casual conversation. And that's one of the things that's made it possible for software companies to hype the technology with claims that grossly exceed anything that software is capable of now, or ever.

I was just involved as an expert witness on behalf of the American Library Association in their successful challenge to the Children's Internet Protection Act. That was an act that mandated that libraries that receive these "e-rate" subsidies should use filters for Internet access. And the filters, which work on natural language technology, which try to distinguish pornographic or obscene sites simply by the language that they contain, are a good example of just how hard it is to do this and how inadequate systems are in doing it.

In this case, how did that play out?
One of the wonderful things about the Internet, of course, is that kids--in particular (those) who are reluctant to ask parents about things like drugs and suicide and sex and so on--can find this information on the Web. But these filters were routinely blocking numerous sites of that sort, including very useful, just random things. The Canadian home page of the Discovery Channel was blocked, a Latin music site we looked at was blocked--just because something in the site triggered some goofy filter and because they're software, and software is buggy. So that's a good example of just how bad these things are at trying to reproduce human capacities. That isn't to say that the software can't do very useful things, provided there are people screening it on the other end.

I was going to ask you, is there any hope for the software?
If the FBI is interested in trolling for child pornography sites, it's perfectly reasonable for them to use software like this if they sift through the results to see if they come up with anything that's genuinely pornographic. That's very different from software that just says, "You can see this, you can't see this," and doesn't involve human review of the process. Although these (filtering) companies claim they use human review for all sites, that's just not true. And it couldn't be done, given the size of the Internet.

If you're interested in machine translation--if you want software that translates "Madame Bovary," (or that) gives you an adequate translation of packing instructions on an Italian page, (machine translation) won't be able to do that. But if you receive a letter, or if you want to know whether a hotel in Samarkand has a swimming pool or something like that, then it's often adequate. The translation software, flawed as it is, often is adequate for just getting a gist of what's on the page. So it's extremely useful, and I've worked on the software, and I think it's terrific, but to assume that it ever--not just now, but ever--will reach the stage that people are capable of, I think is a mistake.

What about grammar-checkers?
It's hard enough, as we know, to check spelling, because of the problems of homonyms and so on, distinguishing t-h-e-r-e and t-h-e-i-r, and so on, but spelling is a fairly routine matter. Words are either right or wrong, and there isn't a lot of sophisticated judgment. That's why the ability to spell seems singularly independent of any other intellectual capacity...Spelling is that kind of skill which a fourth-grader can basically master.

Grammar, and writing correct and even effective English, is a skill that is, as any writer knows, one that nobody ever feels they've mastered. But certainly even at a basic level, it requires enormous amounts of discrimination and intelligence. Machines just can't do it. What's happened is that the grammar-checkers have been part of the dumbing-down of grammar because they're not capable of making the kind of discriminations that human editors are, so that all a grammar checker can say is, "Never split an infinitive."

So should we be worried about the language, then? Is there a sort of a global warming threat here?
Technology's just part of it. I don't think these are responsible for the dumbing-down. I think that began earlier. But they play along with the dumbing-down...If people seriously relied on their grammar-checkers, they wouldn't write particularly grammatical prose. And they would wind up following rules, by rote, that weren't made to be followed by rote.

But that's par for the course. I mean, this is very hard to do. Linguists who are working on technology that can do natural language understanding are aware of how enormously difficult it is to do this...The rest of us do it so easily and unconsciously, and we say any idiot can do it, so we think it's easier than playing chess, and that's just wrong.

Is the language changing at all with the new technology?
It depends what you mean when you say "the language." We've added a few new words, but we always do that with new technology. Sailing gave us a whole bunch of new words, railroads, aviation--computers are giving us a bit more of that than they did. People, of course, are fascinated by that. But it's not a big deal in any sense--so we'll have a bunch of new words from technology.

I think there are two more-interesting consequences for language, at least. One is the fact that huge numbers of people are communicating online either via e-mail or discussion lists or forums or Web pages. The number of writers, the proportion of writers to readers in society--which has been growing slowly--has changed enormously in a short period of time. And that's a very interesting difference, it's one of the things that explains the impression that grammar is going downhill. Because you go online, and you see it seems that nobody knows when to (use) an apostrophe...But those are people who never knew when to put an apostrophe on "it's."

The second consequence of that--particularly forums and e-mail and so on--is that the language of public discussion--and blogs are a good example of this--has gone from the kind of high, neutral, public style that's exemplified by the op-ed pages of The New York Times, to something more informal, more colloquial, more conversational, which rests more, in fact, on the norms of middle-class speaking. It's something poised, as it were, between the formal style of official journalism and the informal conversations that we have with one another. And that's a very interesting development, it's a profound development--in one sense, it opens the discussion to a larger number of people. In another sense, it closes the discussion to people who aren't familiar with the implicit norms of that kind of interaction.

So, I think in particular, people who haven't learned to talk around middle-class dinner tables may be more disenfranchised or marginalized by those styles than by the neutral style of the press, which ostensibly is something that's independent of class or background. That's actually a concern. It's one that's very striking when you talk to foreign scientists. I was talking to this French scientist--a physicist, a very smart guy, and he publishes repeatedly in English as any physicist would have to, goes to conferences, reads papers in English. He says, "I don't know what these people are talking about when I go into Usenet..." It's a casual, more colloquial, sometimes slangy English.

To paraphrase the title of an essay of yours, the Internet will always speak English--but not just English.
The Web in particular has redressed the disparities between major and minor languages that were characteristic of the age of print. When the technology first became available, people thought it was going to be the royal road to triumph for the English language. And it's certainly true that lots of Web content is in English, even in these other countries, because every publication is now potentially an international publication.

But what's happened is that smaller languages whose range used to be restricted now have international scope. So Greeks or Persians or Koreans who live in California, who before this could never have any access to the resources in their own language, can now get their newspapers and get that information on the Web. And it has interesting consequences for marginalized parts of the speech community--for instance, the French speakers of the Maghreb, the Russian speakers in parts of Eastern Europe, the Portuguese speakers in parts of Africa. If you look at these communities, particularly these diasporas, they have a contact with their linguistic community that they could never have had in the age of print and broadcast.

You've also given some study to e-books, which people had a lot of hope for, but they haven't taken off. Why is that?
There are two ideas. One is the idea that this technology will replace the book. That's just silly. The book is ideally suited for doing what the book does, particularly if you're talking about the kind of writing that's traditionally done the heavy lifting, culturally speaking--the novel, the history, biography, criticism, that sort of (thing). That sort of stuff is just going to be in books. We'll have electronic versions of them, and those are useful to have, but those won't be the primary means for sustained reading of these things.

If you're thinking of them not as e-books but as portable document readers, then sure, there'll be an enormous amount--any time you want documents, whether you're a doctor walking around the wards wanting access to medical records, or any of a number of different circumstances where you might want access to documents without having access immediately to a networked machine. That's a real win.

In a way, who cares? From the economic point of view these books that we're talking about, novels and so on and so forth, are a tiny part of the number of documents that are available. But I think it became for both sides a kind of test of the cultural significance of the new technology.

You compared PowerPoint slides to the stained glass windows in Gothic cathedrals. Pretty high praise for software that's about mission statements and quarterly numbers.
(Laughing) I almost certainly meant that ironically. But the connection is this: That when we talk about the future of the technology as a medium, particularly for communication, reading and writing, and particularly for communication of that sort--I'm as guilty of this as anybody, because I edited a collection called "The Future of the Book," as if the question is, will the technology replace the book?--what we really should be asking about is not the book, but writing.

And the book, or bibliographic writing as some people have called it, is one mode of writing that's been dominant for the last 300 years maybe, as opposed to epigraphic writing--writing on walls and surfaces and so on--and there's historically been this kind of balance between them. There are periods like classical antiquity or during the Renaissance, when epigraphic writing was really the culturally central form and the book was more marginalized.

Now we've reached the point where the book is sort of the center of our culture, and epigraphy, which does enormous amounts of work in our culture, still tends to be culturally marginalized, whether it's advertising or posters or writing on the sides of walls...It's really been a marginal form. But when you think about the way we talk about the Web, it's much more the model of the epigraphic than the bibliographic; we talk about sites, we talk about "going there," we talk about posting things. And so you can think about this as a kind of new epigraphy. And PowerPoint is an epigraphic form in that sense--it's writing on a wall.

Going back to "blog" for a second, you've said that it's "a syllable whose time has come." Is there something about the word itself?
Well, it's kind of a cute word. It's kind of like a hacker's term, it's like mung or munge or something like that. It's the opposite of a techie term, a way of demystifying the technology...I think that when "blog" comes along, that in a certain sense suggests the diffusion of that sensibility among the larger public...I mean, why say "e-journal" or "cybercommentary"? It's just a blog.