Gates still finding his voice

Following launch of corporate telephony software, Microsoft's chairman discusses how speech recognition has made inroads and where it has yet to go.

SAN FRANCISCO--Bill Gates has been saying for years that one day soon we will use handwriting, voice and touch to control our computers.

He's still saying that. In an interview with CNET, Gates talks about some of the ways that speech recognition has already made inroads and discusses some of the places it will eventually go.

Following the launch of Microsoft's new corporate telephony software, Gates discussed why the business phone remained the same for so long and how much it can change once it is made part of the same network as the PC. Gates also talked about the possibilities of touch-screen computing, noting how popular the idea of multitouch has been, both on Microsoft's tabletop computer, Surface, and on the iPhone.

Although he plans to step away from full-time work at Microsoft next year, Gates has said he will keep a few key projects under his purview and suggested the natural language interface push is one he'll probably keep working on. Search and the future of Office are also on the short list.

Q: When did you really first see the possibilities of voice? Was there a real early demo you saw years ago that sort of--you saw it and could really see the possibilities?
Gates: Well, certainly the idea that computers should deal with voice has been around a long time. It's kind of a natural way to communicate. In the 1970s, DARPA was funding people, including people at Harvard, to do speech recognition. And so people kind of thought, hey, this should be easy to do. The dream of computers understanding voice goes way back. And the dream that the data network and the voice network would be one and the same goes way back as well.

Microsoft early on took it that, hey, the magic of software would come to bear on both--not just data networks, but also voice networks and video networks, and we got very involved in that. The real surprise to us, frankly, was that because that world was essentially satisfactory, people were so unwilling to take a risk to move, particularly (moving) business phone calls over onto a new platform.

These PBXs (the private branch exchange systems businesses use to manage phone calls) that are really--they're just computers--have existed alongside the normal infrastructure. Their wiring has stayed there, their directory, their server piece. And so we've been patiently sort of investing in this. In fact, in 1999 we got our first large-scale voice, PBX-type work under way.

And so I assume at that point you thought it was going to happen sooner?
Gates: As we take the magic of software to new things, it's OK to be too early. We don't want to be in too late. And so we saw that the pieces were starting to come together. And so it made sense for us to invest. We wanted to be there, particularly as Exchange and Outlook and Office had gotten so strong, you know, people used us to do everything but the telephony piece. The idea that, OK, now we should encompass telephony and do that kind of sat there as a clear, big opportunity for us.

The thing that's happened over the last eight years is this willingness that we now have enough customers who have had very good experiences using Internet transport, bringing the PC into the picture.

With speech recognition, one of the ideas is that there are some applications where it can pay off, even if it is not getting 100 percent recognition. Is finding some of those areas one of the keys to speech recognition being mainstream?
Gates: That's right. Remember, the stuff we're doing with unified communications, speech recognition is not actually a very key element of what goes on. There are some aspects of it. For example, when you're doing audio conferencing in our world, we can tell you who's speaking. And that's very frustrating today in traditional audio conferencing that you don't know who's come and gone, and somebody can speak up and you don't know who that is.

Or with RoundTable (Microsoft's 360-degree video conferencing camera), we use video and audio clues to tell who's speaking and bringing the focus on that. And you always have the full room view at the bottom, but you have that zoomed-in view as well. And so, you know, if it gets it slightly wrong, you can look at the full-room view and see exactly what's going on. And just like if the cameraman was focusing on something different you were interested in, well, the wide view takes care of that.

When you want to search something (in a meeting) if a word sounds like one of three things, for the search case, you can just index all three. And the fact that you might get some false positives, that is, when you do a search, you might get some part of the speech where a similar sounding word was being used, it's not that big a deal. You'll just look at it, skip past it. And so not being perfect is not a huge problem.

And I imagine that's going to be a huge change in video search, for example. Today when we have video searches, you are basically searching keywords of the Internet page that surrounds the video, the description, that sort of thing. When we start using voice recognition to search within the videos, we'll have a much more powerful experience, right?
Gates: Yeah, that will help a lot. Microsoft Research has some amazing demos around that. In terms of broadcast videos, of course, there's the requirement that there be the text annotation. So if you have that, you actually have the speech-to-text that has been done for the deaf listener, anybody who wants the captioning-type capability. So there's a lot of video out there where if you ingest it in the right way, that's available. For the bottoms-up video, or just a meeting you have in the business, then you're relying on the speech recognition software to make it easy to navigate.
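
The approach Gates describes for meeting search, indexing every candidate word when the recognizer is unsure, might look roughly like the following. This is only an illustrative sketch, not Microsoft's implementation: the `segments` format, the function names `build_index` and `search`, and the sample words and timestamps are all hypothetical.

```python
from collections import defaultdict


def build_index(segments):
    """Map each candidate word to the timestamps where it may have been spoken.

    `segments` is a hypothetical recognizer output: a list of
    (timestamp_seconds, [candidate_words]) pairs, with several guesses
    wherever the recognizer could not settle on one word.
    """
    index = defaultdict(set)
    for timestamp, candidates in segments:
        for word in candidates:
            # Index every guess, not just the most likely one.
            index[word.lower()].add(timestamp)
    return index


def search(index, query):
    """Return the sorted timestamps where the query word is a candidate."""
    return sorted(index.get(query.lower(), ()))


# Hypothetical meeting audio: the recognizer is unsure at 12.5s and 40.0s.
segments = [
    (12.5, ["budget", "gadget", "bucket"]),
    (40.0, ["gadget", "budget"]),
    (71.2, ["schedule"]),
]

index = build_index(segments)
hits = search(index, "budget")
print(hits)
```

A search for "budget" returns both uncertain moments. One of them may be a false positive, which the listener can simply skip past, as Gates notes.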

What are some of the areas where you see voice going that people aren't necessarily thinking about today?
Gates: To me, voice is in the broad realm of natural interface. And natural interface is (the notion of) screens everywhere--screen in your desk, screen in your tables, screen on your walls, no more white boards, touching, which is like Surface, where you can manipulate things. It's a pen so you can have ink wherever you want. You know, pull up an article, write a little note on it and get it sent off to a friend.

The speech recognition comes into it--all these things about natural interface are coming to the fore, and they are probably the thing that's most underestimated right now about the digital revolution. People kind of gasp when they see how touch works on Surface, when they touch their iPhone then, "Ooooh, wow," you know, that's just such a natural thing.

When voice recognition is used in the right way--let's say you're in the car and you want to pick somebody to call--that's improved very dramatically, or speech output, text to speech, these things have gotten very good.

You talked about different natural language interfaces. You know, with multitouch, it seems to have really captured people's imaginations, both with what you guys have shown with Surface, certainly with the iPhone. Voice seems to be a little slower in terms of speech recognition as a mainstream computer interface.
Gates: Well, that's fair. Voice recognition is a harder thing. There are certainly tons of people, and I mean millions, who for some reason, the keyboard's not attractive to them. Either they have repetitive stress injury, or they're in a work environment where they're doing something else with their hands, where they've taken the time to learn the software and adapt to the software and gone through the training process there. And they love it. They can't believe other people don't use it.

For the rest of us, the keyboard has worked so well that we are even getting the keyboard into phones. I think voice search on the phone is one of those applications that would really drive it forward. I mean, why should I have to try and type something in? I've got a phone, I've got a talk button; so that's one of the areas we're betting on.

You guys built a pretty significant voice recognition engine into Vista. It hardly gets talked about. Are you surprised that some of the things you did in Vista aren't getting more attention?
Gates: Well, when you sell a product to hundreds of millions of users, there are features that millions of users love that you can call an obscure feature because, percentage wise, it's not very many. You know, Butler Lampson, one of our great researchers who has done great work going all the way back to his days at Xerox, was just sending me mail about how fantastic the improvements in the speech stuff are in Vista and, you know, we're hard at work on the next version of Windows. We're going to take this speech stuff even further.

What about in the developing world? I imagine natural language input, you know, particularly for people who've never used a computer, has some really interesting applications.
Gates: I wouldn't go too far on that because they're not used to what the dialogue should be like, and in most of those places, the cost of labor is low enough that you (can) have another person on the other end of the connection or talking to them directly. But, yeah, it should work for different languages. It's particularly interesting for Japanese and Chinese where the keyboard is not as natural as it is for languages with modest-sized alphabets. And so we do see ink and voice catching on there.

There was a demo recently where there was a challenge about typists compared with voice recognition, and the voice recognition won out by quite a bit. And so there's a lot that can be done pioneering off of the demand that will come out of those markets.

You've talked a fair amount about taking on just a few projects when you step away from full-time work. Is natural language input and voice one of those areas you think you'll be spending time on?
Gates: Yeah. I'd say, broadly, the whole natural interface thing. Big screens, touch, ink, speech, that's something that I think, along with cloud computing, is the next big change in how we think about software and how it becomes more basic. And, you know, Ray Ozzie is driving our cloud computing stuff and--way ahead of me, very hands-on all that stuff. Some of the natural interface stuff, I think he and Steve will ask me to sort of keep the energy and vision alive there in a strong way. Some of that will be reading off the screen or the tablet, but the whole natural interface area probably will be one that they'll pick.

Any others that you think you will take on?
Gates: Well, it's hard to say. Search is such a fun area right now. They might pick that. There are some ideas about where the Office software should go--I'm really quite enthused about some things. So I'd say those are the three most likely. And it's only going to be three or four, so--they'll have to decide.