Gesture recognition will be a slam-dunk, startup head says (Q&A)
User interfaces are changing with touch, voice, and -- now -- gestures. EyeSight Mobile CEO Gideon Shmuel believes gesture recognition will transform our expectations of PCs, phones, TVs, and tablets.
If Gideon Shmuel gets his way, you'll soon be waving your arms and pointing your fingers at your TV, phone, tablet, and PC.
As chief executive of EyeSight Mobile Technologies, Shmuel is promoting the idea of gesture recognition, in which sensors detect your body's motions and do things like open an app or change channels. The best example of gesture recognition is Microsoft's Kinect game controller, but EyeSight wants to bring gesture recognition to all the electronic devices in a person's life.
EyeSight, an Israeli company, is in the midst of a dramatic change to electronics interfaces. After years with keyboards and mice, smartphones and tablets have popularized touch screens, and services like Google's Android and Apple's Siri are making voice control a mainstream reality, too. Gesture recognition, like touch and speech, is in many ways a more natural form of interaction.
Shmuel discussed the troubles of Leap Motion, the advantages of graphic acceleration, gesture recognition's new precision, and other aspects of the technology with CNET's Stephen Shankland. Here's a lightly edited transcript of the conversation.
Stephen Shankland: The world learned about gesture recognition through the Nintendo Wii and later the Xbox Kinect game controller. Where has the state of the art gone since then?
Gideon Shmuel: The Wii is using infrared handheld gyroscope-based devices. It's not very precise, so it's very hard to press on a small icon. You can do gross movements, but typing something with that is hard. The Kinect uses something called structured light with active illumination and a depth sensor. The first Kinect used PrimeSense, which was recently sold to Apple. It's cheap, but it needs a lot of light, it uses a lot of energy, and it uses a lot of processing power.
So it's mostly for devices that'll be plugged into a wall?
Correct. The new Kinect is not PrimeSense. Microsoft acquired Canesta a few years ago. They moved from structured light to time-of-flight -- a different method of active illumination, where you have infrared or lasers illuminating the room, then the sensors can see where the objects are in space and track them.
Today, with the technology we developed, we're able to track very small objects, down to a fingertip, at very high accuracy and at pretty good distance. Today we have the ability to track hands or fingers, even at 5 meters, in a living room space, using normal VGA [a low 640x480 resolution] cameras -- it doesn't need expensive sensors or a specific chipset. Obviously getting an active illumination sensor gives a lot more information, and we know how to use that as well. We know how to combine the depth information and what the VGA camera sees to bring an immersive experience to the user.
We do that with very efficient performance. When you're working with a mobile device or TV, you have very limited CPU power you can consume. They're not putting specific hardware in for that, so you're allowed to take very little MHz [processor horsepower] from the device.
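The time-of-flight principle Shmuel mentions can be illustrated with a little arithmetic. The following is a minimal sketch, not a description of how the Kinect or EyeSight's software works: it assumes an idealized sensor that directly times a single light pulse (real sensors often measure phase shift instead), and the function names are hypothetical.

```python
# Idealized time-of-flight arithmetic (illustrative assumption: the sensor
# directly measures the round-trip time of one light pulse).
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def round_trip_time_s(distance_m: float) -> float:
    """Time for light to reach an object at distance_m and return."""
    if distance_m < 0:
        raise ValueError("distance must be non-negative")
    return 2.0 * distance_m / SPEED_OF_LIGHT_M_PER_S


def tof_distance_m(round_trip_s: float) -> float:
    """Distance to an object, given the measured round-trip time."""
    if round_trip_s < 0:
        raise ValueError("round-trip time must be non-negative")
    # The pulse travels out and back, so halve the total path length.
    return SPEED_OF_LIGHT_M_PER_S * round_trip_s / 2.0


if __name__ == "__main__":
    t = round_trip_time_s(5.0)  # a hand 5 meters from the sensor
    print(f"round trip: {t * 1e9:.1f} ns, distance: {tof_distance_m(t):.2f} m")
```

At living-room distances the round trip takes only tens of nanoseconds, which hints at why such sensors need dedicated timing hardware.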
What does finger-scale accuracy at 5 meters' distance get people?
A very easy interaction. With smart TVs, you have so much information, so many applications, all the casual games, browsing. We see new UIs [user interfaces] coming out in these devices that are giving a "wow" experience. When you raise your finger the first time, you go, "Wow." I control iTunes. When my computer is on, I just move my finger a small movement to the right, I shuffle to the next song.
One problem with gesture recognition is precision. If you've got finger-width resolution, what does that let you do? Could you put a virtual keyboard on a screen and type easily?
The reason we are moving from hand detection to finger detection is about ease of use. What we saw in TVs with other technology is that you have to perform a pretty big movement with your arm at shoulder level, which is not convenient for the user. And it's very hard to be precise. We worked hard at finger-level tracking. You control everything by the movement of your wrist. It's very easy to do.
With the accuracy, if you go to YouTube, you can select a song, you can press the icon to maximize to full screen. It's really accurate and down to very small icons. You can use a virtual keyboard, not for really fast typing, but letter by letter is not a problem.
We're working on other solutions. We integrated with 8pen. It's a cool keyboard -- if you move your hand in circles you can type very quickly. We're testing things like Swype and different keying methodologies that can be used with our technology.
When were you founded, and what do you do?
We were founded around 2005 as a small garage kind of idea by the founder who today is the CTO. His idea was to bring a touch screen to mobile devices. We are the only company in the space that started by trying to use machine vision capabilities in a low-power, low-camera-quality device that is on the move. Trying to bring all of that together wasn't easy. Machine vision algorithms are pretty processor-intensive. A small team of engineers tried to solve the problem. There was a lot of trial and error because of all the constraints, but eventually they managed to develop a very strong algorithm that became the foundation of what we do today, which is very diverse.
EyeSight is a software company. If you look at the market of natural user interaction, gestures, and user awareness, it's really divided into hardware and software. One cannot live without the other. You have a variety of sensors, chipsets that will perform calculations for depth map, and software that can interpret what the user is doing. Some software is better than others in terms of reducing noise. I scratched my nose: was that a gesture or not? We work across sensors, from the normal sensors you have today in mobile phones, PCs, and TVs, to stereoscopic, to infrared, to depth sensors. Our software can work on these various input methods.
We have a variety of capabilities to distinguish various things. We can identify directional gestures. We can identify objects like hands and fingers with very high granularity. Even with a normal sensor, I can do pixel-level finger tracking -- if you have an icon to minimize or maximize a window, I can press on such a thing in the air. Or I can grab an icon or object and move it in space. We can identify signs, like a shush with my finger on my lips.
We combine these algorithms per application. If you're in a Metro UI and swipe left and right, it would be different than in a photo application where you can also zoom in and zoom out, or a game where you fly through it with your finger.
Then you license this technology to companies who deliver it?
Yes. Currently we work with tier-one OEMs [original equipment manufacturers like computer makers] and very big chip companies that are embedding our technology into their offerings. We have very flexible algorithms that can sit in an application processor, sit on a chip on a camera, or sit on a GPU. With AMD and ARM, we are their gesture-control vendor running on the GPU, not the CPU, which gives us a lot of extra capabilities.
We are diversifying and starting to develop downloadable apps and work with developers, opening an SDK [software developer kit] to create a greater buzz and ecosystem.
We're shipping with major vendors, like Lenovo with the X1 Carbon and tablets and Toshiba PCs. Philips TVs are coming out with us. HiSense, Oppo, an up-and-coming phone maker.
Who are your competitors?
We don't see hardware as a competitor, even though sensor makers are trying to develop some middleware. In software, there are a couple companies. There's another Israeli company called PointGrab that I'd say is our main competitor.
When you have three people on the couch, is there a problem figuring out who can change the channel or who is the person playing a game?
Because we came from mobile and worked so much on distinguishing between what is background noise and what is a planned action, we use it in all our solutions. In the TV and living room space, the person who raises a hand or finger first gets control. We are very strict in terms of how you activate the system, in terms of getting control, and how many people can get control.
I'd worry then that it's not flexible enough when you do want to change. What if I fall asleep on the couch and my son wants to control a TV?
If you lower your hand and he raises his hand or finger, he will get control.
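The hand-off rule described in this exchange (the first person to raise a hand gets control and keeps it until lowering it) can be sketched as a tiny state machine. This is a minimal illustration under stated assumptions, not EyeSight's implementation: the `ControlArbiter` class, its method names, and the single-controller limit are all hypothetical.

```python
from typing import Optional


class ControlArbiter:
    """Hypothetical arbiter: one viewer at a time controls the TV."""

    def __init__(self) -> None:
        # Nobody holds control until someone raises a hand.
        self.controller: Optional[str] = None

    def hand_raised(self, person: str) -> bool:
        """Grant control if it is free; return True if person now has it."""
        if self.controller is None:
            self.controller = person
        return self.controller == person

    def hand_lowered(self, person: str) -> None:
        """Release control, but only if the current controller lowers a hand."""
        if self.controller == person:
            self.controller = None


if __name__ == "__main__":
    arbiter = ControlArbiter()
    print(arbiter.hand_raised("parent"))  # parent raises a hand first
    print(arbiter.hand_raised("son"))     # son is ignored while parent holds control
    arbiter.hand_lowered("parent")        # parent lowers the hand (or falls asleep)
    print(arbiter.hand_raised("son"))     # son can now take over
```

A real system would also need a timeout or tracking-loss rule for a controller who never explicitly lowers a hand; the interview does not say how that case is handled.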
I have the Waze app for traffic, and supposedly you can get to its interface by using gesture recognition, but I haven't got it to work yet. Can you give examples of what people could do with gesture recognition on a phone?
Sure. It will evolve. What you said about Waze is correct. The Galaxy S4 has proximity-sensor-based directional gestures, which means you have to be very near to the phone and use a full-hand movement. It's not very comfortable or intuitive. We have directional movement -- up, down, left, right, select -- and you can be 40cm away from the device and do very easy, very flexible movements. You can flick with your finger.
Today we are shipping in phones. There's no need to constantly hold your phone if a call comes in and you want to answer or decline it, or you're listening to music and want to flip between songs, or in a coffee shop and want to browse. You can do all this media consumption very intuitively.
Tracking of hands and fingers is working as well. You can take a tablet and turn it into a Kinect. I can play Fruit Ninja from 2 meters away. For almost any game, we can turn a device into a Kinect.
OEMs are thinking not just about connecting gestures to existing apps and UIs. They're starting to think about a gesture mode. When you put a phone into gesture mode, how do you change the look and feel of the application if you're in the car or in the kitchen? There will be apps soon coming out with our technology for automotive environments for Android.
The Leap Motion controller, a gesture recognition sensor that plugs into a PC's USB port, has been around for a while, but I don't see people flocking to it. What's holding gesture recognition back? Why, if it's so great, has there been no real "aha!" moment?
Leap for me is a short-range, PC-only Kinect. I switch it on when I want to play a specific game. It's not something I use as part of my environment. The promise of Leap was huge. They did a lot of marketing and PR, but if you read blogs or see the amount of devices on eBay for $30, the delivery didn't meet expectations. This is why people didn't say "aha!" There aren't enough apps yet because they're working on the basis of an app store. It takes a lot of time and money to build an app store with a lot of content.
We're looking at the world differently. We believe in always-on. The system should be clever enough to identify if I'm performing a gesture or not. With my PC open, I can drink my cup of coffee, scratch my nose, or move my hands while talking with my friends, and the system will be able to identify if it's noise or not noise. When I show my finger or do a swipe, it will identify me and know to distinguish this from the noise. And we believe in becoming part of the user interface. If you look at the Lenovo or Toshiba devices, they have a mouse, they have a keyboard, and they have gestures that are always on. It's not something you have to switch on in a specific app or switch off.
I'm curious about your partnership with ARM and AMD. What does running on a graphics processing unit [GPU] get you? GPUs are good at processing tasks that operate in parallel.
Basically GPUs are great for pixels -- either showing pixels in a graphical manner or processing pixels that come in through the imaging pipeline. We are real-time, so we work at 30 frames per second. When I move my finger across the screen to slice the fruit, I want to feel almost no latency [lagging response]. With the CPU, we can do that, but with the GPU, because of parallel processing of pixels, we are able to get very high numbers of frames per second. We can offload processing from the CPU, and we bring new algorithms for noise reduction.
We showed work with AMD at CES: stereoscopic depth mapping and gesture recognition in the GPU. All depth maps out there are using silicon -- a specific brain to process it, because it's too heavy for the CPU. We're doing it on the GPU, giving an accurate depth map, and we're able to track a finger in the depth map using only the GPU. We're not making OEMs pay extra money for extra silicon and expensive sensors, just normal stereoscopic cameras. That's a huge accomplishment.
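For readers unfamiliar with stereoscopic depth mapping, the underlying geometry can be sketched with the textbook pinhole-camera relation: depth equals focal length times camera baseline divided by disparity. This is a generic illustration, not EyeSight's algorithm, and the focal length, baseline, and disparity values below are made-up assumptions.

```python
def stereo_depth_m(disparity_px: float, focal_length_px: float, baseline_m: float) -> float:
    """Depth of a point seen by two parallel cameras (textbook pinhole model).

    disparity_px:    horizontal shift of the point between left and right images
    focal_length_px: camera focal length expressed in pixels (assumed value)
    baseline_m:      distance between the two camera centers (assumed value)
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a point in front of the cameras")
    return focal_length_px * baseline_m / disparity_px


if __name__ == "__main__":
    # Hypothetical rig: 500-pixel focal length, cameras 6 cm apart.
    for disparity in (30.0, 15.0, 10.0):
        depth = stereo_depth_m(disparity, focal_length_px=500.0, baseline_m=0.06)
        print(f"disparity {disparity:4.1f} px -> depth {depth:.2f} m")
```

Because depth is inversely proportional to disparity, the hard part in practice is finding the matching pixel in the other image for every pixel. That is the per-pixel, highly parallel work the interview describes moving onto the GPU.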
Depth mapping is analyzing a video stream to figure out how far away subjects in the scene are from the camera, correct? How does that improve gesture recognition?
If I have my finger in space and I want to poke forward to select something, you want to see the depth of the finger in space as it's moving forward. Even with a single normal VGA sensor, we're able to bring accurate depth, like half a centimeter depth in terms of movement. But with a real depth map, we can bring even higher accuracy and better robustness.
At what distance can you get half-centimeter accuracy? How far away from the camera?
About 2 meters.
One of the problems with multitouch gestures on PCs is that beyond some gestures like pinch-to-zoom, there's no set vocabulary of gestures. A three-finger swipe might do totally different things on different PCs and nothing on a mobile phone or tablet. Do we have that problem with gestures? Does there need to be some sort of standardized vocabulary of what this or that gesture means?
Initially we saw OEMs say, "I want my gestures to be different because I don't want to look like Lenovo." That's different now. We're starting to see a common denominator.
Because we've learned so much, we're now recommending a language that's pretty simple for the user. I don't want to call it a standard, but hopefully eventually it will become almost like a standard. The secret is to keep it simple. You can do very complex multifinger gestures. It looks great on a demo screen, but how do you control a normal application with that? My little finger is now in the scene -- what does it do to the application? I didn't want it to cause a false detection. There's a lot of complexity between a nice cool demo and how it impacts an application or a user interface.