Robots learn to cook by watching YouTube

When it comes to learning how to cook, it turns out that robots may not be so different from humans after all... or are they?

John T. Consoli, UMD

When it comes to teaching robots how to do things, there are some key differences from teaching a human. A human knows what you mean when you say "I need a cup". A robot needs to be taught that this means it has to turn around, go to the cupboard, open it, take out the cup, close the cupboard, turn back around, return to you, manoeuvre the cup over the bench, and release the cup.

This is one of the key parts of figuring out machine learning: How can you program a robot so that it can intuit that a plastic cup, a glass and a mug may all be classified under the general term "cup"? How can you design a robot that is able to teach itself?

One way, as researchers at the University of Maryland Institute for Advanced Computer Studies are finding out, is YouTube. More specifically, cooking tutorials on YouTube. By watching these videos, robots are able to learn the complicated series of grasping and manipulation motions required for cooking by observing what humans do on the Internet.

"We chose cooking videos because everyone has done it," said UMD professor of computer science and director of the UMIACS Computer Vision Lab Yiannis Aloimonos. "But cooking is complex in terms of manipulation, the steps involved and the tools you use. If you want to cut a cucumber, for example, you need to grab the knife, move it into place, make the cut and observe the results to make sure you did them properly."

The robot uses several key systems to learn from YouTube videos. Computer vision, with two different recognition systems, allows the robot to visually process how the presenter grabs something; artificial intelligence processes that information; and language parsing helps it understand spoken commands and translate them into actions.

In this way, the robot can gather individual steps from various videos and assign them rules according to its programming, putting them together in the correct order.

"We are trying to create a technology so that robots eventually can interact with humans," said UMIACS associate research scientist Cornelia Fermüller. "For that, we need tools so that the robots can pick up a human's actions and track them in real time. We are interested in understanding all of these components. How is an action performed by humans? How is it perceived by humans? What are the cognitive processes behind it?"

The difference between their research and previous efforts, the team said, is that they are concentrating on the goal, not the steps. The robot can draw on its databank of actions and string them together to achieve the goal, rather than copying a series of actions verbatim, step by step.
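To make the idea of working toward a goal rather than replaying steps concrete, here is a minimal sketch in Python. It is not the UMD team's method: the action names, the preconditions and effects, and the breadth-first search are all assumptions made purely for illustration. It only shows how a small databank of actions could be strung together until a goal is reached.

```python
from collections import deque

# Hypothetical databank: action name -> (preconditions, effects).
# These entries are invented for illustration, not taken from the paper.
ACTIONS = {
    "grasp_knife": (frozenset(), frozenset({"holding_knife"})),
    "place_cucumber": (frozenset(), frozenset({"cucumber_on_board"})),
    "cut_cucumber": (
        frozenset({"holding_knife", "cucumber_on_board"}),
        frozenset({"cucumber_sliced"}),
    ),
}


def plan(goal, start=frozenset()):
    """Breadth-first search for a shortest action sequence reaching the goal.

    Returns a list of action names, or None if the goal is unreachable.
    """
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, steps = queue.popleft()
        if goal <= state:  # every goal fact already holds
            return steps
        for name, (preconditions, effects) in ACTIONS.items():
            if preconditions <= state:  # action is applicable in this state
                next_state = state | effects
                if next_state not in seen:
                    seen.add(next_state)
                    queue.append((next_state, steps + [name]))
    return None


steps = plan(frozenset({"cucumber_sliced"}))
print(steps)
```

Asked for a sliced cucumber, the sketch works out for itself that the knife must be grasped and the cucumber placed before the cut can happen; nobody hands it the order of steps.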

It's a system that apparently works. According to the team's paper, the grasping recognition module had an average precision of 77 percent, and an average recall rate of 76 percent. For the object recognition module, the robot achieved an average precision of 93 percent and an average recall of 93 percent. Overall, recognition accuracy on objects was 79 percent, grasping was 91 percent, and predicted actions was 83 percent. The drop in object recognition accuracy was because the robot had not been trained on some objects, such as tofu.
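For readers unfamiliar with the terms: precision is the share of the robot's detections that were correct, and recall is the share of the real instances that it found. The short Python sketch below shows the arithmetic. The counts are made up to produce figures in the same range as those quoted and are not taken from the team's paper.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) computed from raw detection counts."""
    predicted = true_positives + false_positives  # everything the system flagged
    actual = true_positives + false_negatives     # everything that was really there
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


# Hypothetical counts, chosen only to illustrate the calculation.
precision, recall = precision_recall(
    true_positives=77, false_positives=23, false_negatives=24
)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```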

"By having flexible robots, we're contributing to the next phase of automation. This will be the next industrial revolution," said Aloimonos. "We will have smart manufacturing environments and completely automated warehouses. It would be great to use autonomous robots for dangerous work -- to defuse bombs and clean up nuclear disasters such as the Fukushima event. We have demonstrated that it is possible for humanoid robots to do our human jobs."

The team will be presenting their research at the Association for the Advancement of Artificial Intelligence Conference in Austin, Texas, on January 29, 2015.