Yesterday, Google AI introduced its latest achievement on its blog: TossingBot, a picking robot that learns to grasp objects in real, unstructured settings and throw them to designated locations outside its natural reach. AI Technology Review has compiled the post as follows.
Although considerable progress has been made in object grasping, visual adaptation, and learning from real-world experience, robotic manipulation still requires careful thought about how to grasp, handle, and place objects, especially in unstructured settings. Consider the robot that won first place in the stowing task of the Amazon Robotics Challenge.
It is an impressive system with many design features that keep objects from being dropped because of unforeseen dynamics, from steady, deliberate motion trajectories to mechanical clamps that limit an object's momentum.
Like other robots, it was designed to tolerate the dynamics of an unstructured world. The question is whether, instead of merely tolerating dynamics, robots can learn to exploit them and develop a physical "intuition" that lets them perform their assigned tasks more efficiently. Doing so could improve a robot's mobility and let it master more complex motor skills, such as tossing, sliding, spinning, swinging, or catching. These skills have many potential applications, for example debris-clearing robots that must work efficiently in disaster scenes, where every second counts.
To explore this concept further, we worked with researchers from Princeton University, Columbia University, and the Massachusetts Institute of Technology to develop TossingBot: a picking robot that learns to grasp objects in real, unstructured settings and throw them to designated locations outside its natural reach. By learning to throw, TossingBot achieves twice the picking speed of previous systems and twice the effective placement range. TossingBot learns its grasping and throwing policies with an end-to-end neural network that maps visual observations (RGB-D images) to the control parameters of motion primitives. By tracking where objects land with an overhead camera, TossingBot improves itself gradually through self-supervision.
Throwing is a very difficult task because it depends on many factors, from the way the object is picked up (the "pre-throw conditions") to the physical properties of the object (such as mass, friction, and aerodynamics). For example, if you grab a screwdriver by the handle, close to its center of mass, and throw it, it will land closer to you than if you grab it by the metal tip, in which case it swings forward and lands farther away. Whatever the grasp, throwing a screwdriver is also very different from throwing a table tennis ball, which lands closer to you because of air resistance. It is almost impossible to hand-design a solution that handles all of these factors properly.
Throwing depends on many factors, from how the object is picked up to its properties and dynamics
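The effect of air resistance on a light ball can be illustrated with a rough point-mass simulation. The sketch below is not part of TossingBot. It assumes quadratic drag, a drag coefficient of about 0.47 for a sphere, an air density of 1.2 kg/m^3, and a standard 40 mm, 2.7 g table tennis ball; the function name `landing_distance` and all of its parameters are hypothetical.

```python
import math

G = 9.81  # gravitational acceleration, m/s^2


def landing_distance(speed, angle_deg, mass, drag_coeff, area,
                     air_density=1.2, dt=1e-4):
    """Horizontal distance (m) travelled by a point mass thrown from the ground.

    Uses semi-implicit Euler integration with quadratic air drag.
    All parameter values are illustrative assumptions.
    """
    angle = math.radians(angle_deg)
    x, y = 0.0, 0.0
    vx, vy = speed * math.cos(angle), speed * math.sin(angle)
    # Drag deceleration per unit of squared speed (1/m).
    k = 0.5 * air_density * drag_coeff * area / mass
    while True:
        v = math.hypot(vx, vy)
        vx += (-k * v * vx) * dt
        vy += (-G - k * v * vy) * dt
        x_new, y_new = x + vx * dt, y + vy * dt
        if y_new < 0.0:
            # Linearly interpolate the last step to the ground plane y = 0.
            frac = y / (y - y_new)
            return x + frac * (x_new - x)
        x, y = x_new, y_new


if __name__ == "__main__":
    ball_area = math.pi * 0.02 ** 2  # cross-section of a 40 mm ball
    no_drag = landing_distance(4.0, 45.0, 0.0027, 0.0, ball_area)
    with_drag = landing_distance(4.0, 45.0, 0.0027, 0.47, ball_area)
    print(f"without drag: {no_drag:.2f} m, with drag: {with_drag:.2f} m")
```

With identical release conditions, the simulated ball with drag lands short of the drag-free estimate, which is the kind of object-specific deviation a hand-designed thrower would have to model case by case.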
With deep learning, our robots can learn from experience instead of relying on manual, case-by-case engineering. We have previously shown that our robots can learn to push and grasp a variety of objects. Throwing objects accurately, however, requires a deeper understanding of projectile physics. Trying to acquire that knowledge purely through trial and error is not only time-consuming and expensive, but also generally fails to carry over beyond very specific, carefully set up training scenarios.
The combination of physics and deep learning
TossingBot learns to throw by integrating basic physics with deep learning, which lets it train quickly and generalize to new scenarios. Physics provides prior models of how the world works, and we can use these models to develop an initial controller. For throwing, for example, we can use ballistics to estimate the throwing velocity needed for an object to land at the target position. A neural network then predicts an adjustment on top of that physics-based estimate to compensate for unknown dynamics, as well as the noise and variability of the real world. We call this hybrid scheme residual physics, and it enables TossingBot to achieve 85% throwing accuracy.
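The idea of a physics estimate plus a learned correction can be sketched in a few lines. This is a toy illustration under stated assumptions, not TossingBot's actual controller: the 45-degree release angle, the hypothetical "real world" in which throws fall 10% short, and the scalar residual updated from the observed landing error (standing in for the neural network's prediction) are all assumptions made for this example.

```python
import math

G = 9.81                    # gravitational acceleration, m/s^2
ANGLE = math.radians(45.0)  # assumed fixed release angle


def ballistic_speed(distance, release_height):
    """Drag-free release speed that lands `distance` metres ahead,
    `release_height` metres below the release point."""
    drop = distance * math.tan(ANGLE) + release_height
    if distance <= 0 or drop <= 0:
        raise ValueError("target not reachable at this release angle")
    return math.sqrt(G * distance ** 2 / (2 * math.cos(ANGLE) ** 2 * drop))


def drag_free_landing(speed, release_height):
    """Landing distance predicted by ideal projectile motion."""
    vx, vy = speed * math.cos(ANGLE), speed * math.sin(ANGLE)
    flight_time = (vy + math.sqrt(vy * vy + 2 * G * release_height)) / G
    return vx * flight_time


def toy_real_landing(speed, release_height):
    """Hypothetical real world: every throw falls 10% short of the ideal."""
    return 0.9 * drag_free_landing(speed, release_height)


def learn_residual(distance, release_height, trials=20, gain=1.0):
    """Adjust a scalar speed residual from observed landing errors.

    A stand-in for the learned network: the final command is always
    physics estimate + residual.
    """
    v_physics = ballistic_speed(distance, release_height)
    residual = 0.0
    for _ in range(trials):
        landed = toy_real_landing(v_physics + residual, release_height)
        residual += gain * (distance - landed)
    return v_physics, residual


if __name__ == "__main__":
    v, dv = learn_residual(distance=1.5, release_height=0.3)
    print(f"physics estimate: {v:.3f} m/s, learned residual: {dv:+.3f} m/s")
```

Because the physics term already places the throw near the target, the correction only has to account for what the ideal model misses, which is one plausible reason such a hybrid can be trained quickly.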
At the start of training, with randomly initialized weights, TossingBot repeatedly attempts poor grasps. Over time, it gradually learns better ways to grasp objects while improving its throwing at the same time. Along the way, the robot occasionally throws objects at velocities it has never tried before to explore what happens. When the bin is emptied, TossingBot lifts the boxes so that the objects slide back into the bin, which keeps human intervention during training to a minimum. After 10,000 grasp-and-throw attempts (about 14 hours of training), it achieves 85% throwing accuracy and 87% grasping reliability in cluttered environments.
Generalizing to new scenarios
By integrating physics and deep learning, TossingBot can adapt quickly to positions and objects it has never seen before. For example, after training on simply shaped objects such as wooden blocks, balls, and markers, it copes well with new objects such as plastic fruit, decorative items, and office supplies. When grasping and throwing new objects, TossingBot may perform only modestly at first. After a few hundred training steps (one or two hours), however, it adapts and reaches the same performance as on the training objects. We found that combining physics and deep learning through residual physics performs better than the baseline alternatives. We even tried the task ourselves and were surprised to find that TossingBot was more accurate than any of our engineers! That said, we have not yet tested it against people with real athletic talent.
TossingBot's capabilities extend readily to new objects, and it throws more accurately than the average Google employee
We also tested whether the learned policy generalizes to new target positions that never appeared during training. To do so, we trained the model on one set of boxes and then tested it on another set with different landing areas. In this setting, the benefit of residual physics for throwing is very clear: the initial ballistic estimate of the throwing velocity lets the system extrapolate to new target positions, while the residuals adjust that estimate to account for the varying properties of real-world objects. This is in sharp contrast to the baseline that uses deep learning alone, which can handle only the target positions seen during training.
TossingBot uses residual physics to throw objects to target positions not seen during training
Semantics that emerge from interaction
To understand what TossingBot learns, we placed several objects in the bin, captured images of them, and fed the images into TossingBot's trained neural network to extract intermediate pixel-wise deep features. We cluster the features by similarity and visualize the nearest neighbors as a heat map (the hotter the region, the more similar it is in feature space), which lets us accurately locate every table tennis ball in the scene. Even though the orange blocks are similar in color to the table tennis balls, their features are different enough for TossingBot to tell them apart. In the same way, we can use the extracted features to locate all the markers, which have similar shapes and masses but different colors. These observations suggest that TossingBot relies more on geometric cues (such as shape) when learning to grasp and throw. The learned features may also reflect higher-level attributes (such as physical properties) that determine how an object should be thrown.
Without explicit supervision, TossingBot learns deep features that distinguish object categories
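The heat-map visualization described above can be sketched as a nearest-neighbor query in feature space. The example below is only an illustration: the two-dimensional feature vectors are made up by hand rather than extracted from any network, and `similarity_heatmap` with its `1 / (1 + distance)` scoring is a hypothetical choice, not the method used for TossingBot.

```python
import math


def similarity_heatmap(features, query_rc):
    """Score every pixel by its feature-space closeness to a query pixel.

    `features` is a grid (list of rows) of per-pixel feature vectors.
    The score is 1 / (1 + Euclidean distance), so the query pixel scores
    exactly 1.0 and dissimilar pixels approach 0.
    """
    query_row, query_col = query_rc
    query = features[query_row][query_col]
    return [[1.0 / (1.0 + math.dist(query, feat)) for feat in row]
            for row in features]


if __name__ == "__main__":
    # Made-up features: two "ball-like" pixels, two "block-like" pixels,
    # and background everywhere else.
    ball_a, ball_b = (1.0, 0.1), (0.9, 0.2)
    block, background = (0.1, 1.0), (0.0, 0.0)
    grid = [
        [background, ball_a, background, block],
        [background, background, background, background],
        [ball_b, background, block, background],
    ]
    for row in similarity_heatmap(grid, (0, 1)):
        print(" ".join(f"{score:.2f}" for score in row))
```

Pixels whose features lie close to the query pixel's features come out "hot", regardless of where they sit in the image, which is how objects of the same category can be located from a single example pixel.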
These emergent features were learned from scratch, with no explicit supervision beyond the task-level objectives of grasping and throwing. Yet they appear to be sufficient for the system to distinguish between object categories (such as the table tennis balls and markers mentioned earlier). The experiment speaks to a broader question in machine vision: how should robots learn the semantics of the visual world? In classical computer vision, semantics are usually defined in advance, using human-curated image datasets and manually constructed categories. Our results suggest, however, that a model can implicitly acquire object-level semantics from physical interaction alone, as long as those semantics matter for the task at hand. The more complex the interactions, the finer the resolution of the semantics. For general-purpose intelligent robots, it may be enough to develop their own semantic concepts through interaction, without human intervention.
Limitations and Prospects
Although TossingBot's results are promising, it still has limitations. For example, it assumes that every object is sturdy enough to withstand the impact of landing after a throw. Further work is needed to learn throws that account for fragile objects, or to train other robots to catch objects in a way that cushions the landing. In addition, TossingBot infers its control parameters from visual data alone. Exploring additional senses, such as force-torque or touch, could help the system respond better to new objects.
The combination of physics and deep learning behind TossingBot raises an interesting question: what other areas could benefit from residual physics? Extending this idea to other types of tasks and interactions is a promising direction for future research.