As another step toward enabling robots to work effectively in complex environments, robotics researchers from NVIDIA have developed a novel deep learning-based system that allows a robot to perceive household objects in its environment for grasping the objects and interacting with them. With this technique, the robot can perform simple pick-and-place operations on known household objects, such as handing an object to a person or grasping an object out of a person’s hand.
The research allows robots to precisely infer the pose of objects around them from a standard RGB camera. Knowing the 3D position and orientation of objects in a scene, often referred to as 6-DoF (degrees of freedom) pose is critical, as it allows robots to manipulate objects even when those objects are not in the same place every time.
The algorithm aims to solve a disconnect in computer vision and robotics, namely, that most robots currently do not have the perception they need to be able to handle disturbances in the environment. This work is important because it is the first time in computer vision that an algorithm trained only on synthetic data (generated by a computer) can beat a state-of-the-art network trained on real images for object pose estimation on several objects of a standard benchmark. Synthetic data has the advantage over
real data in that it is possible to generate an almost unlimited amount of labeled training data for deep neural networks.
“We want robots to be able to interact with their environment in a safe and skillful manner,” said Stan Birchfield, a Principal Research Scientists at NVIDIA. “With our algorithm, and a single image, a robot can infer the 3D pose of an object for grasping and manipulating it,” he explained.
“Most industrial robots being sold today lack perception, they don’t really have a sense of the world around them,” Birchfield explained. “We’re laying the groundwork for the next generation robot, and we’re a step closer to collaborative robots with this work.”
Using NVIDIA Tesla V100 GPUs on a DGX Station, with the cuDNN -accelerated PyTorch deep learning framework, the researchers trained a deep neural network on synthetic data generated by a custom plugin developed by NVIDIA for Unreal Engine. This plugin is publicly available for other researchers to use.
“Specifically, we use a combination of non-photorealistic domain randomized (DR) data and photorealistic data to leverage the strengths of both,” the researchers stated in their paper. “These two types of data complement one another, yielding results that are much better than those achieved by either alone. Synthetic data has an additional advantage in that it avoids overfitting to a particular dataset distribution, thus producing a network that is robust to lighting changes, camera variations, and backgrounds,” the team explained.
Example images from the domain randomized (left) and photorealistic (right) datasets used for training.
The NVIDIA team was comprised of researchers Jonathan Tremblay, Thang To, Balakumar Sundaralingam, Yu Xiang, Dieter Fox, and Stan Birchfield. More insights are available by reading their paper titled, “Deep Object Pose Estimation for Semantic Robotic Grasping of Household Objects” and checkout the video.