Recently, RGBD-based category-level 6D object pose estimation has made great progress, but its reliance on depth information hinders wider application. To address this problem, this paper proposes a new method, the Object Level Depth reconstruction Network (OLD-Net), which takes only RGB images as input for category-level 6D object pose estimation. We directly predict object-level depth from a monocular RGB image by deforming category-level shape priors into object-level depth and the canonical NOCS representation. Two new modules, Normalized Global Position Hints (NGPH) and Shape-aware Decoupled Depth Reconstruction (SDDR), are introduced to learn accurate object-level depth and shape representations. Finally, the 6D object pose is solved by aligning the predicted canonical representation with the back-projected object-level depth. Extensive experiments on the CAMERA25 and REAL275 datasets show that our model achieves state-of-the-art performance.


The main work of this paper

In this paper, a new method for RGB-based category-level 6D object pose estimation is proposed: the Object Level Depth reconstruction Network (OLD-Net). The figure above shows the main pipeline of OLD-Net. Specifically, object-level depth and the NOCS representation are predicted simultaneously from the input RGB image, and the two are aligned to solve the 6D object pose. Different from previous methods that predict the depth of the object region by reconstructing a mesh, this paper adopts an end-to-end method that directly predicts the observed depth of the object from the RGB image.

To obtain the depth of object regions, a straightforward approach is to predict scene-level depth. However, due to the diversity of the field of view, the predicted scene-level depth is usually coarse, so object shape details are lost and pose estimation performance suffers. To address this issue, we reconstruct object-level depth directly by deforming category-level shape priors. Compared with a predicted scene-level depth map, the reconstructed object-level depth is cheaper to compute and better preserves shape details, which benefits the subsequent alignment between the object-level depth and the NOCS representation.

To better reconstruct object-level depth, a new module, Normalized Global Position Hints (NGPH), is proposed in OLD-Net to balance scene-level global information and local shape details. NGPH consists of the 2D detection results normalized with the camera intrinsics; it provides global position cues for the absolute depth of objects in the scene while generalizing to images captured by different cameras. Furthermore, shape details and absolute depth are predicted with the Shape-aware Decoupled Depth Reconstruction (SDDR) scheme. SDDR uses two independent deep networks to decouple absolute depth prediction into shape points and a depth translation. Intuitively, the shape points preserve shape details, while the depth translation predicts the absolute object center.
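The decoupling idea can be sketched in a few lines: a per-pixel "shape" head predicts zero-mean relative depth, a separate head predicts a single absolute center depth, and their sum is the object-level depth. This is a numpy stand-in for the paper's MLPs; all names, shapes, and magnitudes below are our own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
N, C = 1024, 64                         # sampled object pixels, feature dim
pixel_feat = rng.normal(size=(N, C))    # hypothetical per-pixel features

def shape_head(f):
    # stand-in for the MLP predicting per-pixel "shape point" depth;
    # centering removes any absolute offset, keeping only shape detail
    z = f @ rng.normal(size=(C,)) * 0.01
    return z - z.mean()

def translation_head(f):
    # stand-in for the MLP predicting one absolute depth for the object center
    return 1.5 + 0.1 * float(np.tanh(f.mean()))

z_shape = shape_head(pixel_feat)        # relative shape detail per pixel
t_depth = translation_head(pixel_feat)  # single scalar for the whole object
z_obj = z_shape + t_depth               # recombined object-level depth
```

The point of the split is that errors in the (hard) absolute-depth estimate cannot distort the (comparatively easy) relative shape, and vice versa.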

Besides depth, we further adapt an RGBD-based method to predict the NOCS representation of the target object, and use a discriminator during training to improve reconstruction quality. After both the NOCS representation and the observed object-level depth are predicted, we back-project the object-level depth into a point cloud and align it with the NOCS representation using the Umeyama algorithm to solve the 6D object pose. We conduct extensive experiments on the CAMERA25 and REAL275 datasets, and the results show that our method achieves state-of-the-art performance.


The main structure of the network

The overall network architecture of OLD-Net is shown in the figure above.

Our pipeline takes an image and a shape prior as input. The image is cropped by a trained detector (Detect-Net) so that it contains object-specific information. An encoder-decoder network with mean embeddings is used to predict the shape prior of each object category, which accounts for intra-category variation.

Then, the image and shape prior are fed into OLD-Net to reconstruct the object-level depth, as shown at the top of the figure above. In addition, OLD-Net takes the 2D detection results and the camera intrinsics from Detect-Net as input and normalizes them into NGPH. A shape-aware decoupled depth reconstruction scheme is adopted in OLD-Net to preserve both the object's shape details and its absolute center.

Finally, the NOCS representation is predicted using a deep network. We then back-project the object-level depth into a point cloud and recover the object pose using the Umeyama algorithm.
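The final alignment step can be made concrete: the Umeyama algorithm recovers the similarity transform (scale, rotation, translation) between the predicted canonical NOCS points and the back-projected depth points. A self-contained numpy version of the standard formulation (the post does not spell out an implementation, so treat this as a sketch):

```python
import numpy as np

def umeyama(src, dst):
    """Similarity transform (s, R, t) minimizing ||dst - (s * R @ src + t)||^2.
    src, dst: (N, 3) corresponding points, e.g. NOCS coordinates vs. the
    back-projected object-level depth."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)                 # cross-covariance
    U, S, Vt = np.linalg.svd(cov)
    d = np.sign(np.linalg.det(U @ Vt))         # guard against reflections
    D = np.diag([1.0, 1.0, d])
    R = U @ D @ Vt                             # optimal rotation
    var_s = (xs ** 2).sum() / len(src)
    s = np.trace(np.diag(S) @ D) / var_s       # optimal scale
    t = mu_d - s * R @ mu_s                    # optimal translation
    return s, R, t
```

Because NOCS coordinates are normalized, the recovered scale also gives the metric object size, while R and t form the 6D pose.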



As shown in the OLD-Net structure in the figure above, OLD-Net takes the image, shape prior, and NGPH as input, and first uses two MLPs and one CNN to learn image feature maps, prior features, and position features. Then, using these features, the shape points and the depth translation are predicted simultaneously with the Shape-aware Decoupled Depth Reconstruction (SDDR) scheme. Finally, the shape points and depth translation are recombined to obtain the object-level depth. Next, we introduce NGPH and SDDR in detail.

Normalized global position hints

Our method aims to predict high-quality object-level depth directly from images. The most straightforward way to achieve this is to predict scene-level depth maps. However, predicting scene-level depth maps from raw images is computationally expensive, and it can also lose object shape details. Shape details are very important for our pipeline, because we need to align two 3D representations to recover the object pose. Therefore, we choose to predict object-level depth with the cropped object image as input. However, due to the cropping and resizing operations, the image loses the absolute global position information of the object, resulting in scale ambiguity in the predicted depth. To this end, we propose NGPH, which addresses this problem by providing absolute global position information and resolving the scale ambiguity.

We form NGPH from the 2D bounding box (l, t, r, b) output by Detect-Net, where the four values are the left, top, right, and bottom coordinates of the box. This information is sufficient for the network to infer scale cues: for example, if all images were captured by the same camera, it allows the absolute depth of the object to be recovered. However, images are usually collected by different cameras, and inferring the absolute depth of an object from a monocular RGB image depends on the camera itself. Therefore, we inject the camera intrinsics into NGPH, so that the trained network also generalizes to images captured by other cameras. Concretely, we normalize the 2D bounding box with the camera intrinsics into canonical coordinates, where (cx, cy) denotes the camera's optical center and (fx, fy) the focal lengths. The first two terms of NGPH normalize the bounding box size (r − l, b − t) by the focal lengths, removing the scale ambiguity caused by object size. The last four terms normalize the object center using the focal lengths and the bounding box size, removing ambiguity while preserving position information. Although simple, NGPH is indispensable for object-level depth reconstruction, as the experimental results confirm.
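The exact six-term formula is not reproduced in this write-up, but the description above (box size over focal length; box center over focal length and over box size) pins it down closely enough for a sketch. The term ordering and signs below are our reconstruction, not the paper's verbatim equation:

```python
import numpy as np

def ngph(l, t, r, b, fx, fy, cx, cy):
    """Normalized Global Position Hints built from a 2D box (l, t, r, b)
    and camera intrinsics (fx, fy, cx, cy). Our reconstruction of the
    six terms described in the text."""
    ox, oy = (l + r) / 2.0, (t + b) / 2.0          # box center
    return np.array([
        (r - l) / fx, (b - t) / fy,                # box size over focal length
        (ox - cx) / fx, (oy - cy) / fy,            # center over focal length
        (ox - cx) / (r - l), (oy - cy) / (b - t),  # center over box size
    ])
```

Dividing by the focal length is what makes the hint camera-agnostic: the same physical object at the same depth yields the same normalized values regardless of the sensor.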

Shape-aware decoupled depth reconstruction

The features used in OLD-Net are the reshaped image feature matrix (one row per sampled pixel), the position feature, and the prior feature. MLPs and adaptive pooling are also applied to obtain a global image feature and a global prior feature.

Shape point prediction: We adopt the idea of Shape Prior Deformation (SPD) to reconstruct the shape points, which provides stronger object shape constraints for the model. Specifically, with the above features, the network learns a deformation field and an assignment matrix, and deforms the shape prior into the back-projected point cloud of the object-level depth. To learn the deformation field, we tile the global image and prior features once per prior point and concatenate them with the per-point prior features, then feed the result into an MLP; the assignment matrix is learned analogously, tiling the global features once per pixel and concatenating them with the per-pixel image features before another MLP. In this paper, we use shape priors to predict object-level depth, providing guidance for future RGB-based work.
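In SPD-style reconstruction, the output point cloud is an assignment-weighted combination of the deformed prior points. A minimal numpy sketch of that recombination step (shapes and noise magnitudes are our choices; the real model predicts the deformation field and assignment matrix with MLPs):

```python
import numpy as np

rng = np.random.default_rng(0)
Np, Nm = 1024, 1024                            # prior / output point counts
prior = rng.uniform(-0.5, 0.5, size=(Np, 3))   # category-level shape prior
D = 0.05 * rng.normal(size=(Np, 3))            # predicted deformation field

# Row-stochastic assignment matrix A, as in SPD: each output point is a
# convex combination of the deformed prior points.
logits = rng.normal(size=(Nm, Np))
A = np.exp(logits)
A /= A.sum(axis=1, keepdims=True)

recon = A @ (prior + D)                        # deformed, reassembled cloud
```

Because each output point is a convex combination, the reconstruction cannot stray outside the deformed prior's bounding region, which is exactly the shape constraint the text credits SPD with.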

Depth translation prediction: To learn the absolute position of the object center, we use a standalone MLP to learn the depth translation. The concatenated global features are fed in, and the output is a single value representing the absolute depth of the object's center.

The SDDR scheme preserves object shape details in three ways. First, since we only use image patches to reconstruct object-level depth, the model can focus on the shape of the object rather than the geometry of the entire scene. Second, the shape prior provides strong constraints on the object's shape, making it easier to recover shape details. Third, the absolute object center and the object shape are learned separately, each receiving dedicated attention.

After the shape points and the depth translation are predicted, the object-level depth Z is obtained by adding the depth translation to the third (depth) component of the shape points. Note that we choose to supervise Z rather than the back-projected point cloud: on the one hand, Z is easier for the network to learn during training; on the other hand, when Z is back-projected into a point cloud for alignment, the 2D coordinates of the object provide additional constraints on the global position, which also benefits the final pose recovery step.
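Back-projecting the supervised depth Z into a camera-space point cloud uses the standard pinhole relations; this is the step that lets the object's 2D coordinates constrain its global position:

```python
import numpy as np

def backproject(u, v, z, fx, fy, cx, cy):
    """Back-project pixel coordinates (u, v) with predicted depth z into 3D
    camera coordinates via the pinhole model."""
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1)
```

Since (u, v) come from the detected object region rather than being predicted, any error in z shifts the point along the camera ray only, which is the extra constraint the text refers to.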


NOCS prediction

We also predict the NOCS representation of the target object in the pipeline; this canonical representation is aligned with the object-level depth to recover the 6D object pose. To predict the NOCS representation, we back-project Z into a point cloud and feed it into an MLP to obtain depth features. Taking the depth, image, and prior features as input, we again use SPD to predict the NOCS representation, similar to the object-level depth reconstruction. However, we find that in some cases the predicted NOCS representation is not realistic enough, which hurts the final 6D pose estimation accuracy. Therefore, we adopt an adversarial training strategy: we design a discriminator to judge whether the predicted NOCS representation is sufficiently realistic. The discriminator is trained to distinguish ground-truth NOCS representations from predicted ones, while the NOCS prediction network is trained to fool the discriminator, and the two networks' parameters are updated iteratively. Through this competition both networks become stronger, and the predicted NOCS representation becomes increasingly realistic.
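A minimal sketch of this adversarial objective (a standard GAN loss pairing; the paper's exact formulation may differ, and the probabilities below are placeholders for real discriminator outputs on ground-truth vs. predicted NOCS point clouds):

```python
import numpy as np

def bce(p, y):
    # binary cross-entropy on discriminator probabilities p against label y
    p = np.clip(p, 1e-7, 1 - 1e-7)
    return float(-(y * np.log(p) + (1 - y) * np.log(1 - p)).mean())

# Placeholder discriminator outputs: D(real NOCS) and D(predicted NOCS).
d_real = np.array([0.9, 0.8])
d_fake = np.array([0.2, 0.3])

loss_D = bce(d_real, 1.0) + bce(d_fake, 0.0)   # discriminator: real vs. fake
loss_G = bce(d_fake, 1.0)                      # predictor: fool the critic
```

Minimizing loss_G pushes the NOCS predictor toward outputs the discriminator scores as real, which is the "increasingly realistic" dynamic described above.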


Loss function

For object-level depth reconstruction, we use an L1 loss between the predicted Z and the ground truth. For the NOCS representation prediction, the loss combines a smooth-L1 loss between the reconstructed and ground-truth NOCS representations, a chamfer distance loss, a cross-entropy loss that encourages a peaked distribution in the assignment matrix, and an L2 regularization loss. The total loss is the weighted sum of these reconstruction losses and the adversarial loss.
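Of the loss terms above, the chamfer distance is the least standard to implement; a brute-force numpy version (fine at the point counts used here, and the same metric reappears later for reconstruction-quality evaluation):

```python
import numpy as np

def chamfer(P, Q):
    """Symmetric chamfer distance between point clouds P (N, 3) and Q (M, 3):
    mean nearest-neighbor distance in both directions."""
    d = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=-1)  # (N, M) pairs
    return float(d.min(axis=1).mean() + d.min(axis=0).mean())
```

Because it matches each point to its nearest neighbor, chamfer distance penalizes shape deviation without requiring point-to-point correspondences between prediction and ground truth.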



We implement our method in PyTorch and optimize it with the Adam optimizer. During training we randomly select 1024 pixels to predict depth. Detect-Net is a Mask R-CNN, and image features are learned with a PSPNet with a ResNet-18 backbone. The shape prior has 1024 points, and we set the feature dimension C = 64. The model is trained for 50 epochs with a batch size of 96. The main network uses an initial learning rate of 0.0001, decayed by a factor of 0.1 at the 40th epoch. We conduct experiments on the CAMERA25 and REAL275 datasets.
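The step schedule described above is simple enough to pin down as a function (milestone and decay factor taken from the text; PyTorch users would typically reach for torch.optim.lr_scheduler.MultiStepLR instead):

```python
def lr_at(epoch, base_lr=1e-4, decay=0.1, milestone=40):
    """Learning rate at a given epoch: base 1e-4, dropped by 0.1x at epoch 40."""
    return base_lr * (decay if epoch >= milestone else 1.0)
```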

Reconstruction quality evaluation:

In our work, the main idea is to reconstruct the object-level depth and the NOCS representation, so we first evaluate the reconstruction quality of our method in Table 1. We compute the chamfer distance between the back-projected depth and the ground truth to verify the quality of the depth reconstruction, and likewise evaluate the NOCS prediction quality via the chamfer distance between the predicted and ground-truth NOCS values.

As shown in the table above, for object-level depth reconstruction the error on the REAL275 dataset is less than 2 cm, and for NOCS prediction the error is also close to 2 cm. This is a relatively small error compared with typical object sizes and scene depths, so we can conclude that our method indeed reconstructs object-level depth and predicts the NOCS representation with good quality. On the CAMERA25 dataset, the NOCS prediction error for most classes is still under 2 cm, but the object-level depth reconstruction error increases to 3-5 cm. The reason may be the larger variance of the depth distribution in the larger synthetic dataset. This observation also suggests that reconstructing object-level depth is harder than predicting the NOCS representation.

Quantitative results of 6D pose estimation:

We quantitatively compare our method with the state-of-the-art methods in the table below.

We first compare our method with Lee et al. on the CAMERA25 dataset. Lee et al. predict depth by first reconstructing a mesh and then rendering it into a depth map; in contrast, we reconstruct the object-level depth directly, which is simpler and more efficient. Our method outperforms Lee et al. on 4 out of 6 metrics. On the strictest 10◦10cm metric, our method surpasses Lee et al. by 4.2 points, a significant improvement. On the IoU25 and IoU50 metrics our results are slightly lower than theirs, but still comparable. These results suggest that reconstructing object-level depth with our SDDR scheme and NGPH is a better choice than reconstructing object meshes. The main reason may be that with the depth translation and shape points decoupled, it is easier for the network to learn useful information such as object shape details and absolute object centers.

To further validate our motivation and the benefit of reconstructing object-level depth over estimating scene-level depth, we compare our method with two scene-level depth estimation baselines in the table below. The two baselines share the same encoder-decoder architecture; the difference is that during training, scene-level baseline-1 shares its encoder with the NOCS reconstruction branch, while scene-level baseline-2 trains the depth estimator independently. Both networks are carefully tuned for best performance. The table shows that OLD-Net significantly outperforms both baselines. The reason may be that the object-level depth predicted by OLD-Net preserves shape details much better than coarse scene-level depth, and these details are crucial for the NOCS-depth alignment process.

All these results demonstrate the superiority of our method. In addition, we show average precision (AP) curves under different thresholds for 3D IoU, rotation error, and translation error in the following figure, comparing our method with the RGBD-based NOCS method. As the figure shows, our method performs well in IoU and rotation for all categories. The strong rotation prediction is largely due to our decoupling of the shape points from the depth to preserve shape details: during alignment, the accuracy of the rotation depends primarily on the quality of the object's shape. As a result, our model achieves performance comparable to RGBD-based methods in rotation prediction. In contrast, our translation results are relatively low compared with RGBD-based methods, because recovering the absolute global position of objects from a monocular RGB image is an ill-posed problem. Future work should therefore pay more attention to obtaining more accurate absolute depth predictions.

Qualitative Results of 6D Pose Estimation:

To qualitatively analyze the effectiveness of our method, we visualize the estimated bounding boxes in the following figure, with results on both synthetic and real data. It can be seen that OLD-Net predicts object bounding boxes accurate enough for augmented reality applications. We also show some failure cases in the figure: OLD-Net can sometimes miss objects, which we leave as future work.


Ablation experiment

Vanilla SPD:

We employ SPD to learn shape points in SDDR. One might wonder whether the good performance of our model comes from SPD rather than from the other modules we designed. Therefore, we report the performance when using only the vanilla SPD module (no OLD-Net, no SDDR; SPD directly predicts the back-projected point cloud of the object-level depth). Without our other designs, the vanilla SPD performs poorly.

The impact of the SDDR scheme:

In this paper, SDDR is introduced to decouple object-level depth into a depth translation and shape points. Compared with the vanilla SPD, all versions of our model in the table below use the SDDR scheme, and their performance is accordingly much better.

In row 3 of the table, instead of using two separate modules to learn the depth translation and shape points independently, we use a single module to directly predict the absolute object-level depth. The IoU25 and IoU50 metrics drop considerably. This may be because, without decoupling the depth, the network loses object details such as the aspect ratio or specific object parts. Furthermore, in row 4, we show the results of using an MLP instead of SPD to predict shape points, i.e., directly regressing the NOCS representation and the object-level depth. There is a clear drop on all metrics, which proves that adopting SPD in SDDR is necessary: SPD provides the model with strong constraints on the object's shape. Note that in rows 3 and 4, although we removed some designs, the 2D coordinates of the pixels belonging to the object are still used for back-projection (also part of SDDR), which gives the absolute depth an extra constraint. Otherwise, the performance is even worse, as shown in row 2 of the table, which directly predicts the object point cloud. In conclusion, the SDDR scheme plays an important role both in preserving object shape details and in predicting absolute object centers in OLD-Net.

Impact of NGPH:

Since our model takes only cropped RGB images to predict depth and preserve shape details, global position information would otherwise be lost. To remedy this, we inject NGPH into the network. In row 5 of the table, we remove NGPH to study its impact: all metrics drop substantially. This is because, without NGPH, it is difficult for the network to predict absolute depth. Although the relative positions between 3D points can still be inferred from the image, a wrong absolute depth makes it difficult to accurately recover the object pose through alignment.

The impact of adversarial training:

We employ an adversarial training strategy to improve the quality of the predicted NOCS representations. When it is removed, as shown in the penultimate row of the table, all metrics except the 10◦ metric drop. This result demonstrates that adversarial training is necessary for good performance, and that both the quality of the NOCS representation and that of the object-level depth matter; neither can be ignored.



This paper proposes OLD-Net, a new RGB-based category-level 6D object pose estimation network. Using shape priors to directly predict object-level depth is the key idea of our work. Normalized Global Position Hints and a shape-aware decoupled depth reconstruction scheme are introduced in OLD-Net, and the canonical NOCS representation of objects is also predicted in the pipeline with adversarial training. Extensive experiments on real and synthetic datasets demonstrate that our method achieves new state-of-the-art performance.

Editor: Huang Fei
