Learning Depth with Very Sparse Supervision

Tag: Paper Reading

Published on: 28 Dec 2020


Introduction

Depth estimation is a crucial step towards inferring scene geometry from 2D images. The goal in monocular depth estimation is to predict the depth value of each pixel given only a single RGB image as input. Previously, methods based on multi-view geometry, supervised learning, and unsupervised learning have been applied to this task. This work [1], a supervised-learning method, was inspired by the idea of learning via interaction. Unlike the majority of supervised-learning methods, it requires only very sparse supervision - down to a single pixel of ground-truth data - to achieve results competitive with networks trained under dense supervision and with traditional multi-view geometry methods.

Contribution of This Paper

The authors propose a novel global-local architecture that estimates depth in two steps: first, the input image pair is passed into a convolutional encoder in the global module to generate a low-dimensional vector; then, the optical flow is passed into the local module, which is conditioned on the vector from the global module and produces the final depth estimate.
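
To make this two-step design concrete, below is a minimal PyTorch sketch. All layer sizes are illustrative guesses, and the conditioning is simplified to channel-wise concatenation; as described later, the paper instead turns the global vector into the local module's weights.

```python
import torch
import torch.nn as nn

class GlobalLocalDepthNet(nn.Module):
    """Minimal sketch of the two-step global-local design (sizes are guesses)."""

    def __init__(self, latent_dim=128):
        super().__init__()
        # Global module: convolutional encoder ending in average pooling,
        # producing one low-dimensional vector per image pair.
        self.global_encoder = nn.Sequential(
            nn.Conv2d(6, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, latent_dim, 3, stride=2, padding=1),
            nn.AdaptiveAvgPool2d(1),  # collapse spatial dims into a global vector
        )
        # Local module: a per-pixel network over the optical flow,
        # conditioned on the global vector (here by simple concatenation).
        self.local_module = nn.Sequential(
            nn.Conv2d(2 + latent_dim, 64, 1), nn.ReLU(),
            nn.Conv2d(64, 64, 1), nn.ReLU(),
            nn.Conv2d(64, 1, 1),  # one depth value per pixel
        )

    def forward(self, image_pair, flow):
        # image_pair: (B, 6, H, W) concatenated RGB frames; flow: (B, 2, H, W)
        z = self.global_encoder(image_pair)  # (B, latent_dim, 1, 1)
        z = z.expand(-1, -1, flow.shape[2], flow.shape[3])
        return self.local_module(torch.cat([flow, z], dim=1))
```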

Network Architecture

This design has an analogy in classical projective geometry, where the global and local modules correspond respectively to relative camera pose estimation and the triangulation algorithm. In the classical linear triangulation method, in order to recover a 3D point \(M\) from a 2D point correspondence \(\{m_1 \leftrightarrow m_2\}\), one first needs the corresponding camera matrices \(P_1\) and \(P_2\). Therefore, in this network, both the point correspondences from the input optical flow and the output vector of the global module are fed into the local module for the final depth estimate. It is also worth noting that, for a given image pair, the same camera matrices are shared by every corresponding pixel. This explains why average pooling is applied after the final convolution layer in the encoder. This bottleneck layer also enables sparse supervision and improves generalization. In addition, being a learning-based method, the local network estimates depth without explicitly applying geometric equations, enabling it to pick up pixel-level depth cues that classical methods miss.
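
For reference, the classical linear (DLT) triangulation solves, for each correspondence,

\[
m_1 \times (P_1 M) = 0, \qquad m_2 \times (P_2 M) = 0,
\]

where each cross product contributes two independent linear equations in \(M\). Stacking them yields a homogeneous system \(A M = 0\), and \(M\) is recovered as the right singular vector of \(A\) associated with the smallest singular value.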

From the training perspective, this method no longer assumes the availability of dense supervision, which greatly eases the burden of data collection compared to traditional learning algorithms. It also opens up applications such as learning with limited sensing, or from noisy datasets.

The current methods for tackling the depth estimation problem can be categorized into multi-view geometry, supervised-learning, and unsupervised-learning methods.

After decades of research, multi-view geometry methods are stable against large motion and require little computation. However, for this type of algorithm, feature engineering is a laborious process. They also rely heavily on projective geometry, which prevents them from exploiting non-motion-related cues. Supervised learning does not have such limitations and has been widely used for this task. However, data collection, especially for large real scenes, is a challenging problem. Some researchers have tried to solve it by using simulated data, but then generalization of the learned network becomes an issue. As an alternative, unsupervised (or self-supervised) learning no longer requires ground-truth labels for training. This type of network makes use of classical projective geometry and photometric consistency across frames. However, such networks are in general harder to train and more sensitive to hyperparameters.

This work, being a supervised-learning method, learns a 3D representation without explicit geometric constraints. However, unlike other networks with an encoder-decoder architecture, the encoder's output vector is transformed into decoder weights by a linear perceptron, while the optical flow is used as the decoder input instead. This design choice is inspired by classical geometric methods and enables sparse supervision.
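
A minimal sketch of this weight-generating conditioning is shown below, assuming a tiny two-layer local network of 1x1 convolutions over the flow; the layer sizes and the batch-size-1 simplification are mine, not the paper's.

```python
import torch.nn as nn
import torch.nn.functional as F

class WeightGeneratingLocalModule(nn.Module):
    """Sketch: a linear perceptron maps the global vector to the parameters
    of a small 1x1-conv network, which is then run over the optical flow.
    All sizes are illustrative, not the paper's."""

    def __init__(self, latent_dim=128, flow_channels=2, hidden=32):
        super().__init__()
        self.flow_channels, self.hidden = flow_channels, hidden
        # Total parameter count: first conv (weights + bias) + output conv.
        n_params = hidden * flow_channels + hidden + hidden + 1
        self.to_params = nn.Linear(latent_dim, n_params)

    def forward(self, flow, z):
        # flow: (1, 2, H, W); z: (1, latent_dim) -- batch size 1 for simplicity
        p = self.to_params(z).squeeze(0)
        c, h = self.flow_channels, self.hidden
        w1 = p[: h * c].view(h, c, 1, 1)
        b1 = p[h * c : h * c + h]
        w2 = p[h * c + h : h * c + 2 * h].view(1, h, 1, 1)
        b2 = p[h * c + 2 * h :]
        x = F.relu(F.conv2d(flow, w1, b1))
        return F.conv2d(x, w2, b2)  # per-pixel depth prediction
```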

Strong and Weak Points of This Paper

In this new problem setting of learning from sparse supervision, the network predicts better depth than several baseline networks. In fact, as the supervision decreases from dense to a single pixel, this network outperforms the baselines by more than 20%. However, limited by the bottleneck design, the network shows worse performance in the dense case.
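
To make the sparse-supervision setting concrete, here is a minimal sketch of a loss that supervises only a handful of randomly sampled pixels per image; the paper's exact loss and pixel-sampling scheme may differ.

```python
import torch

def sparse_depth_loss(pred, gt, num_pixels=1):
    """L1 loss evaluated at `num_pixels` randomly chosen pixels per image.
    pred, gt: (B, 1, H, W) depth maps."""
    b, _, h, w = pred.shape
    idx = torch.randint(0, h * w, (b, num_pixels), device=pred.device)
    pred_sampled = pred.view(b, -1).gather(1, idx)
    gt_sampled = gt.view(b, -1).gather(1, idx)
    return (pred_sampled - gt_sampled).abs().mean()
```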

The authors have also investigated the network's robustness in changing environments by showing its competitive performance when the focal length and center of projection change between views. However, its robustness against changing lighting conditions remains an open question, as such changes occur frequently in practice.

An advantage of this method is its focus on transferable representations. The network makes effective use of optical flow as its input for depth estimation. The authors further demonstrate that by adding a simple MLP head, camera ego-motion can also be estimated, which can be useful for downstream tasks.
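
Such a head could be as simple as the following sketch, mapping the global vector to a 6-DoF relative pose; the dimensions and parameterization here are assumptions, not the paper's.

```python
import torch.nn as nn

# Hypothetical ego-motion head on top of the 128-dim global vector:
# 3 translation + 3 rotation parameters for the relative camera motion.
ego_motion_head = nn.Sequential(
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 6),
)
```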

Furthermore, the local module is robust against optical flow outliers when producing depth predictions. The authors attribute this to the global parameters also storing information about the scene, and to the local module having a receptive field larger than 1. Additional ablation studies on these two factors would make this claim more convincing.

Lastly, since the purpose of this work is to study a new problem setting, readers should not expect a general advantage over state-of-the-art methods. For this reason, the authors only followed the evaluation pipeline of a relatively old paper from 2016, and included only two indoor datasets and one simulated dataset. Other widely used datasets, e.g. Sintel, and outdoor datasets, e.g. KITTI, are unfortunately not evaluated.

Possible Follow-up Work

Limited by the nature of this network, it has difficulties learning from motions such as pure rotation. One possible solution is to include a recurrent network to utilize information from the past.

Another common difficulty, shared with other learning-based methods, is generalization across datasets; this could be mitigated by training on multiple datasets.

This network could also be applied to tasks like robot manipulation: combined with reinforcement learning, it could enable a robot to actively learn 3D shapes with an onboard monocular camera and haptic sensors.

Another possible application is enabling autonomous cars to learn depth in a low-cost way. Many modern cars are equipped with a front-facing camera and a radar sensor, and by combining the sparse depth information from the radar with camera images, a car could learn depth without additional data sources.

References

[1] A. Loquercio, A. Dosovitskiy, and D. Scaramuzza, "Learning Depth with Very Sparse Supervision," arXiv preprint, 2020.


© Chengxin Wang. All rights reserved.