Qualitative Results
Browse KITTI predictions interactively. The 2D panel shows detection overlays with IoU scores. The 3D viewer renders LiDAR point clouds with ground truth (green) and predictions (red). Use the mouse to orbit and scroll to zoom.
Overview
Monocular 3D object detection, predicting the 3D position, size, and orientation of objects from a single camera image, is essential for mobile robotics yet remains challenging because a single 2D image is inherently ambiguous in depth. While 2D foundation models like MOBIUS excel at open-vocabulary recognition, they lack the geometric reasoning needed for 3D tasks.
This semester project, conducted jointly with Google Research and the ETH CVG Group (Prof. Marc Pollefeys), presents MOBIUS-3D: a framework that lifts the efficient MOBIUS 2D model into 3D by integrating a lightweight depth predictor and a depth-guided transformer decoder. The key insight is dense depth distillation, transferring geometric knowledge from the large DepthPro foundation model to a compact student network and enabling accurate 3D detection without costly dense LiDAR supervision.
The Challenge
A critical gap exists between 2D and 3D perception. 2D models train on millions of images spanning thousands of categories, while 3D datasets are sparse, expensive to annotate, and limited to a handful of classes (cars, pedestrians, cyclists). Existing 3D detectors are heavy, specialized architectures trained from scratch on this limited data; they cannot leverage the rich semantic knowledge of modern 2D foundation models.
The goal: lift a lightweight 2D model into 3D without sacrificing its efficiency or semantic richness, and bridge the 3D annotation gap by distilling geometric knowledge from a large depth foundation model.
Method
MOBIUS-3D adds three components to the frozen MOBIUS backbone: a depth predictor, a depth encoder, and a depth-guided transformer decoder.
1 · Depth Predictor
A DPT-style regression head fuses multi-scale backbone features to predict a dense depth map. It is supervised by distillation from the DepthPro foundation model, which provides a training signal for every pixel rather than only at sparse LiDAR points.
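A minimal sketch of such a head is shown below. It is illustrative only: layer names, channel widths, and the top-down fusion scheme are assumptions, not the project's exact implementation.

```python
# Illustrative DPT-style depth head: project each backbone stage to a common
# width, fuse coarse-to-fine by upsampling and adding, then regress dense depth.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthHead(nn.Module):
    def __init__(self, in_channels=(96, 192, 384, 768), dim=128):
        super().__init__()
        # 1x1 projections bring every backbone stage to a shared channel width.
        self.proj = nn.ModuleList([nn.Conv2d(c, dim, 1) for c in in_channels])
        # One fusion conv per merge step (coarser map merged into the next finer one).
        self.fuse = nn.ModuleList([nn.Conv2d(dim, dim, 3, padding=1) for _ in in_channels[:-1]])
        self.out = nn.Sequential(
            nn.Conv2d(dim, dim // 2, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(dim // 2, 1, 1), nn.Softplus(),   # keep predicted depth positive
        )

    def forward(self, feats):
        # feats: list of (B, C_i, H_i, W_i) feature maps, ordered fine -> coarse.
        x = self.proj[-1](feats[-1])                     # start from the coarsest stage
        for i in range(len(feats) - 2, -1, -1):
            # Upsample the coarser fused map and merge it with the finer stage.
            x = F.interpolate(x, size=feats[i].shape[-2:], mode="bilinear", align_corners=False)
            x = self.fuse[i](x + self.proj[i](feats[i]))
        return self.out(x)                               # (B, 1, H, W) dense depth map
```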
2 · Depth Encoder
The encoder transforms visual features into depth-aware tokens using deformable attention: each query attends to a small set of K sampling points instead of the full feature map, reducing complexity from O(N²) to O(N·K).
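The sketch below shows the sampling idea in a simplified single-scale, single-head form; the actual module is presumably a full multi-scale deformable attention layer, and all names and the offset scale here are assumptions.

```python
# Simplified deformable attention: each query samples K locations around its
# reference point and takes a learned weighted sum, giving O(N*K) cost.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableAttention(nn.Module):
    def __init__(self, dim=256, num_points=4):
        super().__init__()
        self.num_points = num_points
        self.offsets = nn.Linear(dim, 2 * num_points)    # per-query (dx, dy) offsets
        self.weights = nn.Linear(dim, num_points)        # per-query attention weights
        self.value_proj = nn.Conv2d(dim, dim, 1)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, queries, ref_points, feat):
        # queries:    (B, N, C)    query embeddings
        # ref_points: (B, N, 2)    reference locations in [0, 1] x [0, 1]
        # feat:       (B, C, H, W) value feature map
        B, N, C = queries.shape
        value = self.value_proj(feat)
        offsets = self.offsets(queries).view(B, N, self.num_points, 2)
        weights = self.weights(queries).softmax(-1)                    # (B, N, K)
        # Small offsets around each reference point, mapped to [-1, 1] for grid_sample.
        loc = (ref_points.unsqueeze(2) + 0.05 * offsets).clamp(0, 1) * 2 - 1
        sampled = F.grid_sample(value, loc, align_corners=False)       # (B, C, N, K)
        out = (sampled * weights.unsqueeze(1)).sum(-1).transpose(1, 2) # (B, N, C)
        return self.out_proj(out)
```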
3 · Depth-Guided Decoder
Object queries attend to 3D-aware positional embeddings generated by projecting 2D pixels into metric 3D space using the predicted depth and camera intrinsics.
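A sketch of how such 3D-aware positional embeddings could be built follows: each pixel is lifted to a metric 3D point X = d · K⁻¹ [u, v, 1]ᵀ using the predicted depth d and intrinsics K, then encoded by a small MLP. The helper names and MLP design are assumptions for illustration.

```python
# Illustrative 3D positional embedding: backproject pixels with predicted depth
# and camera intrinsics, then encode the resulting 3D points with an MLP.
import torch
import torch.nn as nn

def backproject(depth, intrinsics):
    # depth:      (B, 1, H, W) predicted metric depth
    # intrinsics: (B, 3, 3)    camera matrix K
    B, _, H, W = depth.shape
    v, u = torch.meshgrid(
        torch.arange(H, device=depth.device, dtype=depth.dtype),
        torch.arange(W, device=depth.device, dtype=depth.dtype),
        indexing="ij",
    )
    pix = torch.stack([u, v, torch.ones_like(u)], dim=0).reshape(1, 3, -1)  # (1, 3, H*W)
    rays = torch.linalg.inv(intrinsics) @ pix                               # (B, 3, H*W)
    points = rays * depth.reshape(B, 1, -1)                                 # metric 3D points
    return points.reshape(B, 3, H, W)

class PositionEmbedding3D(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, dim), nn.ReLU(inplace=True), nn.Linear(dim, dim))

    def forward(self, depth, intrinsics):
        pts = backproject(depth, intrinsics)                    # (B, 3, H, W)
        return self.mlp(pts.flatten(2).transpose(1, 2))         # (B, H*W, dim) embeddings
```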
Key Insight · Dense Depth Distillation
Standard monocular 3D detectors supervise depth with sparse LiDAR ground truth (< 5 % of pixels). We instead distill from DepthPro, a frozen teacher that provides calibrated, dense depth for every pixel. The teacher's predictions are aligned to metric scale with a least-squares (L2) scale-and-shift fit against the sparse LiDAR points. This single change drives the largest performance gain in the ablation study.
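The calibration step admits a closed-form solution; a minimal sketch under the stated assumptions (per-frame fit, hypothetical function name) is:

```python
# Align the teacher's dense depth to metric scale with a per-frame least-squares
# scale-and-shift fit against the frame's sparse LiDAR points.
import numpy as np

def align_scale_shift(teacher_depth, lidar_depth, lidar_mask):
    # teacher_depth: (H, W) dense depth from the frozen DepthPro teacher
    # lidar_depth:   (H, W) metric depth, valid only where lidar_mask is True
    # Solve min_{s, t} sum ||s * d_teacher + t - d_lidar||^2 over valid pixels.
    d = teacher_depth[lidar_mask]
    z = lidar_depth[lidar_mask]
    A = np.stack([d, np.ones_like(d)], axis=1)         # (M, 2) design matrix
    (s, t), *_ = np.linalg.lstsq(A, z, rcond=None)     # closed-form L2 solution
    return s * teacher_depth + t                       # metric-scale dense target
```

The calibrated dense map then serves as the distillation target for the depth predictor, so LiDAR is used only to fix scale, not as the per-pixel supervision signal.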
Results
Evaluated on the KITTI 3D Object Detection benchmark (Car class, IoU = 0.7). MOBIUS-3D achieves the best APBEV, the primary metric for 3D localization accuracy, across all difficulty levels.
| Method | AP3D Easy (%) | AP3D Mod. (%) | AP3D Hard (%) | APBEV Easy (%) | APBEV Mod. (%) | APBEV Hard (%) |
|---|---|---|---|---|---|---|
| MonoDETR | 28.05 | 20.76 | 16.76 | 37.60 | 27.28 | 23.38 |
| MonoDETRNext-A | 32.95 | 25.01 | 21.92 | — | — | — |
| MOBIUS-3D (Ours) | 31.92 | 24.22 | 22.58 | 41.83 | 31.64 | 30.04 |
Key Takeaways
- Best Hard AP3D (22.58 %) among all compared methods, on the most discriminating difficulty level.
- +4–7 % APBEV over MonoDETR across all difficulties, showing substantially better 3D localization.
- Dense distillation is the key factor: switching from sparse LiDAR to DepthPro distillation yields +7.71 % APBEV (Moderate), confirming that dense geometry from a foundation model far outperforms sparse ground truth.
Ablation: Depth Supervision Strategy
The choice of depth supervision signal is the single most impactful design decision. Dense distillation consistently dominates.
| Depth Supervision | AP3D Easy (%) | AP3D Mod. (%) | AP3D Hard (%) | APBEV Easy (%) | APBEV Mod. (%) | APBEV Hard (%) |
|---|---|---|---|---|---|---|
| None | 19.39 | 12.16 | 10.31 | 31.79 | 22.12 | 19.30 |
| Object-Level Only | 17.35 | 11.10 | 10.25 | 27.52 | 18.70 | 17.86 |
| Sparse LiDAR | 23.15 | 16.69 | 15.54 | 31.42 | 23.93 | 22.15 |
| Dense Distillation (Ours) | 31.92 | 24.22 | 22.58 | 41.83 | 31.64 | 30.04 |