ETH Zürich · Computer Vision & Geometry Group · Fall 2025

MOBIUS goes 3D

Efficient monocular 3D object detection via depth-guided transformers and dense depth distillation.

Python · PyTorch · Transformers · 3D Vision

Qualitative Results

Browse KITTI predictions interactively. The 2D panel shows detection overlays with IoU scores. The 3D viewer renders LiDAR point clouds with ground truth (green) and predictions (red). Use the mouse to orbit and the scroll wheel to zoom.

KITTI 2D detection panel


Overview

Monocular 3D object detection, predicting the 3D position, size, and orientation of objects from a single camera image, is essential for mobile robotics yet remains challenging because a single 2D image is inherently ambiguous in depth. While 2D foundation models like MOBIUS excel at open-vocabulary recognition, they lack the geometric reasoning needed for 3D tasks.

This semester project, conducted jointly with Google Research and the ETH CVG Group (Prof. Marc Pollefeys), presents MOBIUS-3D: a framework that lifts the efficient MOBIUS 2D model into 3D by integrating a lightweight depth predictor and a depth-guided transformer decoder. The key insight is dense depth distillation: transferring geometric knowledge from the large DepthPro foundation model to a compact student network enables accurate 3D detection without costly dense LiDAR supervision.

The Challenge

A critical gap exists between 2D and 3D perception. 2D models train on millions of images spanning thousands of categories, while 3D datasets are sparse, expensive to annotate, and limited to a handful of classes (cars, pedestrians, cyclists). Existing 3D detectors are heavy, specialized architectures trained from scratch on this limited data; they cannot leverage the rich semantic knowledge of modern 2D foundation models.

The goal: lift a lightweight 2D model into 3D without sacrificing its efficiency or semantic richness, and bridge the 3D annotation gap by distilling geometric knowledge from a large depth foundation model.

Method

MOBIUS-3D adds three components to the frozen MOBIUS backbone: a depth predictor, a depth encoder, and a depth-guided transformer decoder.

1 · Depth Predictor

A DPT-style regression head fuses multi-scale backbone features to predict a dense depth map. It is supervised via distillation from the DepthPro foundation model, which provides a training signal for every pixel rather than only at sparse LiDAR points.
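For illustration, here is a minimal PyTorch sketch of a DPT-style fusion head. The class name, channel widths, and coarse-to-fine fusion details are assumptions for the sketch, not the project's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthHead(nn.Module):
    """Minimal DPT-style fusion head (illustrative): projects multi-scale
    backbone features to a common width, fuses them coarse-to-fine, and
    regresses a dense depth map."""

    def __init__(self, in_channels=(192, 384, 768, 1536), width=256):
        super().__init__()
        # 1x1 projections bring every scale to the same channel width.
        self.proj = nn.ModuleList([nn.Conv2d(c, width, 1) for c in in_channels])
        # One refinement conv per fusion stage.
        self.fuse = nn.ModuleList(
            [nn.Conv2d(width, width, 3, padding=1) for _ in in_channels]
        )
        # Final regression to a single-channel depth map (kept positive via softplus).
        self.out = nn.Sequential(
            nn.Conv2d(width, width // 2, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(width // 2, 1, 1),
        )

    def forward(self, feats):
        # feats: list of (B, C_i, H_i, W_i) tensors, ordered fine -> coarse.
        x = self.fuse[-1](self.proj[-1](feats[-1]))   # start from the coarsest scale
        for i in range(len(feats) - 2, -1, -1):
            skip = self.proj[i](feats[i])
            x = F.interpolate(x, size=skip.shape[-2:], mode="bilinear",
                              align_corners=False)
            x = self.fuse[i](x + skip)                # residual-style fusion
        return F.softplus(self.out(x))                # dense depth, (B, 1, H, W)


if __name__ == "__main__":
    feats = [torch.randn(2, c, s, s) for c, s in
             zip((192, 384, 768, 1536), (64, 32, 16, 8))]
    print(DepthHead()(feats).shape)  # torch.Size([2, 1, 64, 64])
```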

2 · Depth Encoder

The encoder transforms visual features into depth-aware tokens using deformable attention: each query attends to a small set of K sampling points instead of the full feature map, reducing complexity from O(N²) to O(N·K).
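The single-head, single-level sketch below illustrates the sampling idea; `DeformableSampling`, the offset scale, and the number of points are assumptions made for the example, but the per-query K-point gather is what yields the O(N·K) cost.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableSampling(nn.Module):
    """Single-head, single-level sketch of deformable attention: each query
    attends to K learned sampling points instead of all N feature positions,
    so the cost is O(N*K) rather than O(N^2)."""

    def __init__(self, dim=256, num_points=4):
        super().__init__()
        self.num_points = num_points
        self.offsets = nn.Linear(dim, num_points * 2)   # per-query (dx, dy) offsets
        self.weights = nn.Linear(dim, num_points)        # per-point attention weights
        self.value_proj = nn.Conv2d(dim, dim, 1)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, queries, ref_points, feat):
        # queries:    (B, N, C)    query embeddings
        # ref_points: (B, N, 2)    reference locations in [0, 1] x [0, 1]
        # feat:       (B, C, H, W) feature map to sample from
        B, N, C = queries.shape
        offsets = self.offsets(queries).view(B, N, self.num_points, 2)
        weights = self.weights(queries).softmax(-1)             # (B, N, K)
        # Sampling locations in grid_sample's normalized [-1, 1] coordinates.
        locs = (ref_points.unsqueeze(2) + 0.05 * offsets) * 2.0 - 1.0
        value = self.value_proj(feat)
        # grid_sample gathers K values per query: (B, C, N, K)
        sampled = F.grid_sample(value, locs, mode="bilinear",
                                align_corners=False)
        out = (sampled * weights.unsqueeze(1)).sum(-1)           # (B, C, N)
        return self.out_proj(out.transpose(1, 2))                # (B, N, C)


if __name__ == "__main__":
    m = DeformableSampling()
    q = torch.randn(2, 100, 256)
    ref = torch.rand(2, 100, 2)
    feat = torch.randn(2, 256, 32, 32)
    print(m(q, ref, feat).shape)  # torch.Size([2, 100, 256])
```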

3 · Depth-Guided Decoder

Object queries attend to 3D-aware positional embeddings generated by projecting 2D pixels into metric 3D space using the predicted depth and camera intrinsics.
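The sketch below shows one plausible way to build such embeddings: back-project a pixel grid into metric camera coordinates with the predicted depth and intrinsics, then map the 3D coordinates through a small MLP. The names and the MLP encoding are assumptions; the intrinsics in the demo are typical KITTI values used only for shape checking.

```python
import torch
import torch.nn as nn

def backproject(depth, K):
    """Lift every pixel to metric 3D camera coordinates using predicted depth
    and camera intrinsics. depth: (B, 1, H, W), K: (B, 3, 3)."""
    B, _, H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H, dtype=depth.dtype),
                          torch.arange(W, dtype=depth.dtype), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=0)          # (3, H, W)
    pix = pix.reshape(1, 3, -1).expand(B, -1, -1).to(depth.device)
    rays = torch.linalg.inv(K) @ pix                              # (B, 3, H*W)
    xyz = rays * depth.reshape(B, 1, -1)                          # scale rays by depth
    return xyz.reshape(B, 3, H, W)

class Pos3DEmbedding(nn.Module):
    """Maps per-pixel 3D coordinates to positional embeddings (a sketch of a
    3D-aware PE; the actual encoding used in MOBIUS-3D may differ)."""

    def __init__(self, dim=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, dim), nn.ReLU(inplace=True),
                                 nn.Linear(dim, dim))

    def forward(self, depth, K):
        xyz = backproject(depth, K)                               # (B, 3, H, W)
        return self.mlp(xyz.flatten(2).transpose(1, 2))           # (B, H*W, dim)


if __name__ == "__main__":
    depth = torch.rand(2, 1, 24, 80) * 50.0
    # Approximate KITTI intrinsics, used here only for the demo.
    K = torch.tensor([[721.5, 0.0, 609.6],
                      [0.0, 721.5, 172.9],
                      [0.0, 0.0, 1.0]]).unsqueeze(0).repeat(2, 1, 1)
    print(Pos3DEmbedding()(depth, K).shape)  # torch.Size([2, 1920, 256])
```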

Key Insight · Dense Depth Distillation

Standard monocular 3D detectors train depth with sparse LiDAR ground truth (< 5 % of pixels). We instead distill from DepthPro, a frozen teacher that provides calibrated, dense depth for every pixel. The teacher's predictions are aligned to metric scale with a least-squares (L2) scale-and-shift fit to the sparse LiDAR points. This single change drives the largest performance gain in the ablation study.
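A minimal sketch of this calibration and the resulting dense loss is shown below, assuming a per-image least-squares scale-and-shift fit and a plain L1 distillation term; the project's exact weighting and masking may differ.

```python
import torch
import torch.nn.functional as F

def align_scale_shift(teacher, lidar, mask):
    """Fit a per-image scale s and shift t so that s * teacher + t matches the
    sparse LiDAR depth in the least-squares (L2) sense, using only valid pixels.
    teacher, lidar, mask: (H, W) tensors; mask is True where LiDAR is valid."""
    x = teacher[mask]
    y = lidar[mask]
    A = torch.stack([x, torch.ones_like(x)], dim=1)        # (P, 2) design matrix
    sol = torch.linalg.lstsq(A, y.unsqueeze(1)).solution    # (2, 1): [s, t]
    s, t = sol[0, 0], sol[1, 0]
    return s * teacher + t                                   # metric-scale teacher depth

def distillation_loss(student, teacher_metric):
    """Dense supervision: penalize the student at every pixel, not only at
    sparse LiDAR returns (plain L1 here, as an illustrative choice)."""
    return F.l1_loss(student, teacher_metric)


if __name__ == "__main__":
    teacher = torch.rand(376, 1242) * 0.8 + 0.1     # relative-scale teacher depth
    lidar = teacher * 40.0 + 2.0                    # pretend metric ground truth
    mask = torch.rand(376, 1242) < 0.05             # ~5% of pixels have LiDAR
    metric = align_scale_shift(teacher, lidar, mask)
    student = torch.rand(376, 1242) * 50.0
    print(distillation_loss(student, metric).item())
```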

Results

Evaluated on the KITTI 3D Object Detection benchmark (Car class, IoU = 0.7). MOBIUS-3D achieves the best APBEV, the primary metric for 3D localization accuracy, across all difficulty levels.

Method                      AP3D (%)                 APBEV (%)
                            Easy    Mod.    Hard     Easy    Mod.    Hard
MonoDETR                    28.05   20.76   16.76    37.60   27.28   23.38
MonoDETRNext-A              32.95   25.01   21.92    –       –       –
MOBIUS-3D (Ours)            31.92   24.22   22.58    41.83   31.64   30.04

Key Takeaways

  • Best Hard AP3D (22.58 %) of all compared methods; Hard is the most discriminating difficulty level.
  • +4–7 points APBEV over MonoDETR across all difficulties, showing substantially better 3D localization.
  • Dense distillation is the key factor: switching from sparse LiDAR supervision to DepthPro distillation yields +7.71 points APBEV (Moderate), confirming that dense geometry from a foundation model far outperforms sparse ground truth.

Ablation: Depth Supervision Strategy

The choice of depth supervision signal is the single most impactful design decision. Dense distillation consistently dominates.

Depth Supervision            AP3D (%)                 APBEV (%)
                             Easy    Mod.    Hard     Easy    Mod.    Hard
None                         19.39   12.16   10.31    31.79   22.12   19.30
Object-Level Only            17.35   11.10   10.25    27.52   18.70   17.86
Sparse LiDAR                 23.15   16.69   15.54    31.42   23.93   22.15
Dense Distillation (Ours)    31.92   24.22   22.58    41.83   31.64   30.04