Open Vocabulary Monocular 3D Object Detection

1 University of Virginia 2 California Institute of Technology
3DV 2026
Zero-Shot Performance on In-the-Wild COCO Images

OVMono3D-LIFT's Zero-Shot Performance on In-the-Wild COCO Images. We display the 3D predictions overlaid on the images and the top-down views with a base grid of \(1\,\text{m} \times 1\,\text{m}\) tiles. For single-object images, only front views are displayed.

Abstract

We propose and study open-vocabulary monocular 3D object detection, a novel task that aims to detect objects of any category in metric 3D space from a single RGB image. Existing 3D object detectors either rely on costly sensors such as LiDAR or multi-view setups, or remain confined to closed-vocabulary settings with limited categories, restricting their applicability.

We identify two key challenges in this new setting. First, the scarcity of 3D bounding box annotations limits the ability to train generalizable models. To reduce dependence on 3D supervision, we propose a framework that effectively integrates pretrained 2D and 3D vision foundation models. Second, missing labels and semantic ambiguities (e.g., table vs. desk) in existing datasets hinder reliable evaluation. To address this, we design a novel metric that captures model performance while mitigating annotation issues.

Our approach achieves state-of-the-art results in zero-shot 3D detection of novel categories as well as in-domain detection on seen classes. We hope our method provides a strong baseline and our evaluation protocol establishes a reliable benchmark for future research.

Proposed Methods

(a) OVMono3D-GEO is a training-free method that predicts 3D detections from 2D via geometric unprojection. It exploits off-the-shelf depth estimation (e.g., UniDepth), segmentation (e.g., SAM), and an open-vocabulary 2D detector (e.g., Grounding DINO). (b) OVMono3D-LIFT is a learning-based approach that trains a class-agnostic neural network to lift 2D detections to 3D. Both approaches disentangle 2D recognition and localization from the estimation of 3D bounding boxes.
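To make the geometric unprojection concrete, below is a minimal NumPy sketch of the lifting step: pixels inside a predicted instance mask are back-projected through the camera intrinsics using a metric depth map, and a 3D box is fit to the resulting point cloud. The function name, the percentile-based outlier trimming, and the axis-aligned box simplification are illustrative assumptions, not the paper's exact implementation (which may estimate oriented boxes).

import numpy as np

def unproject_mask_to_3d_box(depth, mask, K):
    """Lift a 2D instance mask to a 3D box via geometric unprojection.

    depth: (H, W) metric depth map, e.g., from UniDepth.
    mask:  (H, W) boolean instance mask, e.g., from SAM.
    K:     (3, 3) camera intrinsic matrix.
    Returns (center, size) of an axis-aligned box in camera coordinates.
    """
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]

    v, u = np.nonzero(mask)            # pixel coordinates of the object
    z = depth[v, u]                    # metric depth at those pixels
    x = (u - cx) * z / fx              # pinhole back-projection
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=1)

    # Trim depth outliers (e.g., leaked background pixels) before fitting
    # the box; the percentile thresholds are an illustrative heuristic.
    lo, hi = np.percentile(points[:, 2], [2, 98])
    points = points[(points[:, 2] >= lo) & (points[:, 2] <= hi)]

    box_min, box_max = points.min(axis=0), points.max(axis=0)
    return (box_min + box_max) / 2.0, box_max - box_min

Because every step is an off-the-shelf model or closed-form geometry, this pipeline requires no 3D training data at all, which is what makes the GEO variant training-free.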

Qualitative Visualizations on the Omni3D Test Set

For each example, we present the predictions of Cube R-CNN and OVMono3D-LIFT, displaying both the 3D predictions overlaid on the image and a top-down view with a base grid of \(1\,\text{m} \times 1\,\text{m}\) tiles. Base categories are depicted with brown cubes, while novel categories are represented in other colors.

Target-Aware Evaluation

By prompting the detector only with categories that appear in the annotations, our target-aware evaluation mitigates the negative impact of missing annotations (e.g., "book" in (a)) and naming ambiguity (e.g., "vase" vs. "potted plant" and "chair" vs. "sofa").
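A minimal sketch of how such target-aware prompting could be implemented, assuming ground-truth annotations are grouped per image as a dict of category-name lists (an illustrative data structure, not the paper's exact API):

def target_aware_prompts(annotations_by_image):
    """Build per-image prompt lists for target-aware evaluation.

    annotations_by_image: dict mapping image_id -> list of ground-truth
    category names present in that image.
    Returns a dict mapping image_id -> the sorted, deduplicated category
    names used to prompt the open-vocabulary detector.
    """
    return {
        image_id: sorted(set(categories))
        for image_id, categories in annotations_by_image.items()
    }

# Example: the detector is never prompted with unlabeled categories, so a
# correct detection of an unannotated "book" cannot count as a false positive.
prompts = target_aware_prompts({
    "img_001": ["chair", "table", "chair"],
    "img_002": ["vase"],
})
assert prompts["img_001"] == ["chair", "table"]

Restricting the prompt set per image keeps the metric focused on 3D localization quality rather than penalizing the model for gaps or naming choices in the annotations.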

Citation

@misc{yao2024openvocabularymonocular3d,
      title={Open Vocabulary Monocular 3D Object Detection}, 
      author={Jin Yao and Hao Gu and Xuweiyi Chen and Jiayun Wang and Zezhou Cheng},
      year={2024},
      eprint={2411.16833},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2411.16833}, 
}