Open Vocabulary Monocular 3D Object Detection

1 University of Virginia 2 California Institute of Technology
Zero-Shot Performance on In-the-Wild COCO Images

OVMono3D-LIFT's Zero-Shot Performance on In-the-Wild COCO Images. We display 3D predictions overlaid on the images, along with top-down views on a base grid of \(1\,\text{m} \times 1\,\text{m}\) tiles. For single-object images, only front views are displayed.

Abstract

In this work, we pioneer the study of open-vocabulary monocular 3D object detection, a novel task that aims to detect and localize objects in 3D space from a single RGB image without limiting detection to a predefined set of categories.

We formalize this problem, establish baseline methods, and introduce a class-agnostic approach that leverages open-vocabulary 2D detectors and lifts 2D bounding boxes into 3D space. Our approach decouples the recognition and localization of objects in 2D from the task of estimating 3D bounding boxes, enabling generalization across unseen categories. Additionally, we propose a target-aware evaluation protocol to address inconsistencies in existing datasets, improving the reliability of model performance assessment.

Extensive experiments on the Omni3D dataset demonstrate the effectiveness of the proposed method in zero-shot 3D detection for novel object categories, validating its robust generalization capabilities. Our method and evaluation protocols contribute towards the development of open-vocabulary object detection models that can effectively operate in real-world, category-diverse environments.

Proposed Methods

(a) OVMono3D-GEO is a training-free method that predicts 3D detections from 2D via geometric unprojection. It exploits off-the-shelf depth estimation (e.g., Depth Pro), segmentation (e.g., SAM), and an open-vocabulary 2D detector (e.g., Grounding DINO). (b) OVMono3D-LIFT is a learning-based approach that trains a class-agnostic neural network to lift 2D detections to 3D. Both approaches disentangle the recognition and localization of objects in 2D from the estimation of 3D bounding boxes.
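The geometric unprojection in (a) can be illustrated with a minimal sketch: given a metric depth map, an instance mask, and the camera intrinsics, each masked pixel is unprojected through the pinhole model and the resulting point cloud is bounded by an axis-aligned 3D box. This is an assumption-laden toy version (the function name `unproject_box` and the axis-aligned box fit are illustrative, not the paper's exact procedure, which also handles orientation):

```python
import numpy as np

def unproject_box(depth, mask, K):
    """Lift a 2D detection to a 3D axis-aligned box (illustrative sketch).

    depth: (H, W) metric depth map (e.g., from a monocular depth estimator)
    mask:  (H, W) boolean instance mask (e.g., from a segmentation model)
    K:     (3, 3) camera intrinsic matrix
    Returns (center, size): box center and dimensions in camera coordinates.
    """
    v, u = np.nonzero(mask)                  # pixel coordinates inside the mask
    z = depth[v, u]                          # per-pixel metric depth
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    # Pinhole unprojection: X = (u - cx) * z / fx, Y = (v - cy) * z / fy
    pts = np.stack([(u - cx) * z / fx, (v - cy) * z / fy, z], axis=1)
    lo, hi = pts.min(axis=0), pts.max(axis=0)
    return (lo + hi) / 2.0, hi - lo          # center, size

# Toy example: a 10x10-pixel mask on a flat surface 2 m from the camera
K = np.array([[500.0, 0.0, 64.0],
              [0.0, 500.0, 64.0],
              [0.0,   0.0,  1.0]])
depth = np.full((128, 128), 2.0)
mask = np.zeros((128, 128), dtype=bool)
mask[59:69, 59:69] = True
center, size = unproject_box(depth, mask, K)
```

In this toy case the recovered box sits at 2 m depth with zero extent along the viewing axis, since the surface is flat; real depth maps yield a full 3D extent.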

Qualitative Visualizations on the Omni3D Test Set


For each example, we present the predictions of Cube R-CNN and OVMono3D-LIFT, displaying both the 3D predictions overlaid on the image and a top-down view on a base grid of \(1\,\text{m} \times 1\,\text{m}\) tiles. Base categories are depicted as brown cubes, while novel categories are shown in other colors.

Target-Aware Evaluation

(a) Naming ambiguity and missing annotations are common in existing benchmarks. In this example, the 3D annotations for "books" are missing, and the "shelves" are highly similar to "bookcases". (b) This leads to inaccurate performance assessment of open-vocabulary 3D detection under the standard evaluation. (c) Our target-aware evaluation resolves this issue by prompting the detector only with the categories present in each image's annotations.
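The core of the target-aware protocol can be sketched in a few lines: instead of querying the open-vocabulary detector with the full category vocabulary, each image is prompted only with the category names that actually appear in its ground-truth annotations, so predictions of unannotated (but present) objects are never counted as false positives. The function name and annotation schema below are illustrative assumptions, not the paper's exact implementation:

```python
def target_aware_prompts(annotations):
    """Build per-image prompt sets for target-aware evaluation (sketch).

    annotations: list of dicts with (assumed) keys "image_id" and "category".
    Returns: dict mapping image_id -> set of category names annotated in it.
    """
    prompts = {}
    for ann in annotations:
        # Only categories annotated in this image become detector prompts,
        # avoiding false positives from objects the benchmark never labeled.
        prompts.setdefault(ann["image_id"], set()).add(ann["category"])
    return prompts

# Toy annotation list: image 0 has books and a bookcase labeled, image 1 a chair
anns = [
    {"image_id": 0, "category": "books"},
    {"image_id": 0, "category": "bookcase"},
    {"image_id": 1, "category": "chair"},
]
prompts = target_aware_prompts(anns)
```

Here the detector would be queried on image 0 with only "books" and "bookcase", so an unannotated object (e.g., an unlabeled shelf) cannot be spuriously penalized.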

Citation

@misc{yao2024openvocabularymonocular3d,
      title={Open Vocabulary Monocular 3D Object Detection}, 
      author={Jin Yao and Hao Gu and Xuweiyi Chen and Jiayun Wang and Zezhou Cheng},
      year={2024},
      eprint={2411.16833},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2411.16833}, 
}