SAB3R: Semantic-Augmented Backbone in 3D Reconstruction

Xuweiyi Chen*, Tian Xia*, Sihan Xu, Jianing Yang, Joyce Chai, Zezhou Cheng
University of Virginia · University of Michigan
*Denotes Equal Contribution

SAB3R is a semantic-augmented backbone for 3D reconstruction that performs zero-shot, open-vocabulary segmentation and 3D reconstruction from unposed images in a single forward pass. It generates dense 2D foundation-model features while simultaneously predicting depth and semantic information, unlocking new capabilities for 3D understanding.

Abstract

The emergence of 3D vision foundation models (VFMs) represents a significant breakthrough in 3D computer vision. However, these models often lack robust semantic understanding due to the scarcity of 3D-language paired data. In contrast, 2D foundation models, trained on abundant data, excel in semantic tasks. In this work, we propose a novel distillation approach, SAB3R, that transfers dense, per-pixel semantic features from 2D VFMs to enhance 3D VFMs. Our method achieves 2D semantic-aware feature integration while retaining the spatial reasoning capabilities of 3D VFMs. We validate our approach by showing that distillation does not compromise the base 3D foundation model, as demonstrated through evaluations on depth estimation and multi-view pose regression. Additionally, we introduce a new task, Map and Locate, to showcase the novel capability of multi-view 3D open-vocabulary semantic segmentation. Finally, our experiments reveal that SAB3R maintains a robust understanding of 3D structures while markedly improving 2D semantic comprehension. These results highlight the effectiveness of our approach.

SAB3R: Multi-View Integration of 2D and 3D Representations

SAB3R integrates dense features from CLIP and DINO into a unified framework, enriching 3D representations with 2D semantic understanding. Each encoder-decoder pair processes multi-view images with shared weights, ensuring consistent feature extraction across views. The model generates depth, dense DINOv2, and dense CLIP features simultaneously, which are then utilized for multi-view 3D reconstruction and semantic segmentation. This architecture enables our base model MASt3R to achieve both geometric and semantic comprehension in a cohesive model.
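
For intuition, here is a minimal PyTorch sketch of this multi-head design: a shared decoder feature map is projected into per-pixel 3D points (depth), dense DINOv2-like features, and dense CLIP-like features. The module names, head designs, and feature dimensions are illustrative assumptions, not the released implementation.

  # Minimal sketch of the SAB3R-style multi-head prediction (illustrative only).
  import torch
  import torch.nn as nn

  class SemanticAugmentedHeads(nn.Module):
      """Projects shared decoder features into three dense outputs: per-pixel
      3D points (depth), dense DINOv2-like features, and dense CLIP-like features."""

      def __init__(self, dec_dim=768, dino_dim=1024, clip_dim=512):
          super().__init__()
          self.point_head = nn.Conv2d(dec_dim, 3, kernel_size=1)        # per-pixel 3D points
          self.dino_head = nn.Conv2d(dec_dim, dino_dim, kernel_size=1)  # dense DINOv2-like features
          self.clip_head = nn.Conv2d(dec_dim, clip_dim, kernel_size=1)  # dense CLIP-like features

      def forward(self, dec_feat):  # dec_feat: (B, dec_dim, H, W) decoder feature map
          return {
              "points": self.point_head(dec_feat),
              "dino": self.dino_head(dec_feat),
              "clip": self.clip_head(dec_feat),
          }

  # The same heads (shared weights) run on every view, so depth, DINO, and CLIP
  # maps for all views come out of a single forward pass.
  decoder_features = torch.randn(1, 768, 32, 32)
  outputs = SemanticAugmentedHeads()(decoder_features)
  print({k: v.shape for k, v in outputs.items()})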

MASt3R Framework

We distill dense features from CLIP and DINO into the MASt3R framework, enriching it with 2D semantic understanding. Each encoder-decoder pair operates on multi-view images, sharing weights and exchanging information to ensure consistent feature extraction across views. The model simultaneously generates depth, dense DINOv2, and dense CLIP features, which are then used for multi-view 3D reconstruction and semantic segmentation.
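
As a rough illustration of the distillation objective, the snippet below applies a per-pixel cosine loss between a predicted dense feature map and a frozen 2D teacher map (e.g., CLIP or DINOv2). The cosine formulation and bilinear resizing are assumptions made for this sketch; the exact loss used by SAB3R may differ.

  # Illustrative per-pixel feature distillation loss (assumed cosine objective).
  import torch
  import torch.nn.functional as F

  def dense_distillation_loss(student_feat, teacher_feat):
      """student_feat, teacher_feat: (B, C, H, W) dense feature maps.
      The frozen teacher map is resized to the student's resolution, then a
      per-pixel cosine loss is applied."""
      teacher_feat = F.interpolate(
          teacher_feat, size=student_feat.shape[-2:], mode="bilinear", align_corners=False
      )
      cos = F.cosine_similarity(student_feat, teacher_feat, dim=1)  # (B, H, W)
      return (1.0 - cos).mean()

  student = torch.randn(2, 512, 32, 32, requires_grad=True)
  teacher = torch.randn(2, 512, 24, 24)  # teacher resolution may differ from the student's
  dense_distillation_loss(student, teacher).backward()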

Additional Map and Locate Details

The Map and Locate framework is evaluated on the ScanNet dataset, a large-scale indoor scene dataset providing RGB-D sequences, camera poses, and semantic and instance annotations. For our experiments, we selected 10 scenes from the validation split, featuring diverse object layouts and camera trajectories. Across these scenes, there are 436 objects with semantic and instance-level ground truth annotations.

For evaluation, we constructed 60 image groups in total: each of the 10 scenes contributes two groups at each group size of 2, 3, and 4 images (10 × 3 × 2 = 60). Image selection followed these criteria (a sketch of the grouping protocol appears after the list):

  • Object visibility: Objects within each group are visible across multiple images for reliable localization and mapping.
  • Viewpoint diversity: Images are selected from varying camera viewpoints to test robustness to occlusion and perspective changes.
Each group is paired with its corresponding RGB images, depth maps, camera poses (intrinsics and extrinsics), and semantic and instance labels, providing a comprehensive benchmark for evaluating mapping accuracy and object localization performance.
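
To make the grouping bookkeeping concrete, the sketch below enumerates groups under the stated protocol (10 scenes, two groups per scene for each group size of 2, 3, and 4 images). The scene IDs, frame lists, and the is_valid_group predicate are hypothetical placeholders for the visibility and viewpoint-diversity checks above, not the actual selection code.

  # Hypothetical sketch of the Map and Locate group construction.
  import itertools

  SCENES = [f"scene{i:04d}_00" for i in range(10)]  # 10 ScanNet validation scenes (IDs illustrative)
  GROUP_SIZES = (2, 3, 4)
  GROUPS_PER_SIZE = 2  # 10 scenes * 3 sizes * 2 groups = 60 groups

  def is_valid_group(frames):
      """Placeholder for the two criteria: shared object visibility across the
      frames and sufficiently diverse camera viewpoints."""
      return True

  def build_groups(scene_frames):
      groups = []
      for scene, frames in scene_frames.items():
          for size in GROUP_SIZES:
              picked = 0
              for candidate in itertools.combinations(frames, size):
                  if picked == GROUPS_PER_SIZE:
                      break
                  if is_valid_group(candidate):
                      groups.append({"scene": scene, "frames": candidate})
                      picked += 1
      return groups

  dummy_frames = {s: [f"frame_{i:05d}" for i in range(0, 200, 20)] for s in SCENES}
  print(len(build_groups(dummy_frames)))  # 60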

The dataset statistics visualization below illustrates camera translation differences and rotation differences at different group levels. Translation differences are computed as the Euclidean distance between translation vectors, while rotation differences are calculated as the geodesic distance on the rotation space \( SO(3) \). As the number of views increases, these differences grow, showcasing the variability in camera poses. Despite this variability, our framework demonstrates consistent performance across all group levels, highlighting its robustness.

Camera Distributions

Camera Distributions. Camera translation differences and rotation differences at different group levels.
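
For reference, the two pose-difference statistics plotted above can be computed as in the following NumPy sketch: translation differences as Euclidean distances between translation vectors, and rotation differences as geodesic distances on \( SO(3) \) via the standard trace formula.

  # Pose-difference statistics: Euclidean translation distance and SO(3) geodesic distance.
  import numpy as np

  def translation_difference(t1, t2):
      """Euclidean distance between two camera translation vectors of shape (3,)."""
      return np.linalg.norm(t1 - t2)

  def rotation_difference(R1, R2):
      """Geodesic distance (radians) between two rotation matrices on SO(3)."""
      R_rel = R1.T @ R2
      cos_angle = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
      return np.arccos(cos_angle)

  # Example: identity vs. a 30-degree rotation about the z-axis.
  theta = np.deg2rad(30)
  Rz = np.array([[np.cos(theta), -np.sin(theta), 0],
                 [np.sin(theta),  np.cos(theta), 0],
                 [0, 0, 1]])
  print(rotation_difference(np.eye(3), Rz))  # ~0.5236 rad (30 degrees)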

Sparse View Performance Comparison

This section presents the performance comparison of different methods across sparse view configurations (2, 3, and 4 views). Metrics include mean Intersection over Union (mIoU), Accuracy, Mean Completeness (Comp.), Median Completeness, and Inference Time. Inference time includes both reconstruction and CLIP feature extraction. Our method, SAB3R, consistently outperforms the baseline across all configurations, demonstrating its robustness and efficiency in integrating semantic and geometric understanding.

Sparse View = 2
Model         mIoU    Acc.    Comp.   Median Comp.   Time (s)
Baseline       4.57   18.10   0.64    0.67            3.92
SAB3R (C)     17.26   41.11   0.73    0.75            1.92
SAB3R (CD)    17.50   42.72   0.73    0.76            2.54

Sparse View = 3
Model         mIoU    Acc.    Comp.   Median Comp.   Time (s)
Baseline       6.03   21.26   0.68    0.71           22.74
SAB3R (C)     22.83   53.19   0.78    0.81            7.49
SAB3R (CD)    22.94   52.86   0.77    0.80            8.67

Sparse View = 4
Model         mIoU    Acc.    Comp.   Median Comp.   Time (s)
Baseline       5.12   19.31   0.68    0.70           36.72
SAB3R (C)     19.92   48.07   0.77    0.80            9.97
SAB3R (CD)    20.31   46.26   0.75    0.78           12.15

Performance comparison across sparse views. This table reports mIoU, accuracy, completeness metrics, and inference time for various configurations of sparse views (2, 3, and 4). SAB3R methods demonstrate superior performance across all metrics.
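
For readers implementing the benchmark, the sketch below shows one common way to compute these metrics: per-class IoU over point labels averaged into mIoU, overall point-label accuracy, and completeness as the fraction of ground-truth points lying within a distance threshold of the reconstruction. The threshold and the exact definitions here are assumptions and may differ from those used in the paper.

  # Assumed formulations of the reported metrics (not the official evaluation code).
  import numpy as np
  from scipy.spatial import cKDTree

  def miou_and_accuracy(pred_labels, gt_labels, num_classes):
      """Per-class IoU averaged into mIoU, plus overall point-label accuracy."""
      ious = []
      for c in range(num_classes):
          inter = np.sum((pred_labels == c) & (gt_labels == c))
          union = np.sum((pred_labels == c) | (gt_labels == c))
          if union > 0:
              ious.append(inter / union)
      return float(np.mean(ious)), float(np.mean(pred_labels == gt_labels))

  def completeness(pred_points, gt_points, threshold=0.1):
      """Fraction of GT points whose nearest reconstructed point is within `threshold` (meters assumed)."""
      dists, _ = cKDTree(pred_points).query(gt_points)
      return float(np.mean(dists < threshold))

  pred_pts, gt_pts = np.random.rand(5000, 3), np.random.rand(2000, 3)
  print(completeness(pred_pts, gt_pts, threshold=0.05))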

3D Point Cloud Visualization

Explore the 3D point cloud reconstruction using the interactive viewer below. The visualization demonstrates the robustness of the SAB3R framework, enabling detailed geometric and semantic understanding of 3D structures. Click and drag to navigate the scene!

The viewer provides three rendering modes: RGB shows the raw reconstructed point cloud, CLIP is colored using PCA, and DINO is colored based on predicted semantic features. Five scenes are available, each offering different views and configurations of the 3D point cloud.
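
The PCA coloring used for the feature modes can be reproduced in a few lines: project each point's high-dimensional feature vector onto its first three principal components and rescale each component to [0, 1] for use as RGB. The library choice and normalization below are assumptions made for this sketch.

  # PCA-based coloring of per-point features (e.g., dense CLIP features) for visualization.
  import numpy as np
  from sklearn.decomposition import PCA

  def features_to_rgb(features):
      """features: (N, C) per-point feature vectors -> (N, 3) RGB colors in [0, 1]."""
      proj = PCA(n_components=3).fit_transform(features)  # (N, 3) principal components
      lo, hi = proj.min(axis=0), proj.max(axis=0)
      return (proj - lo) / (hi - lo + 1e-8)               # rescale each channel to [0, 1]

  colors = features_to_rgb(np.random.randn(10000, 512))
  print(colors.shape, colors.min(), colors.max())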

Citation


  @article{SAB3R,
    author    = {Xuweiyi Chen and Tian Xia and Sihan Xu and Jianing Yang and Joyce Chai and Zezhou Cheng},
    title     = {SAB3R: Semantic-Augmented Backbone in 3D Reconstruction},
    year      = {2024},
    note      = {Equal contribution by Xuweiyi Chen and Tian Xia},
    institution = {University of Virginia, University of Michigan},
  }