Key Insights and Findings: (a) MoE enables scalable cross-domain generalization without domain labels. (b) Replacing FFNs in PTv3 with sparse MoE layers boosts both performance and efficiency. (c) Experts specialize organically according to the underlying data distribution.
While scaling laws have transformed natural language processing and computer vision, 3D point cloud understanding has yet to reach that stage. This lag can be attributed both to the comparatively small scale of 3D datasets and to the disparate sources of the data itself. Point clouds are captured by diverse sensors (e.g., depth cameras, LiDAR) across varied domains (e.g., indoor, outdoor), each introducing unique scanning patterns, sampling densities, and semantic biases. Such domain heterogeneity poses a major barrier to training unified models at scale, especially under the realistic constraint that domain labels are typically inaccessible at inference time. In this work, we propose Point-MoE, a Mixture-of-Experts architecture designed to enable large-scale, cross-domain generalization in 3D perception. We show that standard point cloud backbones degrade significantly in performance when trained on mixed-domain data, whereas Point-MoE, with a simple top-k routing strategy, can automatically specialize experts even without access to domain labels. Our experiments demonstrate that Point-MoE not only outperforms strong multi-domain baselines but also generalizes better to unseen domains. This work highlights a scalable path forward for 3D understanding: letting the model discover structure in diverse 3D data, rather than imposing it via manual curation or domain supervision.
Given a 3D scene represented as a point cloud \( \mathcal{P} = \{p_i\}_{i=1}^{n} \), where each point \( p_i \in \mathbb{R}^3 \) denotes coordinates, semantic segmentation aims to assign a single class label \( \hat{y}_i \in \mathcal{C} \) to each point \( p_i \) from a fixed label set \( \mathcal{C} = \{c_1, c_2, \ldots, c_m\} \). Multi-domain data are utilized during training. Let a domain containing \( N \) point clouds be \( \mathcal{D} = \big\{ \{(p_i, y_i)\}_{i=1}^{n_j} \big\}_{j=1}^{N} \), where \( n_j \) is the number of points in the \( j \)-th cloud. The complete training set across \( d \) domains can then be represented as \( \{\mathcal{D}_k\}_{k=1}^{d} \). The goal of multi-domain joint training is to learn a unified model over all domains that minimizes prediction error across \( \{\mathcal{D}_k\} \), while also generalizing effectively to domains unseen during training.
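For completeness, the joint objective can be written as a standard empirical-risk formulation in the notation above, where \( f_\theta \) is the unified model and \( \ell \) is a per-point loss such as cross-entropy (a generic sketch that may differ from the paper's exact objective, e.g., in how domains are weighted):

\[
\min_{\theta} \; \sum_{k=1}^{d} \; \mathbb{E}_{(\mathcal{P},\, \mathcal{Y}) \sim \mathcal{D}_k} \left[ \frac{1}{n} \sum_{i=1}^{n} \ell\big( f_{\theta}(\mathcal{P})_i,\; y_i \big) \right]
\]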
We adopt a minimal Mixture-of-Experts design that closely follows the standard architecture prevalent in the recent NLP literature. This simplicity allows us to leverage existing scalable Transformer-based MoE implementations with minimal modification. For the base model, we use Point Transformer V3 (PTv3)~\cite{ptv3}, a state-of-the-art architecture for point cloud understanding. Below we introduce the architecture of Point-MoE in detail.
Mixture-of-Experts (MoE) Layer. Point-MoE routes input tokens to a sparse subset of expert networks using a lightweight gating function. For each input feature vector, a top-k subset of experts is selected and combined via weighted averaging. This allows the model to scale capacity while maintaining efficiency. We exclude auxiliary load-balancing losses, finding them unnecessary in practice.
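As a concrete illustration, below is a minimal PyTorch sketch of such a top-k MoE layer. The number of experts, the value of k, and the expert width are placeholder choices, not the paper's hyperparameters:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Sparse MoE layer: each token is processed by its top-k experts and the
    expert outputs are combined by gate-weighted averaging."""
    def __init__(self, dim, num_experts=8, k=2, expansion=4):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(dim, num_experts, bias=False)  # lightweight gating function
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, expansion * dim), nn.GELU(),
                          nn.Linear(expansion * dim, dim))
            for _ in range(num_experts)
        ])

    def forward(self, x):                      # x: (..., dim)
        shape = x.shape
        flat = x.reshape(-1, shape[-1])        # flatten points into a token list
        topk_val, topk_idx = self.gate(flat).topk(self.k, dim=-1)
        weights = F.softmax(topk_val, dim=-1)  # renormalize over the selected experts
        out = torch.zeros_like(flat)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e
                if mask.any():                 # each expert runs only on its routed tokens
                    out[mask] += weights[mask, slot, None] * expert(flat[mask])
        return out.reshape(shape)
```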
Integration into PTv3. We integrate MoE into Point Transformer V3 (PTv3) by replacing the feed-forward network (FFN) in each block with a sparse MoE layer; all other components remain unchanged. This lets experts specialize in domain-specific transformations while preserving the original PTv3 structure. Proper placement of the MoE layer relative to normalization is crucial to performance.
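Schematically, the swap keeps the block's attention and residual structure intact and substitutes only the FFN. The sketch below is a simplified stand-in for a PTv3 block, reusing the TopKMoE layer sketched above; it uses ordinary multi-head self-attention rather than PTv3's serialized sparse attention, and the pre-norm arrangement shown is one common choice rather than the paper's verified placement:

```python
import torch.nn as nn

class MoEBlock(nn.Module):
    """Transformer-style block with the dense FFN replaced by a sparse MoE layer.
    A simplified stand-in for a PTv3 block, not a faithful reimplementation."""
    def __init__(self, dim, num_heads=8, num_experts=8, k=2):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.moe = TopKMoE(dim, num_experts=num_experts, k=k)  # from the sketch above

    def forward(self, x):                   # x: (batch, num_points, dim)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.moe(self.norm2(x))     # MoE replaces the dense FFN
        return x
```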
Language-Guided Classification. To bridge label gaps across datasets, we follow prior work and align point features with CLIP text embeddings of class names. This enables supervision across datasets with mismatched taxonomies (e.g., “pillow” existing in Structured3D but not ScanNet).
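A minimal sketch of such a language-guided head, assuming precomputed CLIP text embeddings for the union of class names across datasets; the projection, normalization, and temperature are illustrative choices, not the paper's exact setup:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageGuidedHead(nn.Module):
    """Classify points by cosine similarity to CLIP text embeddings of class names."""
    def __init__(self, point_dim, text_embeds, temperature=0.07):
        # text_embeds: (num_classes, clip_dim) CLIP embeddings of the class
        # names pooled across all training datasets
        super().__init__()
        self.proj = nn.Linear(point_dim, text_embeds.shape[1])  # map points into CLIP space
        self.register_buffer("text", F.normalize(text_embeds, dim=-1))
        self.temperature = temperature

    def forward(self, point_feats):         # point_feats: (num_points, point_dim)
        z = F.normalize(self.proj(point_feats), dim=-1)
        # Logits over the shared label space; during training one would mask
        # out classes absent from the source dataset's taxonomy.
        return z @ self.text.t() / self.temperature
```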
Domain-Aware Gating. We explore using domain embeddings to guide expert selection during training. By concatenating input features with a learnable embedding for each dataset, the gating network can more easily specialize experts along domain boundaries, accelerating convergence and improving separation between domains.
Domain Randomization for Robustness. To remove reliance on domain labels at inference time, we introduce a generic domain embedding. During training, 20% of points are randomly assigned this embedding to encourage domain-agnostic routing. This improves robustness to unseen domains while retaining strong performance on seen datasets.
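Combining the two mechanisms above, here is a minimal sketch of domain-aware gating with domain randomization. The 20% rate comes from the text; the embedding width and the concatenation scheme are assumptions:

```python
import torch
import torch.nn as nn

class DomainAwareGate(nn.Module):
    """Gating network conditioned on a learnable per-dataset embedding.
    Index 0 is reserved for the generic (domain-agnostic) embedding used
    at inference when the source domain is unknown."""
    def __init__(self, dim, num_domains, domain_dim=16, num_experts=8, p_generic=0.2):
        super().__init__()
        self.domain_embed = nn.Embedding(num_domains + 1, domain_dim)  # +1 for generic
        self.gate = nn.Linear(dim + domain_dim, num_experts, bias=False)
        self.p_generic = p_generic

    def forward(self, x, domain_id=0):      # x: (num_points, dim); id 0 = generic
        ids = torch.full((x.shape[0],), domain_id, dtype=torch.long, device=x.device)
        if self.training:
            # Domain randomization: route ~20% of points with the generic embedding
            ids[torch.rand(x.shape[0], device=x.device) < self.p_generic] = 0
        return self.gate(torch.cat([x, self.domain_embed(ids)], dim=-1))  # expert logits
```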
Training efficiency and validation mIoU. The figure below shows the training loss and validation mIoU curves for four models: the baseline PTv3-L, its improved variant PTv3-Mix-LN (which incorporates mixed-domain batches and LayerNorm), PPT-L, and our proposed Point-MoE-L. All models are trained from scratch across multiple domains. Point-MoE-L converges faster and achieves strong validation mIoU without explicit dataset labels, matching or exceeding PPT-L, which is trained with ground-truth domain labels. While all models reach similar training loss, only Point-MoE-L and PPT-L generalize effectively, reinforcing that low training loss is not indicative of strong cross-domain performance. On ScanNet, Structured3D, Matterport3D, and nuScenes, Point-MoE-L improves consistently and avoids the early plateaus seen in PPT-L, especially on Structured3D, suggesting stronger long-term learning. PTv3-L fails to generalize and exhibits unstable validation curves.
Expert choice visualization. The figure below shows expert assignments for a validation scene at selected layers. In (a), early encoder layers rely heavily on geometric cues for routing: green experts are consistently activated along object boundaries, such as the edges of desks and chairs, while red experts dominate flat surfaces. In (b) and (c), the decoder layers exhibit more semantically meaningful expert selection (likely due to their proximity to the loss function), with distinct experts attending to objects such as desks, chairs, floors, and walls. In (d), we examine an outdoor scene with sparse LiDAR data. Despite limited geometric structure, the model still organizes routing meaningfully: nearby points are routed to blue experts, while farther points activate red experts. We include more visualizations in the appendix for completeness. We also note occasional visual artifacts in which isolated points are assigned different experts than their neighbors; these may be related to PTv3's architectural choices, such as point serialization or positional encoding.
Token pathways. To understand how Point-MoE adapts to diverse domains, we analyze expert routing behavior at the token level. Specifically, we track the top-1 expert assignment for each token across all MoE layers and construct full routing trajectories from the final layer back to the first. We then identify the 100 most frequent expert paths across all tokens. As shown in the next figure, encoder expert paths are substantially less diverse than decoder paths, indicating that the encoder performs more domain-agnostic processing. Interestingly, we also observe a sparse routing pattern in deeper encoder layers. We attribute this not to feature reuse but to the U-Net-style design: deeper encoder stages operate on progressively down-sampled, sparser token sets, which reduces the variability of routing decisions.
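The path statistics can be reproduced with simple bookkeeping over logged routing decisions. The sketch below assumes a (num_layers, num_tokens) array of top-1 expert indices, and it ignores the detail that the token set changes across U-Net stages due to pooling, which would require mapping pooled tokens back to points:

```python
from collections import Counter
import numpy as np

def top_expert_paths(assignments: np.ndarray, top_n: int = 100):
    """assignments: (num_layers, num_tokens) top-1 expert index per token and
    MoE layer, ordered from the final layer back to the first.
    Returns the top_n most frequent full routing trajectories with counts."""
    paths = [tuple(assignments[:, t]) for t in range(assignments.shape[1])]
    return Counter(paths).most_common(top_n)
```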
When examining domain-level trends, we find that certain dataset pairs, such as SemanticKITTI and nuScenes or ScanNet and Structured3D, share similar expert pathways, suggesting that Point-MoE implicitly clusters domains with related geometric or semantic structure. To quantify these observations, we compute the Jensen-Shannon divergence (JSD) between expert selection distributions across datasets at each MoE layer. JSD is an entropy-based measure of divergence between the per-dataset routing distributions, weighted by their token proportions; its formal definition is provided in the supplementary. As shown in the next figure (right), decoder layers exhibit significantly higher JSD, indicating stronger domain-specific specialization. Several encoder layers also display nontrivial JSD, underscoring the benefit of placing MoE throughout the network.
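For reference, here is a sketch of the standard token-proportion-weighted (generalized) Jensen-Shannon divergence; the exact definition used in the paper is given in its supplementary and may differ in details:

```python
import numpy as np

def weighted_jsd(dists: np.ndarray, weights: np.ndarray) -> float:
    """dists: (num_datasets, num_experts) rows are per-dataset expert-selection
    distributions; weights: token proportions per dataset, summing to 1.
    JSD = H(sum_k w_k P_k) - sum_k w_k H(P_k), with entropy H in nats."""
    def entropy(p):
        p = p[p > 0]                       # ignore zero-probability experts
        return float(-np.sum(p * np.log(p)))
    mixture = weights @ dists              # token-weighted mixture distribution
    return entropy(mixture) - sum(w * entropy(p) for w, p in zip(weights, dists))
```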
| Visualization Type | Description | Link |
| --- | --- | --- |
| Expert Choice Visualization on Matterport3D | Visualization on each layer of the encoder and decoder. | View |
| Expert Choice Visualization on nuScenes | Visualization on each layer of the encoder and decoder. | View |
| Expert Choice Visualization on S3DIS | Visualization on each layer of the encoder and decoder. | View |
| Expert Choice Visualization on ScanNet | Visualization on each layer of the encoder and decoder. | View |
| Expert Choice Visualization on SemanticKITTI | Visualization on each layer of the encoder and decoder. | View |
| Expert Choice Visualization on Structured3D | Visualization on each layer of the encoder and decoder. | View |
| Expert Choice Visualization on Waymo | Visualization on each layer of the encoder and decoder. | View |
@article{chen2025pointmoecrossdomaingeneralization3d,
  title={Point-MoE: Towards Cross-Domain Generalization in 3D Semantic Segmentation via Mixture-of-Experts},
  author={Xuweiyi Chen and Wentao Zhou and Aruni RoyChowdhury and Zezhou Cheng},
  year={2025},
  eprint={2505.23926},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2505.23926},
}