GaussianCross: Cross-modal Self-supervised 3D Representation Learning via Gaussian Splatting

ACM MM 2025

1The Hong Kong Polytechnic University   2Huazhong University of Science and Technology

TL;DR:

We present GaussianCross, a novel cross-modal self-supervised 3D representation learning architecture integrating feed-forward 3DGS techniques.

Teaser Image

Performance comparison of GaussianCross on 3D scene understanding tasks. GaussianCross achieves superior performance across various tasks and evaluation protocols, including semantic segmentation (Sem. Seg.), instance segmentation (Ins. Seg.), and linear probing. Left: full fine-tuning results on downstream tasks. Right: linear probing accuracy.

Abstract

The significance of informative and robust point representations has been widely acknowledged for 3D scene understanding. Although existing self-supervised pre-training methods demonstrate promising performance, model collapse and structural information deficiency remain prevalent due to insufficient point discrimination difficulty, yielding unreliable representations and suboptimal performance. In this paper, we present GaussianCross, a novel cross-modal self-supervised 3D representation learning architecture integrating feed-forward 3D Gaussian Splatting (3DGS) techniques to address these challenges. GaussianCross seamlessly converts scale-inconsistent 3D point clouds into a unified cuboid-normalized Gaussian representation without losing details, enabling stable and generalizable pre-training. Subsequently, a tri-attribute adaptive distillation splatting module is incorporated to construct a 3D feature field, facilitating the synergistic capture of appearance, geometry, and semantic cues to maintain cross-modal consistency. To validate GaussianCross, we perform extensive evaluations on various benchmarks, including ScanNet, ScanNet200, and S3DIS. In particular, GaussianCross exhibits strong parameter and data efficiency, achieving superior performance through linear probing (0.1% of parameters) and limited-data training (1% of scenes) compared to state-of-the-art methods. Furthermore, GaussianCross demonstrates strong generalization capabilities, improving full fine-tuning accuracy by 9.3% mIoU and 6.1% AP50 on ScanNet200 semantic and instance segmentation tasks, respectively, supporting the effectiveness of our approach.
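To make the cuboid-normalization idea concrete, the minimal PyTorch sketch below maps a scale-inconsistent point cloud into a unit cuboid and initializes coarse Gaussian primitives at the normalized points. The function names (cuboid_normalize, init_gaussians) and the placeholder scale/opacity values are illustrative assumptions, not the paper's exact implementation.

import torch

def cuboid_normalize(points: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Map a scale-inconsistent point cloud of shape (N, 3) into the cuboid [-1, 1]^3.
    lo = points.min(dim=0).values
    hi = points.max(dim=0).values
    center = (lo + hi) / 2.0
    extent = (hi - lo).clamp_min(eps)
    return 2.0 * (points - center) / extent   # per-axis scaling into the unit cuboid

def init_gaussians(points: torch.Tensor) -> dict:
    # Coarse Gaussian primitives initialized at the normalized points (placeholder values).
    n = points.shape[0]
    rotations = torch.zeros(n, 4)
    rotations[:, 0] = 1.0                      # identity quaternions (w, x, y, z)
    return {
        "means": points,                       # primitive centers
        "scales": torch.full((n, 3), 0.01),    # small isotropic scales
        "rotations": rotations,
        "opacities": torch.full((n, 1), 0.1),  # low initial opacity
    }

# Example: normalize a raw scan and build the initial primitives.
raw_points = torch.rand(1024, 3) * 10.0        # dummy scene-scale points
gaussians = init_gaussians(cuboid_normalize(raw_points))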

Pipeline

Pipeline

The overall architecture of GaussianCross. The pipeline begins with cuboid-normalized Gaussian initialization to establish coarse primitive means. Gaussian properties, together with a feature field, are subsequently decoded by the decoder G. Tri-attribute adaptive distillation splatting is then performed to ensure cross-modal consistency.
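The tri-attribute distillation objective can be sketched roughly as follows (PyTorch). Given appearance, depth, and semantic feature maps splatted from the Gaussians, it compares them against a posed image, a depth target, and 2D foundation-model features so that the 3D feature field stays consistent across modalities. The loss terms, weights, and the name tri_attribute_distillation_loss are assumptions for illustration, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def tri_attribute_distillation_loss(
    rendered_rgb: torch.Tensor,    # (3, H, W)  splatted appearance
    rendered_depth: torch.Tensor,  # (1, H, W)  splatted geometry
    rendered_feat: torch.Tensor,   # (C, H, W)  splatted semantic features
    target_rgb: torch.Tensor,      # (3, H, W)  posed image
    target_depth: torch.Tensor,    # (1, H, W)  depth target
    target_feat: torch.Tensor,     # (C, H, W)  2D foundation-model features
    w_rgb: float = 1.0,
    w_depth: float = 1.0,
    w_feat: float = 1.0,
) -> torch.Tensor:
    # Appearance and geometry terms: photometric and depth reconstruction.
    loss_rgb = F.l1_loss(rendered_rgb, target_rgb)
    loss_depth = F.l1_loss(rendered_depth, target_depth)
    # Semantic term: per-pixel cosine distance between feature vectors.
    cos = F.cosine_similarity(rendered_feat, target_feat, dim=0)   # (H, W)
    loss_feat = (1.0 - cos).mean()
    return w_rgb * loss_rgb + w_depth * loss_depth + w_feat * loss_feat

# Example with dummy tensors (H = W = 64, C = 16 feature channels).
loss = tri_attribute_distillation_loss(
    torch.rand(3, 64, 64), torch.rand(1, 64, 64), torch.rand(16, 64, 64),
    torch.rand(3, 64, 64), torch.rand(1, 64, 64), torch.rand(16, 64, 64),
)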

Experiments

Linear Probing Results

Parameter efficiency via linear probing.

Data Efficiency Results

Data efficiency with limited scenes and point annotations.

Semantic Segmentation Results

Full fine-tuning semantic segmentation results.

Instance Segmentation Results

Full fine-tuning instance segmentation results.

Visualization

ScanNet UMAP Visualization

We visualize the input point cloud and the learned point representations using UMAP. We also present the corresponding rendered images, depth maps, and semantic-aware feature maps.

ScanNet Tri-attribute Visualization

Qualitative comparison of GaussianCross rendered images, depth, and semantic-aware feature maps with ground truth.

ScanNet Cosine Similarity Visualization

Visualization of activation maps of cosine similarity scores on ScanNet. The query points are highlighted with red cross marks.

S3DIS Cosine Similarity Visualization

Zero-shot representation of GaussianCross on S3DIS. The query points are highlighted with red circles.

ScanNet++ Cosine Similarity Visualization

Zero-shot representation of GaussianCross on ScanNet++. The query points are highlighted with red circles.

Acknowledgment 👏

The research work was conducted in the JC STEM Lab of Machine Learning and Computer Vision funded by The Hong Kong Jockey Club Charities Trust.

BibTeX 📝

@article{yao2025gaussian,
    title={GaussianCross: Cross-modal Self-supervised 3D Representation Learning via Gaussian Splatting},
    author={Yao, Lei and Wang, Yi and Zhang, Yi and Liu, Moyun and Chau, Lap-Pui},
    journal={xxx},
    year={2025},
    publisher={xxx}
}