GaussianCross: Cross-modal Self-supervised 3D Representation Learning via Gaussian Splatting

ACM MM 2025

1The Hong Kong Polytechnic University   2Huazhong University of Science and Technology

TL;DR:

We present GaussianCross, a novel cross-modal self-supervised 3D representation learning architecture integrating feed-forward 3DGS techniques.

Teaser Image

Performance comparison of GaussianCross on 3D scene understanding tasks. GaussianCross achieves superior performance across various tasks and evaluation protocols, including semantic segmentation (Sem. Seg.), instance segmentation (Ins. Seg.), and linear probing. Left: full fine-tuning results on downstream tasks. Right: linear probing accuracy.

Abstract

The significance of informative and robust point representations has been widely acknowledged for 3D scene understanding. Although existing self-supervised pre-training methods demonstrate promising performance, model collapse and structural information deficiency remain prevalent due to insufficient point discrimination difficulty, yielding unreliable representations and suboptimal performance. In this paper, we present GaussianCross, a novel cross-modal self-supervised 3D representation learning architecture integrating feed-forward 3D Gaussian Splatting (3DGS) techniques to address these challenges. GaussianCross seamlessly converts scale-inconsistent 3D point clouds into a unified cuboid-normalized Gaussian representation without losing details, enabling stable and generalizable pre-training. Subsequently, a tri-attribute adaptive distillation splatting module is incorporated to construct a 3D feature field, facilitating the synergistic capture of appearance, geometry, and semantic cues to maintain cross-modal consistency. To validate GaussianCross, we perform extensive evaluations on various benchmarks, including ScanNet, ScanNet200, and S3DIS. In particular, GaussianCross exhibits strong parameter and data efficiency, achieving superior performance through linear probing (0.1% of parameters) and limited-data training (1% of scenes) compared to state-of-the-art methods. Furthermore, GaussianCross demonstrates strong generalization capabilities, improving full fine-tuning accuracy by 9.3% mIoU and 6.1% AP50 on ScanNet200 semantic and instance segmentation tasks, respectively, supporting the effectiveness of our approach.
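To make the cuboid-normalization idea concrete, the minimal PyTorch sketch below maps a scale-inconsistent point cloud into a unit cuboid and initializes coarse Gaussian primitives at the normalized points. The function names (cuboid_normalize, init_gaussians) and the placeholder scale/opacity values are illustrative assumptions, not the paper's exact implementation.

import torch

def cuboid_normalize(points: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Map a scale-inconsistent point cloud of shape (N, 3) into the cuboid [-1, 1]^3.
    lo = points.min(dim=0).values
    hi = points.max(dim=0).values
    center = (lo + hi) / 2.0
    extent = (hi - lo).clamp_min(eps)
    return 2.0 * (points - center) / extent   # per-axis scaling into the unit cuboid

def init_gaussians(points: torch.Tensor) -> dict:
    # Coarse Gaussian primitives initialized at the normalized points (placeholder values).
    n = points.shape[0]
    rotations = torch.zeros(n, 4)
    rotations[:, 0] = 1.0                      # identity quaternions (w, x, y, z)
    return {
        "means": points,                       # primitive centers
        "scales": torch.full((n, 3), 0.01),    # small isotropic scales
        "rotations": rotations,
        "opacities": torch.full((n, 1), 0.1),  # low initial opacity
    }

# Example: normalize a raw scan and build the initial primitives.
raw_points = torch.rand(1024, 3) * 10.0        # dummy scene-scale points
gaussians = init_gaussians(cuboid_normalize(raw_points))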

Pipeline

Pipeline

The overall architecture of GaussianCross. The pipeline begins with cuboid-normalized Gaussian initialization to establish coarse primitive means. Gaussian properties, together with a feature field, are subsequently decoded by the decoder G. Tri-attribute adaptive distillation splatting is then performed to ensure cross-modal consistency.
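The tri-attribute distillation objective can be sketched roughly as follows (PyTorch). Given appearance, depth, and semantic feature maps splatted from the Gaussians, it compares them against a posed image, a depth target, and 2D foundation-model features so that the 3D feature field stays consistent across modalities. The loss terms, weights, and the name tri_attribute_distillation_loss are assumptions for illustration, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def tri_attribute_distillation_loss(
    rendered_rgb: torch.Tensor,    # (3, H, W)  splatted appearance
    rendered_depth: torch.Tensor,  # (1, H, W)  splatted geometry
    rendered_feat: torch.Tensor,   # (C, H, W)  splatted semantic features
    target_rgb: torch.Tensor,      # (3, H, W)  posed image
    target_depth: torch.Tensor,    # (1, H, W)  depth target
    target_feat: torch.Tensor,     # (C, H, W)  2D foundation-model features
    w_rgb: float = 1.0,
    w_depth: float = 1.0,
    w_feat: float = 1.0,
) -> torch.Tensor:
    # Appearance and geometry terms: photometric and depth reconstruction.
    loss_rgb = F.l1_loss(rendered_rgb, target_rgb)
    loss_depth = F.l1_loss(rendered_depth, target_depth)
    # Semantic term: per-pixel cosine distance between feature vectors.
    cos = F.cosine_similarity(rendered_feat, target_feat, dim=0)   # (H, W)
    loss_feat = (1.0 - cos).mean()
    return w_rgb * loss_rgb + w_depth * loss_depth + w_feat * loss_feat

# Example with dummy tensors (H = W = 64, C = 16 feature channels).
loss = tri_attribute_distillation_loss(
    torch.rand(3, 64, 64), torch.rand(1, 64, 64), torch.rand(16, 64, 64),
    torch.rand(3, 64, 64), torch.rand(1, 64, 64), torch.rand(16, 64, 64),
)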

Experiments

Linear Probing Results

Parameter efficiency via linear probing.

Data Efficiency Results

Data efficiency with limited scenes and point annotations.

Semantic Segmentation Results

Full fine-tuning semantic segmentation results.

Instance Segmentation Results

Full fine-tuning instance segmentation results.

Visualization

ScanNet UMAP Visualization

We visualize the input point cloud and the learned point representations using UMAP. We also present the corresponding rendered images, depth maps, and semantic-aware feature maps.

ScanNet Tri-attribute Visualization

Qualitative comparison of GaussianCross rendered images, depth, and semantic-aware feature maps with ground truth.

ScanNet Cosine Similarity Visualization

Visualization of activation maps of cosine similarity scores on ScanNet. The query points are highlighted with red cross marks.

S3DIS Cosine Similarity Visualization

Zero-shot representation of GaussianCross on S3DIS. The query points are highlighted with red circles.

ScanNet++ Cosine Similarity Visualization

Zero-shot representation of GaussianCross on ScanNet++. The query points are highlighted with red circles.

Acknowledgment 👏

The research work was conducted in the JC STEM Lab of Machine Learning and Computer Vision funded by The Hong Kong Jockey Club Charities Trust.

BibTeX 📝

@article{yao2025gaussian,
    title={GaussianCross: Cross-modal Self-supervised 3D Representation Learning via Gaussian Splatting},
    author={Yao, Lei and Wang, Yi and Zhang, Yi and Liu, Moyun and Chau, Lap-Pui},
    journal={xxx},
    year={2025},
    publisher={xxx}
}