HAMMER
Harnessing MLLMs via Cross-Modal Integration for Intention-Driven 3D Affordance Grounding

CVPR 2026

1The Hong Kong Polytechnic University 2Huazhong University of Science and Technology

TL;DR:

We present HAMMER, a novel approach that harnesses Multi-modal Large Language Models (MLLMs) to ground 3D affordances through cross-modal integration for intention-driven object interaction understanding.

Abstract

Humans commonly identify 3D object affordances through observed interactions in images or videos, and once formed, such knowledge readily generalizes to novel objects. Inspired by this principle, we advocate a novel framework that leverages emerging multimodal large language models (MLLMs) for interaction intention-driven 3D affordance grounding, namely HAMMER. Instead of generating explicit object attribute descriptions or relying on off-the-shelf 2D segmenters, we aggregate the interaction intention depicted in the image into a contact-aware embedding and guide the model to infer textual affordance labels, ensuring that it thoroughly mines object semantics and contextual cues. We further devise a hierarchical cross-modal integration mechanism to fully exploit the complementary information from the MLLM for 3D representation refinement, and introduce a multi-granular geometry lifting module that infuses spatial characteristics into the extracted intention embedding, thus facilitating accurate 3D affordance localization. Extensive experiments on public datasets and our newly constructed corrupted benchmark demonstrate the superiority and robustness of HAMMER over existing approaches.

Pipeline

Pipeline

The overall architecture of HAMMER. Given a 3D point cloud \(\mathbf{P}\) and its corresponding interaction image \(\mathbf{I}\), our framework first processes \(\mathbf{I}\) through a pre-trained MLLM \(\mathcal{F}_{\theta}\) to extract an affordance-guided intention embedding \(\mathbf{f}_c\). This embedding is then used to enhance point cloud features via a hierarchical cross-modal integration mechanism. To imbue \(\mathbf{f}_c\) with 3D spatial awareness, we apply a multi-granular geometry lifting module that incorporates multi-scale geometric cues. Finally, the refined point features \(\tilde{\mathbf{f}}_p\) and the 3D-aware intention embedding \(\mathbf{f}_c^{3D}\) are decoded to produce the final affordance map \(\mathbf{p}\).
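For readers who prefer code, below is a minimal PyTorch-style sketch of this data flow. The class HammerSketch, its module names, feature dimensions, and the specific attention and pooling choices are illustrative assumptions for exposition only; they do not reproduce the authors' implementation of \(\mathcal{F}_{\theta}\), the hierarchical cross-modal integration, or the geometry lifting module.

import torch
import torch.nn as nn


class HammerSketch(nn.Module):
    """Toy sketch of the HAMMER data flow above; not the official implementation."""

    def __init__(self, d_img=768, d_model=256, n_levels=3):
        super().__init__()
        # Stand-in for the pre-trained MLLM F_theta: here just a projection of pooled
        # image tokens into the affordance-guided intention embedding f_c.
        self.intention_proj = nn.Linear(d_img, d_model)
        # Simple per-point encoder producing point features f_p from xyz coordinates.
        self.point_encoder = nn.Sequential(
            nn.Linear(3, d_model), nn.ReLU(), nn.Linear(d_model, d_model)
        )
        # Hierarchical cross-modal integration: at each level the point features
        # attend to the intention embedding and are refined by it.
        self.integration = nn.ModuleList(
            [nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
             for _ in range(n_levels)]
        )
        # Multi-granular geometry lifting: fuse pooled geometric cues from every level
        # into f_c, yielding the 3D-aware intention embedding f_c^{3D}.
        self.geometry_lift = nn.Linear(d_model * (n_levels + 1), d_model)
        # Decoder mapping refined point features and f_c^{3D} to a per-point score p.
        self.decoder = nn.Sequential(
            nn.Linear(d_model * 2, d_model), nn.ReLU(), nn.Linear(d_model, 1)
        )

    def forward(self, points, image_tokens):
        # points: (B, N, 3) point cloud P; image_tokens: (B, T, d_img) MLLM features of I.
        f_c = self.intention_proj(image_tokens.mean(dim=1, keepdim=True))  # (B, 1, D)
        f_p = self.point_encoder(points)                                   # (B, N, D)

        geometry_cues = []
        for attn in self.integration:
            # Point features query the intention embedding (cross-modal refinement).
            refined, _ = attn(query=f_p, key=f_c, value=f_c)
            f_p = f_p + refined                           # residual update of f_p
            geometry_cues.append(f_p.max(dim=1).values)   # (B, D) cue at this level

        # Lift f_c with multi-scale geometric cues to obtain f_c^{3D}.
        f_c_3d = self.geometry_lift(torch.cat([f_c.squeeze(1)] + geometry_cues, dim=-1))

        # Decode the affordance map p: one probability per point.
        fused = torch.cat(
            [f_p, f_c_3d.unsqueeze(1).expand(-1, f_p.size(1), -1)], dim=-1
        )
        return torch.sigmoid(self.decoder(fused)).squeeze(-1)              # (B, N)


if __name__ == "__main__":
    model = HammerSketch()
    p = model(torch.randn(2, 2048, 3), torch.randn(2, 49, 768))
    print(p.shape)  # torch.Size([2, 2048])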

Visualization

Predictions on the PIAD dataset.

PIAD Dataset Visualization

Results on PIAD v2 seen classes

PIAD v2 Seen Visualization

Zero-shot generalization on PIAD v2 unseen splits

PIAD v2 Unseen Visualization

Effectiveness of feature integration

Effectiveness

Experiments

Quantitative results on the PIAD dataset

PIAD Results

Quantitative results on the PIAD v2 dataset

PIAD v2 Results

Robustness evaluation

Robustness Evaluation

Quantitative results

Robustness Comparison

More qualitative results on the corrupted benchmark

Qualitative Corrupted

Acknowledgment 👏

The research work described in this paper was conducted in the JC STEM Lab of Machine Learning and Computer Vision funded by The Hong Kong Jockey Club Charities Trust. This research received partial support from the Global STEM Professorship Scheme of the Hong Kong Special Administrative Region.

BibTeX 📝

@inproceedings{yao2026hammer,
  title={HAMMER: Harnessing MLLMs via Cross-Modal Integration for Intention-Driven 3D Affordance Grounding},
  author={Yao, Lei and Chen, Yong and Su, Yuejiao and Wang, Yi and Liu, Moyun and Chau, Lap-Pui},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}