HAMMER
Harnessing MLLMs via Cross-Modal Integration for Intention-Driven 3D Affordance Grounding

CVPR 2026

¹The Hong Kong Polytechnic University, ²Huazhong University of Science and Technology

TL;DR:

We present HAMMER, a novel approach that harnesses Multi-modal Large Language Models (MLLMs) to ground 3D affordances through cross-modal integration for intention-driven object interaction understanding.

Abstract

Humans commonly identify 3D object affordance through observed interactions in images or videos, and once formed, such knowledge can be readily generalized to novel objects. Inspired by this principle, we advocate for a novel framework, namely HAMMER, that leverages emerging multimodal large language models (MLLMs) for interaction intention-driven 3D affordance grounding. Instead of generating explicit object attribute descriptions or relying on off-the-shelf 2D segmenters, we aggregate the interaction intention depicted in the image into a contact-aware embedding and guide the model to infer textual affordance labels, ensuring it thoroughly exploits object semantics and contextual cues. We further devise a hierarchical cross-modal integration mechanism to fully exploit the complementary information from the MLLM for 3D representation refinement, and introduce a multi-granular geometry lifting module that infuses spatial characteristics into the extracted intention embedding, thus facilitating accurate 3D affordance localization. Extensive experiments on public datasets and our newly constructed PIADv1-C benchmark demonstrate the superiority and robustness of HAMMER compared to existing approaches.

Key Contributions

  • Intention-aware MLLM grounding. We aggregate interaction intention from images into a contact-aware embedding that guides the MLLM to infer textual affordance labels — bypassing explicit attribute descriptions and off-the-shelf 2D segmenters.
  • Hierarchical cross-modal integration. A dedicated mechanism that fuses complementary MLLM features into 3D point representations for refined affordance grounding.
  • Multi-granular geometry lifting. A module that infuses multi-scale spatial cues into the intention embedding, yielding 3D-aware features for accurate localization.
  • PIADv1-C benchmark. A newly constructed corruption-style benchmark for 3D affordance grounding, on which HAMMER demonstrates superior robustness over existing approaches.
  • State-of-the-art results. On PIAD, HAMMER improves aIoU by +1.69 on the Seen split and +5.39 on the Unseen split over the previous best, with consistent gains on PIAD v2 and the PIADv1-C benchmark.

How HAMMER differs from prior work. Existing 3D affordance grounding methods either rely on off-the-shelf 2D segmenters (e.g., GREAT) or generate explicit object attribute descriptions. HAMMER instead extracts a single contact-aware intention embedding directly from an MLLM and lifts it into 3D — yielding stronger generalization to unseen objects (+5.39 aIoU on PIAD Unseen) and improved robustness on PIADv1-C.

Pipeline

HAMMER architecture: an interaction image and a 3D point cloud are fused through an MLLM-derived intention embedding, hierarchical cross-modal integration, and multi-granular geometry lifting to produce the final 3D affordance map.

The overall architecture of HAMMER. Given a 3D point cloud \(\mathbf{P}\) and its corresponding interaction image \(\mathbf{I}\), our framework first processes \(\mathbf{I}\) through a pre-trained MLLM \(\mathcal{F}_{\theta}\) to extract an affordance-guided intention embedding \(\mathbf{f}_c\). This embedding is then used to enhance point cloud features via a hierarchical cross-modal integration mechanism. To imbue \(\mathbf{f}_c\) with 3D spatial awareness, we apply a multi-granular geometry lifting module that incorporates multi-scale geometric cues. Finally, the refined point features \(\tilde{\mathbf{f}}_p\) and the 3D-aware intention embedding \(\mathbf{f}_c^{3D}\) are decoded to produce the final affordance map \(\mathbf{p}\).
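The data flow in this caption can be traced with a toy sketch. It is not the HAMMER implementation: the MLLM is not run, \(\mathbf{f}_c\) is a random vector, and the functions `integrate`, `lift`, and `decode` are hypothetical stand-ins (a similarity gate, centroid-based pooling at several radii, and a sigmoid over dot products) chosen only to make the order of operations concrete. Feeding the refined features into the lifting step is also an assumption.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def integrate(point_feats, f_c):
    """Stand-in for cross-modal integration (assumption, not the paper's module):
    add f_c to each point feature, scaled by a similarity gate."""
    refined = []
    for f in point_feats:
        gate = sigmoid(dot(f, f_c))
        refined.append([x + gate * c for x, c in zip(f, f_c)])
    return refined

def lift(f_c, point_feats, coords, radii=(0.25, 0.5, 1.0)):
    """Stand-in for multi-granular geometry lifting (assumption): pool point
    features within several radii of the centroid and add them to f_c."""
    n = len(coords)
    centroid = [sum(c[k] for c in coords) / n for k in range(3)]
    lifted = list(f_c)
    for r in radii:
        near = [f for f, c in zip(point_feats, coords)
                if math.dist(c, centroid) <= r]
        if not near:
            continue  # no points at this scale; skip it
        pooled = [sum(col) / len(near) for col in zip(*near)]
        lifted = [a + b / len(radii) for a, b in zip(lifted, pooled)]
    return lifted

def decode(refined, f_c_3d):
    """Per-point affordance score in [0, 1]."""
    return [sigmoid(dot(f, f_c_3d)) for f in refined]

def hammer_like_forward(coords, point_feats, f_c):
    refined = integrate(point_feats, f_c)   # refined point features
    f_c_3d = lift(f_c, refined, coords)     # 3D-aware intention embedding
    return decode(refined, f_c_3d)          # affordance map p

rng = random.Random(0)
coords = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(8)]
point_feats = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(8)]
f_c = [rng.gauss(0, 1) for _ in range(4)]  # placeholder for the MLLM output
affordance = hammer_like_forward(coords, point_feats, f_c)
print(len(affordance), all(0.0 <= p <= 1.0 for p in affordance))
```

The sketch only shows that the output is one score per input point; the actual modules, feature dimensions, and decoder are described in the paper.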

Visualization

Predictions on PIAD dataset

Side-by-side qualitative predictions on the PIAD dataset comparing HAMMER outputs against ground-truth affordance maps.

Results on PIAD v2 seen classes

Qualitative predictions on PIAD v2 seen-class split.

Zero-shot generalization on PIAD v2 unseen splits

Zero-shot affordance grounding on PIAD v2 unseen-class splits.

Effectiveness of the feature integration

Feature visualizations showing the effect of the cross-modal integration mechanism.

Experiments

Quantitative results on PIAD dataset (+1.69 aIoU on Seen, +5.39 aIoU on Unseen)

Quantitative comparison table on the PIAD dataset.
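For readers unfamiliar with the aIoU metric quoted above, the following is a minimal sketch of one convention commonly used in 3D affordance grounding: the ground-truth map is binarized, the prediction is thresholded at evenly spaced levels, and the resulting IoU values are averaged. The function name `aiou`, the ground-truth cutoff of 0.5, and the 100 thresholds from 0.00 to 0.99 are assumptions for illustration; the exact protocol used for the reported numbers may differ.

```python
def aiou(pred, gt, gt_threshold=0.5, num_thresholds=100):
    """Average IoU over prediction thresholds 0.00, 0.01, ..., 0.99.

    pred: per-point affordance scores in [0, 1]
    gt:   per-point ground-truth scores, binarized at gt_threshold
    Returns NaN when the ground truth has no positive point.
    """
    gt_mask = [g >= gt_threshold for g in gt]
    if not any(gt_mask):
        return float("nan")
    ious = []
    for i in range(num_thresholds):
        t = i / num_thresholds
        pred_mask = [p >= t for p in pred]
        inter = sum(p and g for p, g in zip(pred_mask, gt_mask))
        union = sum(p or g for p, g in zip(pred_mask, gt_mask))
        ious.append(inter / union)  # union > 0 because gt has a positive
    return sum(ious) / len(ious)

# A soft prediction that ranks the two positive points highest.
print(round(aiou([0.9, 0.8, 0.1, 0.0], [1, 1, 0, 0]), 4))
```

Because the threshold 0.00 marks every point as positive, even a perfect binary prediction scores slightly below 1 under this convention.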

Quantitative results on PIAD v2 dataset

Quantitative comparison table on the PIAD v2 dataset.

Robustness evaluation

Qualitative robustness evaluation under input corruptions.

Quantitative results

Quantitative robustness comparison table on the PIADv1-C benchmark.

More qualitative results on the PIADv1-C benchmark

Additional qualitative predictions on the PIADv1-C benchmark.

Resources & Benchmark

Alongside the model, we release PIADv1-C — a new corruption-style benchmark for evaluating the robustness of 3D affordance grounding methods. We invite the community to evaluate on PIADv1-C and report results in future work.
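As a rough illustration of what a corruption-style robustness test can look like, the sketch below applies two generic point-cloud corruptions at a chosen severity. The corruption types (`jitter`, `dropout`), the severity scaling, and the function name `corrupt` are hypothetical and are not taken from PIADv1-C; consult the released benchmark for its actual corruption set and parameters.

```python
import random

def corrupt(points, labels, kind, severity, seed=0):
    """Return a corrupted copy of (points, labels).

    points:   list of [x, y, z] coordinates
    labels:   per-point affordance labels, kept aligned with points
    kind:     "jitter" or "dropout" (illustrative choices, not PIADv1-C's)
    severity: integer level, e.g. 1 (mild) to 5 (strong)
    """
    rng = random.Random(seed)
    if kind == "jitter":
        # Gaussian noise on every coordinate; labels are unchanged.
        sigma = 0.01 * severity
        noisy = [[x + rng.gauss(0.0, sigma) for x in p] for p in points]
        return noisy, list(labels)
    if kind == "dropout":
        # Randomly remove a fraction of points and their labels together.
        keep = max(1, round(len(points) * (1.0 - 0.15 * severity)))
        idx = sorted(rng.sample(range(len(points)), keep))
        return [points[i] for i in idx], [labels[i] for i in idx]
    raise ValueError(f"unknown corruption: {kind}")

rng = random.Random(1)
cloud = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(100)]
gt = [rng.random() for _ in range(100)]
dropped_pts, dropped_gt = corrupt(cloud, gt, "dropout", severity=2)
print(len(dropped_pts), len(dropped_gt))
```

Keeping labels aligned with the surviving points matters for any corruption that changes the point count, since the metric is computed per point.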

Acknowledgment

The research work described in this paper was conducted in the JC STEM Lab of Machine Learning and Computer Vision funded by The Hong Kong Jockey Club Charities Trust. This research received partial support from the Global STEM Professorship Scheme from the Hong Kong Special Administrative Region.

BibTeX

@inproceedings{yao2026hammer,
  title={HAMMER: Harnessing MLLMs via Cross-Modal Integration for Intention-Driven 3D Affordance Grounding},
  author={Yao, Lei and Chen, Yong and Su, Yuejiao and Wang, Yi and Liu, Moyun and Chau, Lap-Pui},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}