HAMMER: Harnessing MLLMs via Cross-Modal Integration for Intention-Driven 3D Affordance Grounding

Abstract

Humans commonly identify 3D object affordances through interactions observed in images or videos, and once formed, such knowledge generalizes readily to novel objects. Inspired by this principle, we propose HAMMER, a novel framework that leverages emerging multimodal large language models (MLLMs) for interaction intention-driven 3D affordance grounding. Instead of generating explicit object attribute descriptions or relying on off-the-shelf 2D segmenters, we aggregate the interaction intention depicted in the image into a contact-aware embedding and guide the model to infer textual affordance labels, ensuring that it thoroughly mines object semantics and contextual cues. We further devise a hierarchical cross-modal integration mechanism to fully exploit the complementary information from the MLLM for 3D representation refinement, and we introduce a multi-granular geometry lifting module that infuses spatial characteristics into the extracted intention embedding, thus facilitating accurate 3D affordance localization. Extensive experiments on public datasets and our newly constructed PIADv1-C benchmark demonstrate the superiority and robustness of HAMMER compared to existing approaches.

Key Contributions

Intention-aware MLLM grounding. We aggregate interaction intention from images into a contact-aware embedding that guides the MLLM to infer textual affordance labels, bypassing both explicit attribute descriptions and off-the-shelf 2D segmenters.
Hierarchical cross-modal integration. A dedicated mechanism that fuses complementary MLLM features into 3D point representations for refined affordance grounding.
Multi-granular geometry lifting. A module that infuses multi-scale spatial cues into the intention embedding, yielding 3D-aware features for accurate localization.
PIADv1-C benchmark. A newly constructed corruption-style benchmark for 3D affordance grounding, on which HAMMER demonstrates superior robustness over existing approaches.
State-of-the-art results. On PIAD, HAMMER improves aIoU by +1.69 on the Seen split and +5.39 on the Unseen split over the previous best, with consistent gains on PIAD v2 and the PIADv1-C benchmark.
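The aIoU gains quoted above can be read more concretely with a small metric sketch. The exact evaluation protocol is not stated on this page, so the version below is an assumption: it follows a definition common in the affordance literature, where the predicted per-point scores are binarized at a sweep of thresholds and the resulting IoU values are averaged. The function name `aiou`, the 20-step threshold sweep, and the 0.5 ground-truth cutoff are all illustrative choices, not the authors' settings.

```python
import numpy as np

def aiou(pred, gt, thresholds=None, gt_thresh=0.5):
    """Average IoU over a sweep of binarization thresholds (assumed definition).

    pred: per-point affordance scores in [0, 1], shape (N,)
    gt:   per-point ground-truth scores in [0, 1], shape (N,)
    """
    if thresholds is None:
        # Hypothetical sweep: 0.00, 0.05, ..., 0.95
        thresholds = np.linspace(0.0, 1.0, 20, endpoint=False)
    pred = np.asarray(pred, dtype=float)
    gt_mask = np.asarray(gt, dtype=float) >= gt_thresh
    ious = []
    for t in thresholds:
        pred_mask = pred > t
        union = np.logical_or(pred_mask, gt_mask).sum()
        inter = np.logical_and(pred_mask, gt_mask).sum()
        # An empty union means both masks are empty; count that as a perfect match.
        ious.append(inter / union if union > 0 else 1.0)
    return float(np.mean(ious))

perfect = aiou([0.0, 1.0, 1.0, 0.0], [0.0, 1.0, 1.0, 0.0])
inverted = aiou([1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0])
print(perfect, inverted)  # 1.0 0.0
```

Averaging over thresholds rewards predictions whose scores are well ordered across the whole object rather than correct only at one cutoff, which is presumably why the metric is reported alongside the Seen and Unseen splits.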

How HAMMER differs from prior work. Existing 3D affordance grounding methods either rely on off-the-shelf 2D segmenters (e.g., GREAT) or generate explicit object attribute descriptions. HAMMER instead extracts a single contact-aware intention embedding directly from an MLLM and lifts it into 3D, which yields stronger generalization to unseen objects (+5.39 aIoU on PIAD Unseen) and improved robustness on PIADv1-C.

HAMMER architecture: an interaction image and a 3D point cloud are fused through an MLLM-derived intention embedding, hierarchical cross-modal integration, and multi-granular geometry lifting to produce the final 3D affordance map.

The overall architecture of HAMMER. Given a 3D point cloud \(\mathbf{P}\) and its corresponding interaction image \(\mathbf{I}\), our framework first processes \(\mathbf{I}\) through a pre-trained MLLM \(\mathcal{F}_{\theta}\) to extract an affordance-guided intention embedding \(\mathbf{f}_c\). This embedding is then used to enhance point cloud features via a hierarchical cross-modal integration mechanism. To imbue \(\mathbf{f}_c\) with 3D spatial awareness, we apply a multi-granular geometry lifting module that incorporates multi-scale geometric cues. Finally, the refined point features \(\tilde{\mathbf{f}}_p\) and the 3D-aware intention embedding \(\mathbf{f}_c^{3D}\) are decoded to produce the final affordance map \(\mathbf{p}\).
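The data flow described above can be sketched with random features to make the tensor shapes explicit. This is a minimal NumPy illustration under stated assumptions, not the released implementation: the MLLM and point encoder are replaced by random arrays, the integration step is a single cross-attention layer rather than a hierarchical one, contiguous chunks of points stand in for real spatial grouping, and the decoder is a plain dot product followed by a sigmoid. All function names, `mllm_tokens`, and the sizes `N`, `T`, `D` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_integration(f_p, mllm_tokens):
    """Refine point features (N, D) by attending to MLLM token features (T, D)."""
    d = f_p.shape[1]
    attn = softmax(f_p @ mllm_tokens.T / np.sqrt(d), axis=1)  # (N, T)
    return f_p + attn @ mllm_tokens                           # residual update, (N, D)

def geometry_lifting(f_c, f_p, group_counts=(1, 4, 16)):
    """Add point features pooled at several granularities to the embedding (D,)."""
    lifted = f_c.copy()
    for g in group_counts:
        # Assumption: contiguous chunks stand in for spatial neighborhoods.
        groups = np.array_split(f_p, g, axis=0)
        pooled = np.stack([grp.max(axis=0) for grp in groups])  # (g, D)
        weights = softmax(pooled @ f_c / np.sqrt(f_c.size))     # (g,)
        lifted = lifted + weights @ pooled
    return lifted

def decode(f_p_refined, f_c_3d):
    """Per-point affordance score in [0, 1] from feature/embedding similarity."""
    logits = f_p_refined @ f_c_3d / np.sqrt(f_c_3d.size)
    return 1.0 / (1.0 + np.exp(-logits))

N, T, D = 2048, 32, 64                      # points, MLLM tokens, feature width
f_p = rng.standard_normal((N, D))           # stand-in for encoded point cloud P
mllm_tokens = rng.standard_normal((T, D))   # stand-in for MLLM hidden states of I
f_c = rng.standard_normal(D)                # stand-in for intention embedding f_c

f_p_refined = cross_modal_integration(f_p, mllm_tokens)
f_c_3d = geometry_lifting(f_c, f_p_refined)
p = decode(f_p_refined, f_c_3d)
print(p.shape)  # (2048,)
```

The point of the sketch is the ordering: point features are refined first, the intention embedding is then lifted using those refined features, and only the final step combines the two into one score per point.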

Predictions on PIAD dataset

Side-by-side qualitative predictions on the PIAD dataset comparing HAMMER outputs against ground-truth affordance maps.

Results on PIAD v2 seen classes

Qualitative predictions on PIAD v2 seen-class split.

Zero-shot generalization on PIAD v2 unseen splits

Zero-shot affordance grounding on PIAD v2 unseen-class splits.

Effectiveness of the feature integration

Feature visualizations showing the effect of the cross-modal integration mechanism.

Quantitative results on PIAD dataset (+1.69 aIoU on Seen, +5.39 aIoU on Unseen)

Quantitative comparison table on the PIAD dataset.

Quantitative results on PIAD v2 dataset

Quantitative comparison table on the PIAD v2 dataset.

Robustness evaluation

Quantitative results

Quantitative robustness comparison table on the PIADv1-C benchmark.

More qualitative results on the PIADv1-C benchmark

Additional qualitative predictions on the PIADv1-C benchmark.

Alongside the model, we release PIADv1-C, a new corruption-style benchmark for evaluating the robustness of 3D affordance grounding methods. We invite the community to evaluate on PIADv1-C and report results in future work.
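For readers unfamiliar with corruption-style evaluation, the idea can be sketched as follows. The specific corruption types and severities in PIADv1-C are not listed on this page, so the two operations below (Gaussian jitter and random point dropout) are generic stand-ins chosen purely for illustration, and the function names and parameters are hypothetical.

```python
import numpy as np

def jitter(points, sigma, rng):
    """Perturb every coordinate with zero-mean Gaussian noise."""
    return points + rng.normal(0.0, sigma, size=points.shape)

def random_dropout(points, labels, drop_ratio, rng):
    """Remove a random subset of points, keeping labels aligned with points."""
    n = points.shape[0]
    n_keep = max(1, int(round(n * (1.0 - drop_ratio))))
    keep = np.sort(rng.permutation(n)[:n_keep])
    return points[keep], labels[keep]

rng = np.random.default_rng(0)
points = rng.uniform(-1.0, 1.0, size=(2048, 3))   # synthetic point cloud
labels = rng.uniform(0.0, 1.0, size=2048)         # synthetic per-point affordance scores

noisy = jitter(points, sigma=0.01, rng=rng)
sparse_pts, sparse_lbl = random_dropout(points, labels, drop_ratio=0.25, rng=rng)
print(noisy.shape, sparse_pts.shape, sparse_lbl.shape)  # (2048, 3) (1536, 3) (1536,)
```

A model is then evaluated on the corrupted inputs with unchanged weights, and the drop in its metrics relative to clean inputs indicates how robust it is.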

PIADv1-C benchmark · Hugging Face dataset (also includes preprocessed PIADv1 and PIADv2)
Pretrained weights · Hugging Face model
Code & training scripts · GitHub

The research work described in this paper was conducted in the JC STEM Lab of Machine Learning and Computer Vision, funded by The Hong Kong Jockey Club Charities Trust. This research received partial support from the Global STEM Professorship Scheme of the Hong Kong Special Administrative Region.

@inproceedings{yao2026hammer,
  title={HAMMER: Harnessing MLLMs via Cross-Modal Integration for Intention-Driven 3D Affordance Grounding},
  author={Yao, Lei and Chen, Yong and Su, Yuejiao and Wang, Yi and Liu, Moyun and Chau, Lap-Pui},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}

HAMMER
Harnessing MLLMs via Cross-Modal Integration for Intention-Driven 3D Affordance Grounding

CVPR 2026
