Identifying affordance regions on 3D objects from semantic cues is essential for robotics and human-machine interaction. However, existing 3D affordance learning methods struggle with generalization and robustness due to limited annotated data and a reliance on 3D backbones focused on geometric encoding, which often lack resilience to real-world noise and data corruption.
We propose GEAL, a novel framework designed to enhance the generalization and robustness of 3D affordance learning by leveraging large-scale pre-trained 2D models. We employ a dual-branch architecture with Gaussian splatting to establish consistent mappings between 3D point clouds and 2D representations, enabling realistic 2D renderings from sparse point clouds. A granularity-adaptive fusion module and a 2D-3D consistency alignment module further strengthen cross-modal alignment and knowledge transfer, allowing the 3D branch to benefit from the rich semantics and generalization capacity of 2D models.
To holistically assess robustness, we introduce two new corruption-based benchmarks: PIAD-C and LASO-C. Extensive experiments on public datasets and our benchmarks show that GEAL consistently outperforms existing methods across seen and novel object categories, as well as on corrupted data, demonstrating robust and adaptable affordance prediction under diverse conditions.
Our proposed GEAL framework consists of two branches: a 3D branch and a 2D branch. The 2D branch is established through 3D Gaussian Splatting to leverage the generalization capabilities of large pre-trained 2D models. We then perform cross-modality alignment, including Granularity-Adaptive Visual-Textual Fusion and 2D-3D Consistency Alignment, to unify features from the two modalities into a shared embedding space. Finally, we decode generalizable affordance predictions from this embedding space.
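For illustration, the sketch below shows one way such a dual-branch forward pass could be organized in PyTorch. All module names (renderer, fusion, align, decoder) are hypothetical placeholders under our assumptions and do not reflect the released implementation.

```python
import torch
import torch.nn as nn

class DualBranchAffordanceSketch(nn.Module):
    """Hypothetical outline of a GEAL-style dual-branch flow (not the official code)."""

    def __init__(self, encoder_3d, encoder_2d, renderer, fusion, align, decoder):
        super().__init__()
        self.encoder_3d = encoder_3d   # 3D point-cloud backbone
        self.encoder_2d = encoder_2d   # frozen, large-scale pre-trained 2D model
        self.renderer = renderer       # 3D Gaussian splatting: points -> multi-view images
        self.fusion = fusion           # granularity-adaptive visual-textual fusion
        self.align = align             # 2D-3D consistency alignment module
        self.decoder = decoder         # affordance decoder over per-point features

    def forward(self, points, text_emb):
        # 2D branch: render the sparse point cloud into images, then encode them
        views = self.renderer(points)              # (B, V, 3, H, W)
        feat_2d = self.encoder_2d(views)           # multi-view 2D features

        # 3D branch: encode the point geometry directly
        feat_3d = self.encoder_3d(points)          # per-point 3D features

        # Fuse visual features with the textual affordance query in each branch
        feat_2d = self.fusion(feat_2d, text_emb)
        feat_3d = self.fusion(feat_3d, text_emb)

        # Map both modalities into a shared space and enforce consistency
        feat_3d_aligned, consistency_loss = self.align(feat_2d, feat_3d)

        # Decode per-point affordance scores from the aligned 3D features
        affordance = self.decoder(feat_3d_aligned, text_emb)
        return affordance, consistency_loss
```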
The 2D-3D Consistency Alignment Module maps features from 2D and 3D modalities into a shared embedding space and enforces consistency alignment to enable effective knowledge transfer across branches.
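A minimal sketch of what such an alignment could look like, assuming the 2D features have already been associated with the N points and the consistency objective is a cosine-similarity loss in the shared space; the projection layers and loss form are our assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConsistencyAlignment(nn.Module):
    """Illustrative 2D-3D consistency alignment: project both modalities into a
    shared embedding space and penalize their disagreement (assumed design)."""

    def __init__(self, dim_2d: int, dim_3d: int, dim_shared: int = 256):
        super().__init__()
        self.proj_2d = nn.Linear(dim_2d, dim_shared)
        self.proj_3d = nn.Linear(dim_3d, dim_shared)

    def forward(self, feat_2d: torch.Tensor, feat_3d: torch.Tensor):
        # feat_2d: (B, N, dim_2d) 2D features associated with the N points
        # feat_3d: (B, N, dim_3d) per-point 3D features
        z_2d = F.normalize(self.proj_2d(feat_2d), dim=-1)
        z_3d = F.normalize(self.proj_3d(feat_3d), dim=-1)

        # Cosine-similarity consistency loss pulls corresponding features together,
        # letting the 3D branch inherit semantics from the pre-trained 2D model.
        consistency_loss = (1.0 - (z_2d * z_3d).sum(dim=-1)).mean()
        return z_3d, consistency_loss
```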
The Granularity-Adaptive Fusion Module consists of a Flexible Granularity Feature Aggregation mechanism (a) and a Text-Conditioned Visual Alignment mechanism (b).
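The toy module below illustrates the two mechanisms under simplifying assumptions: multi-scale average pooling stands in for flexible granularity feature aggregation, and a single cross-attention layer stands in for text-conditioned visual alignment. The pooling scales and layer shapes are placeholders; the actual designs may differ.

```python
import torch
import torch.nn as nn

class GranularityAdaptiveFusion(nn.Module):
    """Toy version of (a) multi-granularity aggregation and (b) text-conditioned
    visual alignment via cross-attention (assumed, simplified design)."""

    def __init__(self, dim: int = 256, num_heads: int = 4, scales=(1, 4, 16)):
        super().__init__()
        self.scales = scales
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_tokens: torch.Tensor, text_emb: torch.Tensor):
        # visual_tokens: (B, N, dim) patch/point tokens; text_emb: (B, T, dim)
        pooled = []
        for s in self.scales:
            # (a) Flexible granularity: average-pool tokens into s coarse groups
            groups = visual_tokens.transpose(1, 2)                 # (B, dim, N)
            groups = nn.functional.adaptive_avg_pool1d(groups, s)  # (B, dim, s)
            pooled.append(groups.transpose(1, 2))                  # (B, s, dim)
        multi_granularity = torch.cat(pooled, dim=1)               # (B, sum(scales), dim)

        # (b) Text-conditioned alignment: the affordance query attends to visual tokens
        fused, _ = self.cross_attn(query=text_emb,
                                   key=multi_granularity,
                                   value=multi_granularity)
        return self.norm(fused + text_emb)
```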
We compare GEAL with state-of-the-art 3D affordance learning methods on the PIAD and LASO datasets. Our method demonstrates strong generalization on both seen and unseen object categories.
We establish two corruption-based benchmarks, PIAD-C and LASO-C, to holistically evaluate the robustness of 3D affordance learning under real-world conditions. We apply seven types of corruptions -- Add Global, Add Local, Drop Global, Drop Local, Rotate, Scale, and Jitter -- each at five severity levels.
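As a rough illustration, the snippet below sketches a few such corruption operators on an (N, 3) point cloud. The parameter values standing in for severity levels are placeholders and do not reproduce the exact benchmark protocol.

```python
import numpy as np

def jitter(points: np.ndarray, sigma: float = 0.01) -> np.ndarray:
    """Jitter: add per-point Gaussian noise (severity would control sigma)."""
    return points + np.random.normal(0.0, sigma, size=points.shape)

def scale(points: np.ndarray, low: float = 0.8, high: float = 1.2) -> np.ndarray:
    """Scale: randomly rescale the object along each axis."""
    return points * np.random.uniform(low, high, size=(1, 3))

def rotate(points: np.ndarray, max_deg: float = 15.0) -> np.ndarray:
    """Rotate: rotate the cloud about the up-axis by a random angle."""
    theta = np.deg2rad(np.random.uniform(-max_deg, max_deg))
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
    return points @ R.T

def drop_global(points: np.ndarray, ratio: float = 0.25) -> np.ndarray:
    """Drop Global: randomly discard a fraction of all points."""
    keep = np.random.choice(len(points), int(len(points) * (1 - ratio)), replace=False)
    return points[keep]
```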
We compare GEAL with state-of-the-art 3D affordance learning methods on the proposed PIAD-C and LASO-C corruption benchmarks. Our method maintains strong generalization and robustness across these challenging scenarios.
@misc{lu2024gealgeneralizable3daffordance,
  title={GEAL: Generalizable 3D Affordance Learning with Cross-Modal Consistency},
  author={Dongyue Lu and Lingdong Kong and Tianxin Huang and Gim Hee Lee},
  year={2024},
  eprint={2412.09511},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2412.09511},
}