ICRA 2026

P3T: Prototypical Point-Level Prompt Tuning with Enhanced Generalization for 3D Vision-Language Models

Geunyoung Jung, Soohong Kim, Kyungwoo Song, Jiyoung Jung

AI summary

P3T enables parameter-efficient adaptation of 3D vision-language models that matches or beats full fine-tuning while preserving strong generalization across datasets.

Prompt tuning, 3D vision-language models, parameter-efficient fine-tuning, point cloud, generalization, prototypical loss

Problem

Adapting pre-trained 3D vision-language models to downstream tasks via full fine-tuning is computationally expensive, while existing prompt tuning methods often overfit and degrade the model's inherent generalization capability.

Approach

P3T prompts in the input space rather than inside the model. A Point Prompter generates instance-aware offsets for vulnerable point cloud patches, and a Text Prompter replaces hand-crafted text prompts with learnable context vectors. A prototypical loss aligns the embedding space and reduces intra-category variance.
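
To make the idea of input-space point prompting concrete, here is a minimal sketch in NumPy. It is not the paper's Point Prompter: the two-layer MLP, the `tanh` bound, the `scale` value, and the function name `point_prompter` are all assumptions made for illustration, and the sketch offsets every point rather than selecting vulnerable patches. It only shows the general pattern of computing offsets from the input itself and adding them to the point cloud before a frozen encoder would see it.

```python
import numpy as np

def point_prompter(points, w1, b1, w2, b2, scale=0.05):
    """Hypothetical prompter: per-point offsets from a tiny MLP.

    points: (N, 3) array of xyz coordinates.
    Returns the prompted point cloud, also of shape (N, 3).
    """
    hidden = np.maximum(points @ w1 + b1, 0.0)   # ReLU features, (N, H)
    offsets = np.tanh(hidden @ w2 + b2) * scale  # bounded offsets, (N, 3)
    return points + offsets                      # the prompt lives in input space

# Toy data and randomly initialised (untrained) prompter weights.
rng = np.random.default_rng(0)
points = rng.normal(size=(1024, 3))
hidden_dim = 16
w1 = rng.normal(scale=0.1, size=(3, hidden_dim))
b1 = np.zeros(hidden_dim)
w2 = rng.normal(scale=0.1, size=(hidden_dim, 3))
b2 = np.zeros(3)

prompted = point_prompter(points, w1, b1, w2, b2)
```

Because the offsets depend on the input points, each instance receives its own prompt. In a setup like this, only the prompter weights would be trained while the pre-trained encoder stays fixed.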

Key results

Matches or exceeds full fine-tuning accuracy on classification with 91% fewer parameters
Achieves state-of-the-art few-shot learning performance on noisy real-world point clouds
Demonstrates superior cross-dataset generalization under significant data shifts
Preserves zero-shot capabilities through text prompt consistency regularization

Why it matters

Enables efficient, generalizable adaptation of 3D vision-language models for resource-constrained real-world applications without sacrificing pre-trained knowledge.

Abstract

With the rise of pre-trained models in the 3D point cloud domain for a wide range of real-world applications, adapting them to downstream tasks has become increasingly important. However, conventional full fine-tuning methods are computationally expensive and storage-intensive. Although prompt tuning has emerged as an efficient alternative, it often suffers from overfitting, thereby compromising generalization capability. To address this issue, we propose Prototypical Point-level Prompt Tuning (P3T), a parameter-efficient prompt tuning method designed for pre-trained 3D vision-language models (VLMs). P3T consists of two components: 1) Point Prompter, which generates instance-aware point-level prompts for the input point cloud, and 2) Text Prompter, which inserts learnable prompts into the input text in place of hand-crafted ones. Since both prompters operate directly on input data, P3T enables task-specific adaptation of 3D VLMs without sacrificing generalizability. Furthermore, to enhance embedding space alignment, which is key to fine-tuning 3D VLMs, we introduce a prototypical loss that reduces intra-category variance. Extensive experiments demonstrate that our method matches or outperforms full fine-tuning in classification and few-shot learning, and further exhibits robust generalization under data shift in the cross-dataset setting. The code is available at https://github.com/gyjung975/P3T.
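
The abstract does not give the form of the prototypical loss, so the following is only a sketch of one common formulation, borrowed from prototypical networks: each class prototype is the mean of that class's normalized embeddings, and a cross-entropy over cosine similarities pulls samples toward their own prototype. The function name `prototypical_loss`, the `temperature` value, and the toy data are assumptions, not details from the paper.

```python
import numpy as np

def prototypical_loss(embeddings, labels, temperature=0.1):
    """Cross-entropy over cosine similarity to class-mean prototypes (assumed form)."""
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    classes = np.unique(labels)                       # sorted class ids
    protos = np.stack([z[labels == c].mean(axis=0) for c in classes])
    protos /= np.linalg.norm(protos, axis=1, keepdims=True)
    logits = z @ protos.T / temperature               # (B, C) similarities
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    targets = np.searchsorted(classes, labels)        # class id -> prototype row
    return -log_probs[np.arange(len(labels)), targets].mean()

# Toy check: two classes in 8-D, tightly vs loosely clustered around their centers.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 20)
centers = np.eye(8)[labels]                           # class 0 -> e0, class 1 -> e1
tight_loss = prototypical_loss(centers + 0.01 * rng.normal(size=(40, 8)), labels)
loose_loss = prototypical_loss(centers + 1.0 * rng.normal(size=(40, 8)), labels)
```

In this toy setup, the tightly clustered embeddings yield a lower loss than the loosely clustered ones, which is the sense in which a loss of this kind rewards low intra-category variance.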

Index terms

Deep Learning for Visual Perception, Recognition, Visual Learning