ICRA 2026

Compositional Context Fine-Tuning Vision-Language Model for Complex Assembly Action Understanding from Videos

Hao Zheng, Jinyi Huang, Tiantian Zheng, Xun Xu, Tuka Alhanai

AI summary

Decomposing assembly actions into semantic elements (verb, object, tool) and fine-tuning vision-language models with templated questions yields interpretable, near-deterministic action recognition that outperforms strong baselines on two assembly video datasets.

Assembly action understanding; Vision-language models; Compositional fine-tuning; Human-robot collaboration; Parameter-efficient adaptation; Visual question answering

Problem

General vision-language models struggle with the subtle motions and fine-grained interactions of assembly tasks, often producing ambiguous outputs unsuitable for precise human-robot collaboration.

Approach

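The method, Compositional Context Fine-Tuning (CCFT), decomposes complex actions into verb, object, and tool elements, then fine-tunes a vision-language model on templated visual question-answering pairs, one per element. A Layer-Partitioned Alternating Training (LP-AT) strategy assigns distinct model layers to each element through element-specific low-rank adapters and alternates their updates to reduce cross-task interference.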
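To make the decomposition concrete, the following minimal sketch turns one labelled action into three templated question-answering pairs, one per element. It is an illustration only: the template wording, the `AssemblyAction` class, the `build_qa_pairs` helper, the `"none"` answer for tool-free actions, and the example labels are all assumptions, and the paper's actual templates and label vocabulary may differ.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Hypothetical question templates, one per action element.
# The wording is an assumption, not the templates used in the paper.
TEMPLATES: Dict[str, str] = {
    "verb": "What action is the person performing in the video?",
    "object": "Which object is the person manipulating in the video?",
    "tool": "Which tool is the person using in the video?",
}


@dataclass(frozen=True)
class AssemblyAction:
    """One labelled assembly action, decomposed into its elements."""
    verb: str
    obj: str
    tool: Optional[str] = None  # None when the action is performed by hand


def build_qa_pairs(action: AssemblyAction,
                   no_tool_answer: str = "none") -> List[Dict[str, str]]:
    """Return one templated question-answer pair per action element."""
    answers = {
        "verb": action.verb,
        "object": action.obj,
        # Assumed convention: tool-free actions get a fixed placeholder answer.
        "tool": action.tool if action.tool is not None else no_tool_answer,
    }
    return [
        {"element": element, "question": TEMPLATES[element], "answer": answers[element]}
        for element in ("verb", "object", "tool")
    ]


if __name__ == "__main__":
    # Example labels are made up for illustration.
    for pair in build_qa_pairs(AssemblyAction(verb="screw", obj="bolt", tool="wrench")):
        print(pair)
```

Because each question has a fixed template and a closed answer vocabulary, each answer maps directly back to one action element, which is what makes element-level predictions easy to inspect.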

Key results

CCFT enables near-deterministic, interpretable element-level predictions
LP-AT reduces cross-task interference while enabling per-adapter hyperparameter tuning
New compositional VQA datasets (HA-ViD-VQA and IKEA-ASM-VQA) are released
Consistently outperforms strong action recognition baselines on both datasets

Why it matters

Provides a reliable, interpretable foundation for adapting multimodal AI to precise manufacturing and human-robot collaborative assembly tasks.

Abstract

Assembly action understanding is a key enabler for effective human-robot collaborative assembly, yet it remains challenging due to subtle motions and fine-grained hand–object interactions. We adapt vision-language models (VLMs) to this challenging domain with Compositional Context Fine-Tuning (CCFT), a method that decomposes assembly actions into semantic elements (Verb, Object, Tool) and fine-tunes VLMs to recognize each action element using templated question-answering pairs. This approach ensures near-deterministic outputs. To enable efficient and effective multi-task learning under limited data, a Layer-Partitioned Alternating Training (LP-AT) method is presented, which assigns distinct model layers to recognize specific action elements through element-specific low-rank adapters. LP-AT alternates weight updates across element-specific adapters, reducing cross-task interference while enabling per-adapter hyperparameter optimization. Furthermore, we create HA-ViD-VQA and IKEA-ASM-VQA datasets from existing assembly video datasets. Extensive experiments on these datasets demonstrate that our method consistently outperforms strong action recognition baselines while providing interpretable element-level predictions that can support diverse downstream applications. Code and dataset are released at https://github.com/x-labs-xyz/CCFT.
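The LP-AT idea described in the abstract can be sketched as a training plan, without any model code. The sketch below rests on several assumptions that the abstract does not confirm: layers are split into contiguous blocks, adapters are updated in round-robin order, and the learning rates are placeholder values. The names `partition_layers`, `alternating_schedule`, and `training_plan` are hypothetical. It only plans which adapter would be updated at each step; a real implementation would attach low-rank adapters to the listed layers and update only those parameters.

```python
from itertools import cycle, islice
from typing import Dict, List, Sequence

ELEMENTS = ("verb", "object", "tool")

# Hypothetical per-adapter learning rates; the values are placeholders.
LEARNING_RATES: Dict[str, float] = {"verb": 1e-4, "object": 2e-4, "tool": 2e-4}


def partition_layers(num_layers: int,
                     elements: Sequence[str] = ELEMENTS) -> Dict[str, List[int]]:
    """Split layer indices 0..num_layers-1 into disjoint contiguous blocks.

    Contiguous blocks are an assumption; any disjoint assignment would
    satisfy the "distinct layers per element" description.
    """
    if num_layers < len(elements):
        raise ValueError("need at least one layer per element")
    base, extra = divmod(num_layers, len(elements))
    partition: Dict[str, List[int]] = {}
    start = 0
    for i, name in enumerate(elements):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first blocks
        partition[name] = list(range(start, start + size))
        start += size
    return partition


def alternating_schedule(elements: Sequence[str], num_steps: int) -> List[str]:
    """Round-robin order in which the element-specific adapters are updated."""
    return list(islice(cycle(elements), num_steps))


def training_plan(num_layers: int, num_steps: int) -> List[dict]:
    """List, per step, which adapter is updated, on which layers, at what rate."""
    partition = partition_layers(num_layers)
    return [
        {"step": step, "element": element,
         "layers": partition[element], "lr": LEARNING_RATES[element]}
        for step, element in enumerate(alternating_schedule(ELEMENTS, num_steps))
    ]


if __name__ == "__main__":
    for entry in training_plan(num_layers=8, num_steps=4):
        print(entry)
```

Keeping the layer blocks disjoint means an update for one element never touches another element's adapter weights, and giving each adapter its own entry in `LEARNING_RATES` mirrors the per-adapter hyperparameter optimization mentioned in the abstract.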

Index terms

Computer Vision for Manufacturing; Recognition; Assembly