Compositional Context Fine-Tuning Vision-Language Model for Complex Assembly Action Understanding from Videos
Hao Zheng, Jinyi Huang, Tiantian Zheng, Xun Xu, Tuka Alhanai
AI summary
Problem
General vision-language models struggle with the subtle motions and fine-grained interactions of assembly tasks, often producing ambiguous outputs unsuitable for precise human-robot collaboration.
Approach
The method decomposes complex actions into verb, object, and tool elements, then fine-tunes a vision-language model on templated visual question-answering pairs, using a layer-partitioned alternating training strategy to reduce cross-task interference.
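As a rough illustration of the templated question-answering idea, the sketch below turns one annotated action into three element-level QA pairs. The question wording, the `QAPair` type, the `build_qa_pairs` helper, and the `"none"` answer for a missing element are all assumptions made for this example; they are not taken from the paper or its released code.

```python
from dataclasses import dataclass

# Hypothetical question templates, one per action element.
# The paper's actual template wording is not given in this summary.
TEMPLATES = {
    "verb": "What action is the person performing?",
    "object": "Which object is being manipulated?",
    "tool": "Which tool is being used?",
}


@dataclass(frozen=True)
class QAPair:
    element: str   # "verb", "object", or "tool"
    question: str  # fixed template text for that element
    answer: str    # label drawn from a closed vocabulary


def build_qa_pairs(annotation: dict) -> list:
    """Expand one action annotation into one QA pair per element.

    Elements absent from the annotation get the answer "none"
    (an assumed convention, e.g. for actions performed by hand).
    """
    pairs = []
    for element, question in TEMPLATES.items():
        answer = annotation.get(element, "none")
        pairs.append(QAPair(element, question, answer))
    return pairs


if __name__ == "__main__":
    example = {"verb": "screw", "object": "bolt", "tool": "wrench"}
    for pair in build_qa_pairs(example):
        print(f"[{pair.element}] Q: {pair.question} A: {pair.answer}")
```

Because every question is fixed and every answer comes from a closed label set, the model's output space for each element is tightly constrained, which is one plausible reading of why such templates yield near-deterministic predictions.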
Key results
- CCFT enables near-deterministic, interpretable element-level predictions
- LP-AT reduces cross-task interference while enabling per-adapter hyperparameter tuning
- New compositional VQA datasets (HA-ViD-VQA and IKEA-ASM-VQA) are released
- CCFT consistently outperforms strong action recognition baselines on both datasets
Why it matters
Provides a reliable, interpretable foundation for adapting multimodal AI to precise manufacturing and human-robot collaborative assembly tasks.
Abstract
Assembly action understanding is a key enabler for effective human-robot collaborative assembly, yet it remains challenging due to subtle motions and fine-grained hand–object interactions. We adapt vision-language models (VLMs) to this challenging domain with Compositional Context Fine-Tuning (CCFT), a method that decomposes assembly actions into semantic elements (Verb, Object, Tool) and fine-tunes VLMs to recognize each action element using templated question-answering pairs. This approach ensures near-deterministic outputs. To enable efficient and effective multi-task learning under limited data, a Layer-Partitioned Alternating Training (LP-AT) method is presented, which assigns distinct model layers to recognize specific action elements through element-specific low-rank adapters. LP-AT alternates weight updates across element-specific adapters, reducing cross-task interference while enabling per-adapter hyperparameter optimization. Furthermore, we create HA-ViD-VQA and IKEA-ASM-VQA datasets from existing assembly video datasets. Extensive experiments on these datasets demonstrate that our method consistently outperforms strong action recognition baselines while providing interpretable element-level predictions that can support diverse downstream applications. Code and dataset are released at https://github.com/x-labs-xyz/CCFT.
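The layer partitioning and alternating updates described above can be pictured with a small, framework-free sketch. It assumes contiguous blocks of layers per element, a simple round-robin update order, and illustrative per-adapter learning rates; the abstract does not specify any of these, and the function names below are hypothetical rather than part of the released code.

```python
from itertools import cycle

ELEMENTS = ("verb", "object", "tool")

# Illustrative per-adapter learning rates (assumed values, not from the paper).
ADAPTER_LR = {"verb": 1e-4, "object": 2e-4, "tool": 5e-5}


def partition_layers(num_layers, elements=ELEMENTS):
    """Assign each element a contiguous block of layer indices.

    Contiguous, near-equal blocks are an assumption; the paper may
    partition layers differently.
    """
    base, extra = divmod(num_layers, len(elements))
    partition, start = {}, 0
    for i, element in enumerate(elements):
        size = base + (1 if i < extra else 0)
        partition[element] = list(range(start, start + size))
        start += size
    return partition


def alternating_schedule(num_steps, elements=ELEMENTS):
    """Return which element's adapter is updated at each training step."""
    order = cycle(elements)
    return [next(order) for _ in range(num_steps)]


def step_plan(num_layers, num_steps):
    """Combine partition and schedule into a per-step update plan."""
    partition = partition_layers(num_layers)
    plan = []
    for element in alternating_schedule(num_steps):
        plan.append({
            "element": element,
            "trainable_layers": partition[element],  # all other layers stay frozen
            "lr": ADAPTER_LR[element],
        })
    return plan


if __name__ == "__main__":
    for step, entry in enumerate(step_plan(num_layers=8, num_steps=4)):
        print(step, entry)
```

In a real training loop, each plan entry would select which low-rank adapter parameters receive gradient updates for that step, so that the verb, object, and tool adapters never modify the same layers and each can use its own hyperparameters.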