ICRA 2026

Order Matters: On Parameter-Efficient Image-To-Video Probing for Recognizing Nearly Symmetric Actions

Thinesh Thiyakesan Ponbagavathi, Alina Roitberg

AI summary

STEP explicitly models frame order in frozen vision foundation models to accurately recognize nearly symmetric actions while outperforming heavier fine-tuning methods.

Human-Robot Interaction, Action Recognition, Vision Foundation Models, Parameter-Efficient Probing, Temporal Modeling, Nearly Symmetric Actions

Problem

Standard probing ignores frame order, while parameter-efficient fine-tuning overfits on small datasets and demands high compute. As a result, robots struggle to distinguish actions that look alike but run in opposite temporal directions.
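
To make the frame-order blindness concrete, here is a minimal sketch assuming the probe is a temporal mean pool followed by a linear classifier, which is one common form of probing and not necessarily the exact baseline used in the paper. The feature shapes, the weights, and the name `mean_pool_probe` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-frame features from a frozen backbone: 8 frames, 16 dims.
frames = rng.normal(size=(8, 16))
# Hypothetical linear classifier weights for 2 classes (e.g. "open" vs. "close").
W = rng.normal(size=(16, 2))

def mean_pool_probe(frame_feats, weights):
    """Average frame features over time, then apply a linear classifier."""
    return frame_feats.mean(axis=0) @ weights

logits_forward = mean_pool_probe(frames, W)
logits_reversed = mean_pool_probe(frames[::-1], W)

# Reversing the clip leaves the logits unchanged (up to float rounding),
# so this probe cannot separate an action from its time-reversed twin.
same_logits = np.allclose(logits_forward, logits_reversed)
print(same_logits)
```

Because averaging discards the frame index, any permutation of the clip yields the same prediction under this assumed probe.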

Approach

STEP freezes the vision model backbone and injects temporal order into a lightweight probing head using frame-wise positional encodings, a global CLS token, and a simplified self-attention block.
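
The three ingredients named above can be sketched as a forward pass in NumPy. This is an illustration under assumptions, not the authors' implementation: the class name `OrderAwareProbe`, the single-head attention without an MLP or normalization, the random initialization, and all dimensions are hypothetical choices made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class OrderAwareProbe:
    """Hypothetical probing head: positional encodings + CLS token + one attention layer."""

    def __init__(self, num_frames, dim, num_classes, seed=0):
        rng = np.random.default_rng(seed)
        s = 1.0 / np.sqrt(dim)
        # One positional vector per frame index (learned in practice; random here).
        self.pos = rng.normal(scale=s, size=(num_frames, dim))
        # Global CLS token that summarizes the clip.
        self.cls = rng.normal(scale=s, size=(1, dim))
        # Single-head attention projections (an assumed simplification).
        self.Wq = rng.normal(scale=s, size=(dim, dim))
        self.Wk = rng.normal(scale=s, size=(dim, dim))
        self.Wv = rng.normal(scale=s, size=(dim, dim))
        # Linear classifier applied to the CLS output.
        self.Wout = rng.normal(scale=s, size=(dim, num_classes))

    def __call__(self, frame_feats):
        # frame_feats: (num_frames, dim) features from the frozen backbone.
        tokens = np.concatenate([self.cls, frame_feats + self.pos], axis=0)
        q, k, v = tokens @ self.Wq, tokens @ self.Wk, tokens @ self.Wv
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
        out = attn @ v
        return out[0] @ self.Wout  # logits read from the CLS position

rng = np.random.default_rng(1)
frames = rng.normal(size=(8, 16))  # hypothetical frozen-backbone features
probe = OrderAwareProbe(num_frames=8, dim=16, num_classes=2)

logits_fwd = probe(frames)
logits_rev = probe(frames[::-1])
# With positional encodings, reversing the clip changes the logits.
order_sensitive = not np.allclose(logits_fwd, logits_rev)
print(order_sensitive)
```

Only the head's arrays would be trained in such a setup; the backbone producing `frames` stays frozen, which is what keeps the trainable parameter count small.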

Key results

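4–10% accuracy gain over conventional probing on nearly symmetric actions, and 6–15% overall, across three benchmarks (human-robot interaction, industrial assembly, driver assistance)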
Surpasses heavier PEFT and fully fine-tuned baselines with only 2.6M trainable parameters
Reduces multi-task computation by up to 6× compared to parameter-efficient fine-tuning
Demonstrates that explicit frame-order modeling is essential for disambiguating temporally opposite actions

Why it matters

Enables robots to safely and efficiently interpret subtle, sequence-dependent human intentions in close collaboration using frozen foundation models.

Abstract

Fine-grained understanding of human actions is essential for safe and intuitive human–robot interaction. We study the challenge of recognizing nearly symmetric actions such as picking up vs. placing down a tool or opening vs. closing a drawer. These actions are common in close human–robot collaboration, yet they are rare and largely overlooked in mainstream vision frameworks. Pretrained vision foundation models (VFMs) are often adapted using probing, valued in robotics for its efficiency and low data needs, or parameter-efficient fine-tuning (PEFT), which adds temporal modeling through adapters or prompts. However, our analysis shows that probing is permutation-invariant and blind to frame order, while PEFT is prone to overfitting on smaller HRI datasets and less practical in real-world robotics due to compute constraints. To address this, we introduce STEP (Self-attentive Temporal Embedding Probing), a lightweight extension to probing that models temporal order via frame-wise positional encodings, a global CLS token, and a simplified attention block. Compared to conventional probing, STEP improves accuracy by 4–10% on nearly symmetric actions and 6–15% overall across action recognition benchmarks in human–robot interaction, industrial assembly, and driver assistance. Beyond probing, STEP surpasses heavier PEFT methods and even outperforms fully fine-tuned models on all three benchmarks, establishing a new state-of-the-art. Code and models will be made publicly available: https://github.com/th-nesh/STEP.

Index terms

Gesture, Posture and Facial Expressions; Human-Robot Collaboration; Recognition