ICRA 2026

Dexterity from Smart Lenses: Multi-Fingered Robot Manipulation with In-The-Wild Human Demonstrations

Irmak Guzey, Haozhi Qi, Julen Urain, Changhao Wang, Jessica Yin, Chaithanya Krishna Bodduluri, Mike Maroje Lambeta, Lerrel Pinto, Akshara Rai, Jitendra Malik, Tingfan Wu, Akash Sharma, Homanga Bharadhwaj

AI summary

Learning dexterous multi-fingered robot manipulation directly from in-the-wild human videos collected with smart glasses, without any robot data or simulation.

dexterous manipulation, in-the-wild learning, smart glasses, imitation learning, 3D point policies, human-to-robot transfer

Problem

Scalable learning of dexterous robot policies has been bottlenecked by the embodiment gap between humans and robots and by the difficulty of extracting reliable 3D motion cues from unstructured in-the-wild human videos.

Approach

AINA leverages Aria Gen 2 smart glasses to capture in-the-wild human demonstrations, extracts 3D hand keypoints and object point clouds, aligns them to the robot's frame using a single in-scene demo, and trains a 3D point-based transformer policy for direct deployment.
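The summary does not say how the alignment to the robot's frame is computed. As a generic illustration only, the sketch below estimates a rigid transform between two sets of corresponding 3D points with the Kabsch algorithm and applies it to new points. The function names (kabsch_align, apply_transform), the use of NumPy, and the assumption that point correspondences are available are all hypothetical and are not taken from the paper.

```python
import numpy as np


def kabsch_align(source, target):
    """Estimate rotation R and translation t so that R @ source_i + t ~= target_i.

    Hypothetical helper: assumes `source` and `target` are (N, 3) arrays of
    corresponding points, e.g. keypoints seen in the glasses frame and the
    same keypoints expressed in the robot frame.
    """
    source = np.asarray(source, dtype=float)
    target = np.asarray(target, dtype=float)

    # Center both point sets on their centroids.
    src_centroid = source.mean(axis=0)
    tgt_centroid = target.mean(axis=0)

    # 3x3 cross-covariance between the centered sets.
    H = (source - src_centroid).T @ (target - tgt_centroid)

    # The SVD of the cross-covariance yields the optimal rotation.
    U, _, Vt = np.linalg.svd(H)

    # Flip the last axis if needed so the result is a proper rotation
    # (determinant +1) rather than a reflection.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T

    t = tgt_centroid - R @ src_centroid
    return R, t


def apply_transform(points, R, t):
    """Map (N, 3) points through the rigid transform (R, t)."""
    return np.asarray(points, dtype=float) @ R.T + t
```

Under these assumptions, a transform estimated once from an in-scene demonstration could be reused to map hand keypoints and object points from every in-the-wild demonstration into the robot's frame.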

Key results

First framework to train multi-fingered policies without robot data, simulation, or online corrections
Novel domain-alignment method bridging in-the-wild videos and robot environments via 3D point clouds
Successful closed-loop manipulation across five everyday tasks with only ~15 minutes of human video collection per task
Outperforms prior human-to-robot learning baselines in accuracy and cluttered-scene generalization

Why it matters

Eliminates the need for labor-intensive robot data collection, enabling scalable and generalizable dexterous manipulation using only everyday wearable devices.

Abstract

Learning multi-fingered robot policies from humans performing daily tasks in natural environments has long been a grand goal in the robotics community. Achieving this would mark significant progress toward generalizable robot manipulation in human environments, as it would reduce the reliance on labor-intensive robot interaction data collection. Despite substantial efforts, progress toward this goal has been bottlenecked by the embodiment gap between humans and robots, as well as by difficulties in extracting relevant contextual and motion cues that enable learning of autonomous policies from in-the-wild human videos. We claim that with simple yet sufficiently powerful hardware for obtaining human data and our proposed framework AINA, we are now one significant step closer to achieving this dream. AINA enables learning multi-fingered policies from in-the-wild data using Aria Gen 2 glasses. These glasses are lightweight and portable, feature a high-resolution RGB camera, provide accurate on-board 3D head and hand poses, and offer a wide stereo view that can be leveraged for depth estimation of the scene. This setup enables the learning of 3D point-based policies for multi-fingered hands that are robust to background changes and can be deployed directly without requiring any robot data (including online corrections, reinforcement learning, or simulation). We compare our framework against prior human-to-robot policy learning approaches, ablate our design choices, and demonstrate results across five everyday manipulation tasks. Robot rollouts can be best viewed on our website: https://aina-robot.github.io. Correspondence to irmakguzey@nyu.edu.

Index terms

Dexterous Manipulation; Perception for Grasping and Manipulation; Human Detection and Tracking