ICRA 2026

Latent Representations for Visual Proprioception in Inexpensive Robots

Sahara Sheikholeslami, Ladislau Bölöni

AI summary

Compact latent representations from diverse architectures can effectively estimate the pose of an inexpensive, uncalibrated robot using only a single external camera image.

Visual proprioception; Inexpensive robots; Latent representations; Single-camera pose estimation; Low-cost robotics; Fine-tuned backbones

Problem

Inexpensive robots often lack reliable internal sensors for joint position tracking, yet visual proprioception typically requires calibrated cameras, depth sensors, or simulators. This paper investigates how accurately a fast, single-pass model can recover robot configuration from a single uncalibrated RGB image under these resource-constrained conditions.

Approach

The authors evaluate four compact latent encodings (Conv-VAEs, fine-tuned CNN backbones, fine-tuned ViT backbones, and uncalibrated fiducial marker detections). Each one compresses a single image of the robot into a low-dimensional vector, which a simple MLP regressor then maps to joint angles.
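
As a rough illustration of the regression stage described above, the sketch below passes a latent vector through a small MLP and produces six outputs, one per joint of the 6-DoF robot mentioned in the abstract. The hidden size, the two latent sizes, the initialization, and all function names are assumptions made for illustration. The weights are untrained and the constant input stands in for a real encoder output, so this is not the authors' implementation.

```python
import math
import random

NUM_JOINTS = 6  # the abstract describes a 6-DoF robot


def init_layer(n_in, n_out, rng):
    """Return (weights, biases) for one dense layer with small random weights."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def relu(v):
    return v if v > 0.0 else 0.0


def dense(x, layer, activation=None):
    """Apply one dense layer to vector x, optionally followed by an activation."""
    weights, biases = layer
    out = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
    if activation is not None:
        out = [activation(v) for v in out]
    return out


def make_regressor(latent_size, hidden_size=32, seed=0):
    """Build an untrained MLP: latent vector -> hidden layer -> joint values."""
    rng = random.Random(seed)
    return [
        init_layer(latent_size, hidden_size, rng),
        init_layer(hidden_size, NUM_JOINTS, rng),
    ]


def predict_joints(latent, regressor):
    """Single forward pass from a latent vector to NUM_JOINTS outputs."""
    hidden = dense(latent, regressor[0], activation=relu)
    return dense(hidden, regressor[1])


# Hypothetical latent sizes; the sizes used in the paper are not stated here.
for latent_size in (128, 256):
    regressor = make_regressor(latent_size)
    latent = [0.1] * latent_size  # stand-in for an encoder output
    joints = predict_joints(latent, regressor)
    print(latent_size, len(joints))
```

In this sketch only the first layer depends on the latent size, so the same regressor code can be reused with any of the encoders.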

Key results

Proposed four latent encoding techniques (Conv-VAE, fine-tuned CNN/ViT backbones, uncalibrated fiducial markers)
Introduced a universal, size-agnostic MLP regressor requiring only minimal supervised fine-tuning
Demonstrated component-specific accuracy variations across nine models and two latent sizes
Revealed distinct error and noise patterns to guide encoder selection for specific pose metrics
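
One simple way to look at component-specific accuracy is to compute the error separately for each joint. The sketch below uses mean absolute error per joint; the choice of metric, the function name, and the toy numbers are assumptions for illustration and do not come from the paper.

```python
def per_joint_mae(predictions, targets):
    """Mean absolute error for each joint, averaged over all samples."""
    n_samples = len(predictions)
    n_joints = len(predictions[0])
    totals = [0.0] * n_joints
    for pred, true in zip(predictions, targets):
        for j in range(n_joints):
            totals[j] += abs(pred[j] - true[j])
    return [total / n_samples for total in totals]


# Made-up values for two samples and three joints (not measured data).
predicted = [[0.0, 1.0, 2.0], [2.0, 1.0, 2.0]]
actual = [[0.5, 1.0, 2.25], [1.5, 3.0, 1.75]]
print(per_joint_mae(predicted, actual))  # prints [0.5, 1.0, 0.25]
```

Comparing these per-joint values across encoders is one way to see which representation suits which part of the pose.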

Why it matters

This work shows how affordable robots can estimate their own pose with minimal hardware, which broadens the applicability of vision-based control in unstructured environments.

Abstract

Robotic manipulation requires explicit or implicit knowledge of the robot’s joint positions. Precise proprioception is standard in high-quality industrial robots but is often unavailable in inexpensive robots operating in unstructured environments. In this paper, we ask: to what extent can a fast, single-pass regression architecture perform visual proprioception from a single external camera image, available even in the simplest manipulation settings? We explore several latent representations, including CNNs, VAEs, ViTs, and bags of uncalibrated fiducial markers, using fine-tuning techniques adapted to the limited data available. We evaluate the achievable accuracy through experiments on an inexpensive 6-DoF robot.

Index terms

Perception for Grasping and Manipulation; Deep Learning for Visual Perception