ICRA 2026

Bi-Manual Joint Camera Calibration and Scene Representation

Haozhan Tang, Tianyi Zhang, Matthew Johnson-Roberson, Weiming Zhi

AI summary

Bi-JCR enables marker-free, joint calibration of dual robot arms and metric-scale 3D scene reconstruction using only wrist-mounted RGB cameras and 3D foundation models.

Bi-manual calibration, marker-free calibration, 3D foundation models, hand-eye calibration, metric reconstruction, bimanual manipulation

Problem

Calibrating the wrist-mounted cameras of two manipulators and aligning their coordinate frames traditionally requires cumbersome, marker-based offline procedures that fail in dynamic or cluttered environments.

Approach

The framework uses 3D foundation models to extract dense, unscaled multi-view correspondences from RGB images, then jointly optimizes camera-to-end-effector transforms, inter-arm base poses, and a global scale factor via gradient descent on transformation manifolds.
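To make the role of the global scale factor concrete, the sketch below evaluates a scaled hand-eye constraint of the classic form A X = X B(λ) on synthetic data. This is an illustration under stated assumptions, not the paper's actual objective or optimizer: A is a relative end-effector motion from forward kinematics, X is the camera-to-end-effector transform, and B is the relative camera motion from a foundation model, whose translation is only known up to the scale λ. All function names and numeric values are hypothetical, and plain Python lists are used to keep the example dependency-free.

```python
import math

def rigid(axis, angle, t):
    """4x4 rigid transform: rotation by `angle` about the x or z axis, plus translation t."""
    c, s = math.cos(angle), math.sin(angle)
    R = {"x": [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]],
         "z": [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]}[axis]
    return [R[i] + [float(t[i])] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]

def matmul(A, B):
    """Matrix product of two nested-list matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def inverse(T):
    """Inverse of a rigid transform [R | t], which is [R^T | -R^T t]."""
    Rt = [[T[j][i] for j in range(3)] for i in range(3)]
    t = [-sum(Rt[i][k] * T[k][3] for k in range(3)) for i in range(3)]
    return [Rt[i] + [t[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]

def rescale(T, lam):
    """Scale only the translation part of T by lam; the rotation is unchanged."""
    return [row[:3] + [lam * row[3]] for row in T[:3]] + [list(T[3])]

def residual(A, X, B_unscaled, lam):
    """Frobenius norm of A X - X B(lam), where B(lam) has its translation rescaled."""
    lhs = matmul(A, X)
    rhs = matmul(X, rescale(B_unscaled, lam))
    return math.sqrt(sum((l - r) ** 2
                         for lrow, rrow in zip(lhs, rhs)
                         for l, r in zip(lrow, rrow)))

# Synthetic ground truth (all values are made up for illustration).
X_true = rigid("x", 0.3, (0.02, -0.01, 0.05))   # camera pose in the end-effector frame
lam_true = 2.5                                   # metres per foundation-model unit
E_i = rigid("z", 0.4, (0.30, 0.10, 0.20))        # end-effector pose at view i (base frame)
E_j = rigid("x", -0.5, (0.25, -0.10, 0.30))      # end-effector pose at view j (base frame)

A = matmul(inverse(E_i), E_j)                          # relative end-effector motion
B_metric = matmul(inverse(X_true), matmul(A, X_true))  # implied relative camera motion
B_unscaled = rescale(B_metric, 1.0 / lam_true)         # what an unscaled model would report

err_true = residual(A, X_true, B_unscaled, lam_true)   # consistent scale
err_wrong_scale = residual(A, X_true, B_unscaled, 1.0) # scale ignored
print(err_true, err_wrong_scale)
```

The residual vanishes (up to floating-point error) only when both the transform and the scale are consistent with the robot's kinematics, which is why an optimizer can recover metric scale without a marker. A joint optimizer would sum such residuals over many view pairs and both arms.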

Key results

Marker-free hand-eye calibration for both arms simultaneously
Accurate recovery of inter-manipulator relative poses and metric scale
Dense, size-consistent 3D workspace reconstruction from RGB images alone
Successful downstream bimanual grasping and object handover tasks

Why it matters

Enables robust, marker-free bimanual robot coordination and environment understanding for real-world manipulation without specialized calibration hardware.

Abstract

Robot manipulation, especially bimanual manipulation, often requires setting up multiple cameras on multiple robot manipulators. Before robot manipulators can generate motion or even build representations of their environments, the cameras rigidly mounted to the robot need to be calibrated. Camera calibration is a cumbersome process involving collecting a set of images, with each capturing a predetermined marker. In this work, we introduce the Bi-Manual Joint Calibration and Representation Framework (Bi-JCR). Bi-JCR enables multiple robot manipulators, each with cameras mounted, to circumvent taking images of calibration markers. By leveraging 3D foundation models for dense, marker-free multi-view correspondence, Bi-JCR jointly estimates: (i) the extrinsic transformation from each camera to its end-effector, (ii) the inter-arm relative poses between manipulators, and (iii) a unified, scale-consistent 3D representation of the shared workspace, all from the same captured RGB image sets. The representation, jointly constructed from images captured by cameras on both manipulators, lives in a common coordinate frame and supports collision checking and semantic segmentation to facilitate downstream bimanual coordination tasks. We empirically evaluate the robustness of Bi-JCR on a variety of tabletop environments, and demonstrate its applicability on a variety of downstream tasks.
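As a rough illustration of how the three estimated quantities combine, the sketch below maps an unscaled camera-frame point from each arm into one shared frame by chaining the scale factor, the camera-to-end-effector transform, the end-effector pose, and the inter-arm pose. It assumes the base of arm 1 serves as the common frame and that a single global scale applies to both cameras; the function names, frame names, and numbers are all hypothetical and not taken from the paper.

```python
import math

def pose(yaw, t):
    """4x4 rigid transform: rotation about z by `yaw`, plus translation t."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [[c, -s, 0.0, t[0]],
            [s,  c, 0.0, t[1]],
            [0.0, 0.0, 1.0, t[2]],
            [0.0, 0.0, 0.0, 1.0]]

def compose(*transforms):
    """Left-to-right product of 4x4 transforms."""
    out = transforms[0]
    for T in transforms[1:]:
        out = [[sum(out[i][k] * T[k][j] for k in range(4)) for j in range(4)]
               for i in range(4)]
    return out

def to_common_frame(p_cam_unscaled, scale, *chain):
    """Rescale a camera-frame point to metres, then apply the transform chain.

    `chain` lists the transforms from the common frame down to the camera,
    e.g. (base1_T_base2, base2_T_ee2, ee2_T_cam2).
    """
    x, y, z = (scale * v for v in p_cam_unscaled)
    T = compose(*chain)
    return tuple(T[i][0] * x + T[i][1] * y + T[i][2] * z + T[i][3] for i in range(3))

# Made-up calibration results for two arms facing each other.
scale = 2.0                                      # metres per foundation-model unit
base1_T_ee1 = pose(0.0, (0.5, 0.0, 0.4))         # forward kinematics, arm 1
ee1_T_cam1 = pose(0.0, (0.0, 0.0, 0.1))          # hand-eye transform, arm 1
base1_T_base2 = pose(math.pi, (1.4, 0.0, 0.0))   # inter-arm pose
base2_T_ee2 = pose(0.0, (0.5, 0.0, 0.4))         # forward kinematics, arm 2
ee2_T_cam2 = pose(0.0, (0.0, 0.0, 0.1))          # hand-eye transform, arm 2

# The same physical point as observed (unscaled) by each wrist camera.
p_from_arm1 = to_common_frame((0.05, 0.05, -0.2), scale, base1_T_ee1, ee1_T_cam1)
p_from_arm2 = to_common_frame((0.15, -0.05, -0.2), scale,
                              base1_T_base2, base2_T_ee2, ee2_T_cam2)
print(p_from_arm1, p_from_arm2)
```

With consistent calibration, both observations land on the same point in the common frame, which is what allows the two arms' reconstructions to be fused into a single representation.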

Index terms

Perception for Grasping and Manipulation