ICRA 2026

RoboEye: Enhancing 2D Robotic Object Identification with Selective 3D Geometric Keypoint Matching

Xingwu Zhang, Guanxuan Li, Zhuocheng Zhang, Zijun Long

AI summary

Dynamically augmenting 2D appearance features with selective 3D geometric reasoning improves warehouse object identification accuracy without requiring explicit 3D sensors.

object identification, 3D geometric reasoning, RGB-only retrieval, warehouse automation, keypoint matching, domain adaptation

Problem

Reliance on 2D appearance features causes sharp performance drops in warehouse object identification due to viewpoint shifts, occlusion, and packaging variations, while explicit 3D inputs are costly and complex to deploy.

Approach

RoboEye uses a two-stage pipeline. The first stage ranks candidates using 2D features. A lightweight module then decides whether 3D re-ranking is likely to help and triggers the second stage only in that case. The second stage uses a keypoint-based matcher to compute geometric confidence from RGB images alone.
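
The control flow of such a two-stage pipeline can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names (identify, margin_gate), the margin-based gate, and the toy feature vectors and geometric scores are all assumptions made for the example. In the paper the gate is a learned module and the geometric score comes from a keypoint-based matcher.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def identify(
    query_feat: Vector,
    gallery: Dict[str, Vector],
    should_rerank: Callable[[List[float]], bool],
    geometric_score: Callable[[str], float],
    top_k: int = 3,
) -> List[str]:
    """Stage 1: rank by 2D similarity. Stage 2 (optional): re-rank the top-k."""
    scored = sorted(
        ((cosine(query_feat, feat), item_id) for item_id, feat in gallery.items()),
        reverse=True,
    )
    ranking = [item_id for _, item_id in scored]
    top_scores = [score for score, _ in scored[:top_k]]

    # Hypothetical gate: stands in for the learned module described in the text.
    if not should_rerank(top_scores):
        return ranking

    # Re-order only the top-k candidates by the geometric score.
    head = sorted(ranking[:top_k], key=geometric_score, reverse=True)
    return head + ranking[top_k:]


def margin_gate(top_scores: List[float], margin: float = 0.05) -> bool:
    """Toy gate (assumption): re-rank when the top two 2D scores are nearly tied."""
    return len(top_scores) > 1 and (top_scores[0] - top_scores[1]) < margin


# Made-up 2D features and geometric confidences, for illustration only.
gallery = {"box_a": [1.0, 0.0], "box_b": [0.98, 0.2], "mug": [0.0, 1.0]}
geo = {"box_a": 0.2, "box_b": 0.9, "mug": 0.1}

result = identify([1.0, 0.05], gallery, margin_gate, lambda item_id: geo[item_id], top_k=2)
print(result)
```

In this toy run the two boxes are nearly tied on 2D similarity, so the gate fires and the geometric score decides their order. When the gate returns False, the 2D ranking is returned unchanged and no second-stage cost is paid.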

Key results

First framework to dynamically augment 2D retrieval with implicit 3D geometric re-ranking
MRR-driven training scheme for selective 3D re-ranking activation
Keypoint-based matcher replacing cosine similarity with confidence-weighted geometric correspondences
Outperforms prior SOTA (RoboLLM) by up to 7.1% on Recall@1 on Amazon ARMBench
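
For readers unfamiliar with the metrics named above, the following Python sketch shows how Recall@1 and mean reciprocal rank (MRR) are conventionally computed from ranked candidate lists. The gate_label helper is an assumption: it shows one plausible way an MRR-driven scheme could derive a binary training label for the re-ranking gate, and is not taken from the paper.

```python
from typing import List


def reciprocal_rank(ranking: List[str], target: str) -> float:
    """1 / (1-based position of target), or 0.0 if target is absent."""
    for position, item_id in enumerate(ranking, start=1):
        if item_id == target:
            return 1.0 / position
    return 0.0


def recall_at_1(rankings: List[List[str]], targets: List[str]) -> float:
    """Fraction of queries whose top-ranked candidate is the correct item."""
    hits = sum(1 for r, t in zip(rankings, targets) if r and r[0] == t)
    return hits / len(targets)


def mean_reciprocal_rank(rankings: List[List[str]], targets: List[str]) -> float:
    """Average reciprocal rank over all queries."""
    return sum(reciprocal_rank(r, t) for r, t in zip(rankings, targets)) / len(targets)


def gate_label(ranking_2d: List[str], ranking_3d: List[str], target: str) -> int:
    """Assumed labelling rule: 1 when 3D re-ranking raises the reciprocal rank."""
    return int(reciprocal_rank(ranking_3d, target) > reciprocal_rank(ranking_2d, target))


# Toy rankings for two queries whose correct item is "a" in both cases.
rankings = [["a", "b", "c"], ["b", "a", "c"]]
targets = ["a", "a"]

r1 = recall_at_1(rankings, targets)
mrr = mean_reciprocal_rank(rankings, targets)
label = gate_label(["b", "a", "c"], ["a", "b", "c"], "a")
```

Under this assumed rule, a query is labelled positive only when re-ranking moves the correct item upward, so a gate trained on such labels would learn to skip re-ranking where it is neutral or harmful.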

Why it matters

Enables cost-effective, robust robotic object identification for large-scale warehouse automation using only standard RGB cameras, reducing hardware costs and deployment complexity.

Abstract

The rapidly growing number of product categories in large-scale e-commerce makes accurate object identification for automated packing in warehouses substantially more difficult. As the catalog grows, intra-class variability and a long tail of rare or visually similar items increase. When combined with diverse packaging, cluttered containers, frequent occlusion, and large viewpoint changes, these factors amplify discrepancies between query and reference images, causing sharp performance drops for methods that rely solely on 2D appearance features. Thus, we propose RoboEye, a two-stage identification framework that dynamically augments 2D semantic features with domain-adapted 3D reasoning and lightweight adapters to bridge training–deployment gaps. In the first stage, we train a large vision model to extract 2D features for generating candidate rankings. A lightweight 3D-feature-awareness module then estimates 3D feature quality and predicts whether 3D re-ranking is necessary, preventing performance degradation and avoiding unnecessary computation. When invoked, the second stage uses our robot 3D retrieval transformer, comprising a 3D feature extractor that produces geometry-aware dense features and a keypoint-based matcher that computes keypoint-correspondence confidences between query and reference images instead of conventional cosine-similarity scoring. Experiments show that RoboEye improves Recall@1 by up to 7.1% over the prior state-of-the-art (RoboLLM). Moreover, RoboEye operates using only RGB images, avoiding reliance on explicit 3D inputs and reducing deployment costs. The code used in this paper is publicly available at https://github.com/longkukuhi/RoboEye.
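
To make the contrast with cosine-similarity scoring concrete, the sketch below scores a query-reference pair from a list of keypoint correspondences, each carrying a confidence value. The Match tuple layout, the min_conf threshold, and the sum-of-confidences aggregation are assumptions chosen for illustration. The abstract does not specify how RoboEye aggregates correspondence confidences, and the confidence values here are made up.

```python
from typing import Dict, List, Tuple

# (query_keypoint_index, reference_keypoint_index, confidence in [0, 1])
Match = Tuple[int, int, float]


def geometric_confidence(matches: List[Match], min_conf: float = 0.5) -> float:
    """Sum of confidences for matches at or above min_conf (assumed aggregation)."""
    return sum(conf for _, _, conf in matches if conf >= min_conf)


def rank_by_geometry(candidates: Dict[str, List[Match]]) -> List[str]:
    """Order reference items by descending geometric confidence."""
    return sorted(candidates, key=lambda ref: geometric_confidence(candidates[ref]), reverse=True)


# Hypothetical matcher output for one query against two reference images.
candidates = {
    "ref_a": [(0, 3, 0.9), (1, 7, 0.8), (2, 2, 0.3)],
    "ref_b": [(0, 1, 0.6), (4, 4, 0.2)],
}

order = rank_by_geometry(candidates)
score_a = geometric_confidence(candidates["ref_a"])
```

Unlike a single cosine score between two global embeddings, a score built this way depends on how many local correspondences agree and how confident each one is, which is why it can separate items that look alike overall but differ in shape.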

Index terms

Object Detection, Segmentation and Categorization; Foundations of Automation; Deep Learning Methods