ICRA 2026

Efficient Real-World Benchmarking for Practical Fine-Grained Product Identification in Retail Robotics for Picking and Stock Taking

Jochen Lindermayr, Florian Jordan, Cagatay Odabasi, Werner Kraus, Richard Bormann, Marco F. Huber

AI summary

Frozen vision foundation features, combined with kNN retrieval or a lightweight classifier head, enable rapid fine-grained product identification without retraining the backbone, though distinguishing near-duplicates on real shelves remains challenging.

Fine-grained recognition, Retail robotics, Foundation models, Sim2real benchmarking, Semi-automated annotation, Product identification

Problem

Fine-grained product identification remains a bottleneck for retail robotics due to dynamic assortments and thousands of near-duplicates, yet existing benchmarks lack real-world shelf data paired with synthetic scenes for controlled sim2real evaluation.

Approach

We introduce a semi-automated, robot-assisted pipeline that captures real shelf scenes and projects 3D ground truth into each image for dense annotation. We then evaluate training-free and fast recognition approaches (kNN and a lightweight classifier head) on frozen DINOv3 features across synthetic and real scenes.
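The projection step can be illustrated with a minimal sketch. It assumes an ideal pinhole camera without lens distortion and 3D box corners already expressed in the camera frame (z pointing forward). The function names and the intrinsics `fx`, `fy`, `cx`, `cy` are hypothetical and are not taken from the paper, whose actual annotation pipeline may differ.

```python
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]
Box2D = Tuple[float, float, float, float]


def project_points(points_cam: Sequence[Point3D],
                   fx: float, fy: float,
                   cx: float, cy: float) -> List[Tuple[float, float]]:
    """Project camera-frame 3D points to pixel coordinates (pinhole model)."""
    pixels = []
    for x, y, z in points_cam:
        if z <= 0:
            raise ValueError("point lies behind the camera")
        pixels.append((fx * x / z + cx, fy * y / z + cy))
    return pixels


def bbox_from_corners(corners_cam: Sequence[Point3D],
                      fx: float, fy: float, cx: float, cy: float,
                      width: int, height: int) -> Box2D:
    """Return (xmin, ymin, xmax, ymax) enclosing the projected corners,
    clipped to the image bounds."""
    pix = project_points(corners_cam, fx, fy, cx, cy)
    us = [u for u, _ in pix]
    vs = [v for _, v in pix]
    xmin = min(max(min(us), 0.0), float(width))
    xmax = min(max(max(us), 0.0), float(width))
    ymin = min(max(min(vs), 0.0), float(height))
    ymax = min(max(max(vs), 0.0), float(height))
    return (xmin, ymin, xmax, ymax)


# Hypothetical product box: 0.2 m wide and high, 0.2 m deep, 2 m from the camera.
corners = [(x, y, z)
           for x in (-0.1, 0.1)
           for y in (-0.1, 0.1)
           for z in (2.0, 2.2)]
box = bbox_from_corners(corners, fx=500.0, fy=500.0, cx=320.0, cy=240.0,
                        width=640, height=480)
```

Because the 3D ground truth is recorded once per scene, the same corners can be reprojected into every captured view, which is what makes this kind of annotation dense and cheap.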

Key results

Semi-automated robot-assisted pipeline for efficient, dense shelf scene annotation
Extended IPA-3D1K dataset with 130 near-duplicate SKUs across real shelf scenes and controlled lighting/occlusion
Lightweight classifier head improves over kNN by ~11 pp on FineGrainedOCR, narrowing the gap to fully trained models to 1.9–5.3 pp
Synthetic-scene retrieval on IPA-3D1K reaches ~90% Top-1 and ~95% Top-2 accuracy; confidence thresholds support triage at inference time, and a neighborhood-based risk signal predicts confusion during training
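As an illustration of kNN retrieval with confidence-based triage, the sketch below ranks gallery embeddings by cosine similarity, takes a similarity-weighted vote, and accepts the prediction only when the winning vote share passes a threshold. The vote-share confidence, the threshold value, and all names are assumptions made for this example; the paper's own confidence measure may be defined differently.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]
Gallery = List[Tuple[str, Vector]]  # (SKU label, frozen feature embedding)


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def knn_predict(query: Vector, gallery: Gallery, k: int = 3) -> Tuple[str, float]:
    """Return (best SKU, confidence) from a similarity-weighted vote
    over the k nearest gallery embeddings."""
    if not gallery:
        raise ValueError("gallery is empty")
    neighbours = sorted(((cosine(query, emb), sku) for sku, emb in gallery),
                        reverse=True)[:k]
    votes: Dict[str, float] = defaultdict(float)
    for sim, sku in neighbours:
        votes[sku] += max(sim, 0.0)  # ignore negative similarities
    total = sum(votes.values())
    if total == 0.0:
        return neighbours[0][1], 0.0
    best = max(votes, key=lambda s: votes[s])
    return best, votes[best] / total


def triage(query: Vector, gallery: Gallery,
           k: int = 3, threshold: float = 0.8) -> Tuple[str, str]:
    """Accept confident predictions; route the rest to a review step."""
    sku, confidence = knn_predict(query, gallery, k)
    return ("accept" if confidence >= threshold else "review", sku)


# Toy 2-D embeddings standing in for frozen backbone features.
gallery = [
    ("cola_330ml", [1.0, 0.0]),
    ("cola_330ml", [0.9, 0.1]),
    ("cola_zero_330ml", [0.7, 0.7]),
    ("water_500ml", [0.0, 1.0]),
]
query = [1.0, 0.0]
```

Adding or removing a product only changes the `gallery` list, so no retraining is needed. In this toy gallery, `triage(query, gallery, k=2)` accepts the match, while `k=3` pulls in a look-alike neighbour and defers the same query to review at the default threshold.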

Why it matters

Enables scalable, rapidly updatable product recognition systems for retail robotics and healthcare inventory tracking by providing a standardized real-world benchmark and demonstrating the viability of frozen foundation models.

Abstract

The rapid evolution of retail robotics is set to transform in-store operations through advanced automation, spanning vision-based inventory tracking, order picking, packing, and restocking. Yet fine-grained product identification remains a bottleneck: assortments change, packaging evolves, and shelves host thousands of near-duplicates, requiring perception systems that can adapt quickly with minimal setup. This paper targets that gap with two contributions. First, we present a semi-automated, robot-assisted acquisition pipeline that records 3D scene ground truth via iterative placement, projecting it into each image, yielding dense, low-cost annotations at scale. Second, we extend IPA-3D1K with challenging real shelf scenes containing 130 near-duplicate SKUs. While scenes are not paired one-to-one, the same product set appears across synthetic and real images, enabling controlled, object-level sim/real analyses under occlusion, rearrangement, and lighting variation. Using frozen DINOv3 features, our baseline recognition pipeline allows index updates in minutes. We evaluate training-free or fast approaches (kNN and a lightweight classifier head) to assess the capabilities and limitations of this representation in fine-grained retail identification. Experiments show that on the FineGrainedOCR dataset the lightweight head improves over kNN by ∼11 percentage points, narrowing the gap to fully trained models to 1.9–5.3 pp. On IPA-3D1K (1,000 SKUs), synthetic-scene retrieval is strong (Top-1 ≈90%, Top-2 ≈95%), while exact disambiguation among near-duplicates remains challenging. We find that confidence thresholds enable targeted triage during inference, and a neighborhood-based risk signal predicts confusion during training, indicating where specialized modules are most beneficial.
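The abstract does not define the neighborhood-based risk signal. One plausible form, shown here purely as an assumption, scores each SKU by the average fraction of its gallery embeddings' nearest neighbours that belong to a different SKU. The function name `neighbourhood_risk`, the choice of cosine similarity, and the toy embeddings are all hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]
Gallery = List[Tuple[str, Vector]]  # (SKU label, embedding)


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def neighbourhood_risk(gallery: Gallery, k: int = 3) -> Dict[str, float]:
    """Per-SKU mean fraction of the k nearest gallery neighbours
    (excluding the item itself) that carry a different SKU label."""
    per_sku: Dict[str, List[float]] = defaultdict(list)
    for i, (sku_i, emb_i) in enumerate(gallery):
        others = [(cosine(emb_i, emb_j), sku_j)
                  for j, (sku_j, emb_j) in enumerate(gallery) if j != i]
        nearest = sorted(others, reverse=True)[:k]
        if nearest:
            foreign = sum(1 for _, sku_j in nearest if sku_j != sku_i)
            per_sku[sku_i].append(foreign / len(nearest))
    return {sku: sum(vals) / len(vals) for sku, vals in per_sku.items()}


# Toy embeddings: A and B are interleaved near-duplicates, C is well separated.
gallery = [
    ("A", [1.00, 0.00, 0.0]),
    ("A", [0.98, 0.05, 0.0]),
    ("B", [0.99, 0.02, 0.0]),
    ("B", [0.97, 0.06, 0.0]),
    ("C", [0.00, 0.00, 1.0]),
    ("C", [0.00, 0.05, 1.0]),
]
risk = neighbourhood_risk(gallery, k=1)
```

SKUs with a high score sit in crowded regions of the embedding space and are candidates for a specialized module, while low-scoring SKUs can likely rely on plain retrieval.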

Index terms

Inventory Management; Data Sets for Robotic Vision; Recognition