Efficient Real-World Benchmarking for Practical Fine-Grained Product Identification in Retail Robotics for Picking and Stock Taking
Jochen Lindermayr, Florian Jordan, Cagatay Odabasi, Werner Kraus, Richard Bormann, Marco F. Huber
AI summary
Problem
Fine-grained product identification remains a bottleneck for retail robotics due to dynamic assortments and thousands of near-duplicates, yet existing benchmarks lack real-world shelf data whose product set also appears in synthetic scenes, which controlled sim2real evaluation requires.
Approach
We introduce a semi-automated, robot-assisted pipeline that captures real shelf scenes and projects 3D ground truth into images for dense annotation, then evaluates training-free recognition pipelines using frozen DINOv3 features across synthetic and real environments.
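The projection step can be illustrated with a minimal sketch. It assumes a standard pinhole camera model without lens distortion and a product represented by the eight corners of its 3D bounding box. The function names, the intrinsics (fx, fy, cx, cy), and the world-to-camera pose (R, t) are illustrative assumptions, not the pipeline's actual interface.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def world_to_camera(R: Sequence[Sequence[float]], t: Vec3, p: Vec3) -> Vec3:
    """Apply the rigid transform p_cam = R @ p_world + t (assumed pose convention)."""
    x, y, z = (sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))
    return (x, y, z)


def project_point(p_cam: Vec3, fx: float, fy: float, cx: float, cy: float) -> Tuple[float, float]:
    """Pinhole projection of a camera-frame point to pixel coordinates."""
    x, y, z = p_cam
    if z <= 0:
        raise ValueError("point is behind the camera")
    return (fx * x / z + cx, fy * y / z + cy)


def box_to_bbox(corners_world: List[Vec3], R, t, fx, fy, cx, cy) -> Tuple[float, float, float, float]:
    """Project 3D box corners and return the enclosing 2D box (u_min, v_min, u_max, v_max)."""
    pixels = [project_point(world_to_camera(R, t, c), fx, fy, cx, cy) for c in corners_world]
    us = [u for u, _ in pixels]
    vs = [v for _, v in pixels]
    return (min(us), min(vs), max(us), max(vs))
```

Because the 3D placement is recorded once per scene, the same corners can be projected into every camera view, so the annotation cost does not grow with the number of images. A real pipeline would also need to handle occlusion and image-boundary clipping, which this sketch omits.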
Key results
- Semi-automated robot-assisted pipeline for efficient, dense shelf scene annotation
- Extended IPA-3D1K dataset with 130 near-duplicate SKUs across real shelf scenes and controlled lighting/occlusion
- Lightweight classifier head improves over kNN by ~11 pp on FineGrainedOCR, narrowing the gap to fully trained models to 1.9–5.3 pp
- Synthetic-scene retrieval achieves ~90% Top-1 accuracy, while confidence thresholds and neighborhood risk signals effectively guide inference triage
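As a rough illustration of confidence-based triage, the sketch below accepts a prediction when its top score clears a threshold and otherwise defers the top candidates to a downstream step. The function name, the threshold value of 0.8, and the example SKU names are assumptions chosen for illustration; the summary does not specify them.

```python
from typing import Dict, List, Tuple


def triage(scores: Dict[str, float], accept_threshold: float = 0.8, top_n: int = 2) -> Tuple[str, List[str]]:
    """Return ("accept", [label]) for confident predictions, else ("defer", top candidates)."""
    if not scores:
        raise ValueError("scores must not be empty")
    # Rank candidate SKUs from most to least confident.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    best_label, best_score = ranked[0]
    if best_score >= accept_threshold:
        return "accept", [best_label]
    # Low confidence: hand the closest candidates to a specialized module or a human.
    return "defer", [label for label, _ in ranked[:top_n]]
```

Deferred cases keep the top candidates because, per the results above, the correct SKU is usually among the first two retrieved even when exact disambiguation fails.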
Why it matters
Enables scalable, rapidly updatable product recognition systems for retail robotics and healthcare inventory tracking by providing a standardized real-world benchmark and demonstrating the viability of frozen foundation models.
Abstract
The rapid evolution of retail robotics is set to transform in-store operations through advanced automation, spanning vision-based inventory tracking, order picking, packing, and restocking. Yet fine-grained product identification remains a bottleneck: assortments change, packaging evolves, and shelves host thousands of near-duplicates, requiring perception systems that can adapt quickly with minimal setup. This paper targets that gap with two contributions. First, we present a semi-automated, robot-assisted acquisition pipeline that records 3D scene ground truth via iterative placement, projecting it into each image, yielding dense, low-cost annotations at scale. Second, we extend IPA-3D1K with challenging real shelf scenes containing 130 near-duplicate SKUs. While scenes are not paired one-to-one, the same product set appears across synthetic and real images, enabling controlled, object-level sim/real analyses under occlusion, rearrangement, and lighting variation. Using frozen DINOv3 features, our baseline recognition pipeline allows index updates in minutes. We evaluate training-free or fast approaches (kNN and a lightweight classifier head) to assess the capabilities and limitations of this representation in fine-grained retail identification. Experiments show that on the FineGrainedOCR dataset the lightweight head improves over kNN by ~11 percentage points, narrowing the gap to fully trained models to 1.9–5.3 pp. On IPA-3D1K (1,000 SKUs), synthetic-scene retrieval is strong (Top-1 ≈90%, Top-2 ≈95%), while exact disambiguation among near-duplicates remains challenging. We find that confidence thresholds enable targeted triage during inference, and a neighborhood-based risk signal predicts confusion during training, indicating where specialized modules are most beneficial.
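To make the training-free kNN baseline concrete, here is a minimal sketch of a feature index with cosine-similarity lookup. It assumes features have already been extracted by a frozen backbone and are passed in as plain lists of floats. The class name `FeatureIndex`, its methods, the toy 3-dimensional vectors, and the majority-vote rule are illustrative assumptions, not the paper's implementation.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def l2_normalize(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


class FeatureIndex:
    """Gallery of (feature, SKU) pairs; adding a product needs no retraining."""

    def __init__(self) -> None:
        self.vectors: List[List[float]] = []
        self.labels: List[str] = []

    def add(self, sku: str, feature: Sequence[float]) -> None:
        """Index update: append one normalized reference feature for a SKU."""
        self.vectors.append(l2_normalize(feature))
        self.labels.append(sku)

    def query(self, feature: Sequence[float], k: int = 3) -> List[Tuple[float, str]]:
        """Return the k most similar gallery entries as (similarity, SKU) pairs."""
        q = l2_normalize(feature)
        sims = [(sum(a * b for a, b in zip(q, v)), label)
                for v, label in zip(self.vectors, self.labels)]
        sims.sort(key=lambda pair: pair[0], reverse=True)
        return sims[:k]

    def predict(self, feature: Sequence[float], k: int = 3) -> str:
        """Majority vote over the k nearest neighbors; ties go to the closest one."""
        votes = Counter(label for _, label in self.query(feature, k))
        return votes.most_common(1)[0][0]
```

Since the backbone stays frozen, supporting a new product only requires extracting its reference features and calling `add`, which is consistent with the index updates in minutes described above. At the scale of 1,000 SKUs a brute-force scan like this is usually adequate; larger galleries would likely call for an approximate nearest-neighbor library.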