Learning to Grasp by Integrating Human Preferences and Success Feedback
Juyeol Park, Byungjin Ko, Jong-Wan Yoon, Taejoon Park, Homin Park
AI summary
Problem
Designing reliable reward functions for end-to-end robotic grasping remains difficult. Handcrafted rewards are prone to reward hacking, while preference-based reward models often align with human intuition but neither guarantee successful physical execution nor generalize to new environments.
Approach
The authors propose a three-stage framework that trains a reward model on human preferences and combines it with binary success feedback into a Weighted Success Reward to fine-tune the grasping policy.
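The idea of combining a learned preference score with a binary success signal can be illustrated with a minimal sketch. The blending rule, the function name `weighted_success_reward`, and the weight `alpha` are all assumptions made for illustration; the summary does not state the paper's actual WSR formula, so this should not be read as the authors' definition.

```python
import math


def _sigmoid(x: float) -> float:
    """Numerically stable logistic function mapping any real score to (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def weighted_success_reward(preference_score: float, success: bool, alpha: float = 0.5) -> float:
    """Hypothetical blend of a preference-model score and a binary success flag.

    ASSUMPTION: a convex combination is used here purely for illustration;
    the actual WSR in the paper may be defined differently.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    preference_term = _sigmoid(preference_score)  # squash raw score to (0, 1)
    success_term = 1.0 if success else 0.0        # binary execution outcome
    return alpha * success_term + (1.0 - alpha) * preference_term


if __name__ == "__main__":
    # A preferred-looking grasp that fails still earns less than one that succeeds.
    print(weighted_success_reward(2.0, success=False))
    print(weighted_success_reward(2.0, success=True))
```

Under this assumed rule, `alpha` controls how much the physical outcome dominates the human-preference signal: `alpha = 1` reduces to a pure success reward, and `alpha = 0` to a pure preference reward.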
Key results
- Introduces the first end-to-end RLHF framework for robotic grasping in cluttered scenes
- Curates a standardized human preference dataset for grasping with explicit labeling guidelines
- Achieves higher success and completion rates with fewer collisions in simulation
- Transfers to real-world hardware with less performance degradation than baseline methods
Why it matters
Provides a practical pathway for aligning robotic manipulation with human intuition while ensuring robust real-world execution, which benefits researchers and engineers working on safe robot control.
Abstract
End-to-end robotic grasping increasingly relies on reinforcement learning to enable safe and precise execution, yet defining a reward that consistently drives such behavior remains a central challenge. Human-engineered rewards have been widely explored, but they are prone to reward hacking, depend heavily on artificial design choices, and often fail to capture human intuition. Preference-based reward models offer a promising alternative by aligning policies with human feedback, but their application to robotic grasping has remained limited, and preference-aligned actions do not always translate into successful execution. We propose Human Preference and Success-based Grasping (HPSG), a three-stage framework that combines pre-training, reward modeling, and fine-tuning. At its core is the Weighted Success Reward (WSR), which integrates a preference-trained reward model with binary success feedback so that policies learn behaviors that are effective in practice and aligned with human judgment. This design resolves the mismatch between subjective preferences and execution outcomes, thereby improving reliability. Through extensive simulation and real-world experiments, we show that HPSG produces reliable grasping policies, achieving higher success and completion rates, reducing collisions, and transferring to physical settings with smaller performance degradation than baseline methods. Our code is publicly available at: https://github.com/qkrwnduf1997/HPSG
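For readers unfamiliar with how a reward model is typically fitted to pairwise human preferences, the sketch below shows the widely used Bradley-Terry objective. This is an assumption about the general technique: the abstract does not specify which loss HPSG uses, and the function name `preference_loss` is hypothetical.

```python
import math


def preference_loss(score_preferred: float, score_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood for one labeled pair.

    Computes -log(sigmoid(s_preferred - s_rejected)), written as a
    numerically stable softplus of the negated score difference.
    ASSUMPTION: illustrative only; not confirmed as the loss used in HPSG.
    """
    diff = score_preferred - score_rejected
    # softplus(-diff) = max(-diff, 0) + log(1 + exp(-|diff|))
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


if __name__ == "__main__":
    # The loss shrinks as the model scores the human-preferred grasp higher.
    print(preference_loss(2.0, 0.0))
    print(preference_loss(0.0, 2.0))
```

Minimizing this loss over a dataset of labeled pairs pushes the model to assign higher scores to the grasps that annotators preferred, which is what allows the resulting scalar score to serve as a reward signal during fine-tuning.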