Knowledge-Guided Manipulation Using Multi-Task Reinforcement Learning
Aditya Narendra, Mukhammadrizo Maribjonov, Dmitry Makarov, Dmitry Yudin, Aleksandr Panov
AI summary
Problem
Robotic manipulation in unstructured environments suffers from partial observability due to occlusions and layout shifts, causing camera-only reactive policies to fail at long-horizon credit assignment and state estimation.
Approach
KG-M3PO fuses a continuously updated 3D scene graph with visual and proprioceptive inputs into a shared latent space, training the graph encoder end-to-end through the RL objective to directly shape relational features for control.
Key results
- Higher success rates and improved sample efficiency over strong baselines in fully observable settings
- Solves inherently partially observable tasks like retrieving occluded objects that camera-only policies cannot handle
- Strong generalization to novel objects and unseen scene configurations across Franka and UR5 robots
- End-to-end graph encoder training aligns relational representations directly with control performance
Why it matters
KG-M3PO offers a scalable, generalizable framework for learning robust manipulation policies in real-world conditions where objects are frequently occluded or rearranged.
Abstract
This paper introduces Knowledge-Guided Massively Multi-task Model-based Policy Optimization (KG-M3PO), a framework for multi-task robotic manipulation in partially observable settings that unifies Perception, Knowledge, and Policy. KG-M3PO augments a model-based policy optimization control backbone with an online 3D scene graph that grounds open-vocabulary detections into a metric, relational representation. A dynamic-relation mechanism updates spatial, containment, and affordance edges at every step, and a graph neural encoder is trained end-to-end through the RL objective so that relational features are shaped directly by control performance. Multiple observation modalities (visual, proprioceptive, linguistic, and graph-based) are encoded into a shared latent space, upon which the RL agent operates to drive the control loop. The policy conditions on lightweight graph queries alongside visual and proprioceptive inputs, yielding a compact, semantically informed state for decision making. Experiments on a suite of manipulation tasks with occlusions, distractors, and layout shifts demonstrate consistent gains over strong baselines: the knowledge-conditioned agent achieves higher success rates, improved sample efficiency, and stronger generalization to novel objects and unseen scene configurations. These results support the premise that structured, continuously maintained world knowledge is a powerful inductive bias for scalable, generalizable manipulation: when the knowledge module participates in the RL computation graph, relational representations align with control, enabling robust long-horizon behavior under partial observability.
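To make the dynamic-relation idea concrete, the following is a minimal sketch of how spatial and containment edges could be recomputed at every step and then queried. It is not the authors' implementation: the names `Node`, `update_relations`, and `query`, the axis-aligned bounding boxes, and the 0.3 m `near_thresh` are all assumptions made for illustration. Affordance edges, open-vocabulary detection, and the learned graph encoder are omitted.

```python
import math
from dataclasses import dataclass
from itertools import permutations
from typing import Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class Node:
    """Hypothetical scene-graph node: a named axis-aligned box in metres."""
    name: str
    lo: Vec3  # minimum corner (x, y, z)
    hi: Vec3  # maximum corner (x, y, z)


def centre(n: Node) -> Vec3:
    """Geometric centre of the node's bounding box."""
    return tuple((a + b) / 2 for a, b in zip(n.lo, n.hi))


def contains(outer: Node, inner: Node) -> bool:
    """True if inner's box lies entirely within outer's box."""
    return all(
        o_lo <= i_lo and i_hi <= o_hi
        for o_lo, i_lo, i_hi, o_hi in zip(outer.lo, inner.lo, inner.hi, outer.hi)
    )


def update_relations(nodes, near_thresh=0.3):
    """Rebuild all edges from current geometry (assumed rule-based update).

    Returns a set of (subject, relation, object) triples.
    """
    edges = set()
    for a, b in permutations(nodes, 2):
        if contains(b, a):
            edges.add((a.name, "inside", b.name))
        elif not contains(a, b) and math.dist(centre(a), centre(b)) <= near_thresh:
            edges.add((a.name, "near", b.name))
    return edges


def query(edges, relation, obj):
    """Lightweight graph query: which subjects stand in `relation` to `obj`?"""
    return sorted(s for s, r, o in edges if r == relation and o == obj)
```

Calling `update_relations` once per control step on the latest object boxes keeps the edge set current, so a query such as `query(edges, "inside", "drawer")` can report an object that is hidden from the camera but still tracked in the graph. In the framework described above, such relational information would be passed through a trained graph encoder rather than used as raw triples.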