ICRA 2026

Knowledge-Guided Manipulation Using Multi-Task Reinforcement Learning

Aditya Narendra, Mukhammadrizo Maribjonov, Dmitry Makarov, Dmitry Yudin, Aleksandr Panov

AI summary

Integrating a dynamically updated, end-to-end trained 3D scene graph into a multi-task RL policy significantly improves sample efficiency, robustness, and generalization for robotic manipulation under partial observability.

Multi-task reinforcement learning; Partially observable manipulation; 3D scene graphs; End-to-end graph learning; Knowledge-guided robotics; Model-based policy optimization

Problem

Robotic manipulation in unstructured environments suffers from partial observability due to occlusions and layout shifts, causing camera-only reactive policies to fail at long-horizon credit assignment and state estimation.

Approach

KG-M3PO fuses a continuously updated 3D scene graph with visual and proprioceptive inputs into a shared latent space, training the graph encoder end-to-end through the RL objective to directly shape relational features for control.
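The data flow described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`mean_pool`, `message_pass`, `fuse_observation`), the mean-aggregation rule, and the concatenation-based fusion are all assumptions chosen for brevity. In the actual framework the encoders are learned and gradients from the RL objective flow back into the graph encoder; this toy version has no trainable weights and only shows how graph features could be pooled and joined with visual and proprioceptive features.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def mean_pool(vectors: List[Vector]) -> Vector:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / n for i in range(dim)]


def message_pass(node_feats: Dict[str, Vector],
                 edges: List[Tuple[str, str]]) -> Dict[str, Vector]:
    """One round of mean aggregation (hypothetical stand-in for a learned GNN layer).

    Each node averages its own features with those of the nodes
    that have a directed edge (src, dst) pointing at it.
    """
    updated = {}
    for node, feat in node_feats.items():
        incoming = [node_feats[src] for src, dst in edges if dst == node]
        updated[node] = mean_pool([feat] + incoming)
    return updated


def fuse_observation(visual: Vector, proprio: Vector,
                     node_feats: Dict[str, Vector],
                     edges: List[Tuple[str, str]]) -> Vector:
    """Concatenate visual, proprioceptive, and pooled graph features."""
    graph_summary = mean_pool(list(message_pass(node_feats, edges).values()))
    return visual + proprio + graph_summary


# Toy scene: a drawer node sends a message to a cup node.
node_feats = {"cup": [1.0, 0.0], "drawer": [0.0, 1.0]}
edges = [("drawer", "cup")]
latent = fuse_observation([0.2, 0.4], [0.1], node_feats, edges)
print(latent)
```

The last two entries of `latent` are the pooled graph summary; a policy head would consume the whole vector as its state.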

Key results

Higher success rates and improved sample efficiency over strong baselines in fully observable settings
Solves inherently partially observable tasks like retrieving occluded objects that camera-only policies cannot handle
Strong generalization to novel objects and unseen scene configurations across Franka and UR5 robots
End-to-end graph encoder training aligns relational representations directly with control performance

Why it matters

Provides a scalable, generalizable framework for learning robust manipulation policies in real-world conditions where objects are frequently occluded or rearranged.

Abstract

This paper introduces Knowledge-Guided Massively Multi-task Model-based Policy Optimization (KG-M3PO), a framework for multi-task robotic manipulation in partially observable settings that unifies Perception, Knowledge, and Policy. KG-M3PO combines a model-based policy optimization control backbone with an online 3D scene graph that grounds open-vocabulary detections into a metric, relational representation. A dynamic-relation mechanism updates spatial, containment, and affordance edges at every step, and a graph neural encoder is trained end-to-end through the RL objective so that relational features are shaped directly by control performance. Multiple observation modalities (visual, proprioceptive, linguistic, and graph-based) are encoded into a shared latent space, upon which the RL agent operates to drive the control loop. The policy conditions on lightweight graph queries alongside visual and proprioceptive inputs, yielding a compact, semantically informed state for decision making. Experiments on a suite of manipulation tasks with occlusions, distractors, and layout shifts demonstrate consistent gains over strong baselines: the knowledge-conditioned agent achieves higher success rates, improved sample efficiency, and stronger generalization to novel objects and unseen scene configurations. These results support the premise that structured, continuously maintained world knowledge is a powerful inductive bias for scalable, generalizable manipulation: when the knowledge module participates in the RL computation graph, relational representations align with control, enabling robust long-horizon behavior under partial observability.
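As a rough illustration of what a per-step relation update might look like, the sketch below recomputes containment and spatial edges from axis-aligned 3D bounding boxes. Everything in it is an assumption made for the example: the `Box` class, the `update_relations` function, the `inside` and `near` edge labels, and the 0.35 m distance threshold do not come from the paper. Affordance edges are omitted because the abstract does not say how they are derived.

```python
from dataclasses import dataclass
from typing import Dict, Set, Tuple

Point = Tuple[float, float, float]
Edge = Tuple[str, str, str]  # (subject, relation, object)


@dataclass
class Box:
    """Axis-aligned 3D bounding box in metres (hypothetical representation)."""
    lo: Point
    hi: Point

    def center(self) -> Point:
        return tuple((a + b) / 2 for a, b in zip(self.lo, self.hi))


def contains(outer: Box, inner: Box) -> bool:
    """True if `inner` lies fully within `outer` on every axis."""
    return all(o_lo <= i_lo and i_hi <= o_hi
               for o_lo, i_lo, i_hi, o_hi
               in zip(outer.lo, inner.lo, inner.hi, outer.hi))


def distance(a: Box, b: Box) -> float:
    """Euclidean distance between box centres."""
    return sum((x - y) ** 2 for x, y in zip(a.center(), b.center())) ** 0.5


def update_relations(objects: Dict[str, Box],
                     near_threshold: float = 0.35) -> Set[Edge]:
    """Rebuild all edges from the current boxes; call once per control step."""
    edges: Set[Edge] = set()
    for a, box_a in objects.items():
        for b, box_b in objects.items():
            if a == b:
                continue
            if contains(box_b, box_a):
                edges.add((a, "inside", b))        # containment edge
            elif not contains(box_a, box_b) and distance(box_a, box_b) <= near_threshold:
                edges.add((a, "near", b))          # spatial edge
    return edges


# Step 1: the cup sits inside the drawer.
scene = {
    "drawer": Box((0.0, 0.0, 0.0), (0.4, 0.4, 0.2)),
    "cup": Box((0.1, 0.1, 0.05), (0.2, 0.2, 0.15)),
    "block": Box((0.5, 0.0, 0.0), (0.6, 0.1, 0.1)),
}
edges_step1 = update_relations(scene)

# Step 2: the cup has been pulled out and placed next to the block.
scene["cup"] = Box((0.45, 0.15, 0.0), (0.55, 0.25, 0.1))
edges_step2 = update_relations(scene)

print(sorted(edges_step1))
print(sorted(edges_step2))
```

Rebuilding every edge at each step is the simplest way to keep the graph consistent after an object moves; a real system would more likely update only the edges touching objects whose pose changed.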

Index terms

Reinforcement Learning; Semantic Scene Understanding; Machine Learning for Robot Control