ICRA 2026

PerceptTwin: Semantic Scene Reconstruction for Iterative LLM Planning and Verification

Charlie Gauthier, Sacha Morin, Liam Paull

AI summary

Automatically converting robot perception maps into interactive simulations enables LLM planners to iteratively refine and verify plans, boosting success rates by ~39% and enhancing safety.

real-to-simulation, LLM planning, semantic scene reconstruction, plan verification, robot perception, AI alignment

Problem

Building a bespoke simulation for every environment a robot operates in has traditionally been too onerous to be feasible. This leaves a critical gap: LLM-generated robot plans cannot easily be validated and refined before real-world execution.

Approach

PerceptTwin is a fully automated pipeline that transforms open-vocabulary 3D semantic scene maps into interactive simulations using 3D asset generation, affordance prediction, and an LLM-based judge to verify plan correctness and alignment.
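The verify-and-refine idea can be illustrated with a toy example. The sketch below is not the PerceptTwin implementation: the scene schema (`SceneObject`), the two skills (`open`, `pick`), the precondition rules, and the `stub_planner` standing in for an LLM are all hypothetical names invented for illustration. It only shows the general pattern of running a plan against a simulated copy of the scene, reporting the first unmet precondition, and feeding that message back to the planner.

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional


@dataclass
class SceneObject:
    """One object from a semantic scene map (hypothetical schema)."""
    name: str
    affordances: frozenset[str]      # e.g. {"openable"}, {"graspable"}
    inside: Optional[str] = None     # name of the containing object, if any
    is_open: bool = False


@dataclass(frozen=True)
class Step:
    skill: str    # "open" or "pick" in this toy example
    target: str   # name of the object the skill acts on


def simulate(scene: list[SceneObject], plan: list[Step]) -> Optional[str]:
    """Run the plan on a copy of the scene.

    Returns None if every step succeeds, otherwise a feedback message
    describing the first failed affordance or precondition check.
    """
    objs = {o.name: replace(o) for o in scene}  # copies; input stays untouched
    for i, step in enumerate(plan):
        obj = objs.get(step.target)
        if obj is None:
            return f"step {i}: no object named '{step.target}' in the scene"
        if step.skill == "open":
            if "openable" not in obj.affordances:
                return f"step {i}: '{obj.name}' is not openable"
            obj.is_open = True
        elif step.skill == "pick":
            if "graspable" not in obj.affordances:
                return f"step {i}: '{obj.name}' is not graspable"
            container = objs.get(obj.inside) if obj.inside else None
            if container is not None and not container.is_open:
                return (f"step {i}: '{container.name}' must be open "
                        f"before picking '{obj.name}'")
            obj.inside = None
        else:
            return f"step {i}: unknown skill '{step.skill}'"
    return None


# A planner maps (task, feedback from the last attempt) to a new plan.
Planner = Callable[[str, Optional[str]], list[Step]]


def plan_with_feedback(planner: Planner, task: str, scene: list[SceneObject],
                       max_rounds: int = 3) -> tuple[list[Step], bool]:
    """Ask the planner for a plan, simulate it, and retry with feedback."""
    plan: list[Step] = []
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        plan = planner(task, feedback)
        feedback = simulate(scene, plan)
        if feedback is None:
            return plan, True
    return plan, False


def stub_planner(task: str, feedback: Optional[str]) -> list[Step]:
    """Stand-in for an LLM planner: forgets to open the cabinet at first."""
    if feedback is None:
        return [Step("pick", "mug")]
    return [Step("open", "cabinet"), Step("pick", "mug")]


scene = [
    SceneObject("cabinet", frozenset({"openable"})),
    SceneObject("mug", frozenset({"graspable"}), inside="cabinet"),
]
plan, ok = plan_with_feedback(stub_planner, "fetch the mug", scene)
print(ok, [(s.skill, s.target) for s in plan])
```

In this toy run the first plan fails because the mug sits inside a closed cabinet, and the feedback message lets the planner produce a corrected plan on the second round. A real system would replace the hand-written rules with predicted affordances and a physics-capable simulator, and the stub with an actual LLM call.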

Key results

Fully automated real-to-simulation pipeline from open-vocabulary scene graphs
LLM plan success improved by an average of ~39% across GPT-5 variants
LLM judge detects unsafe or infeasible plans and suggests corrections
Human plan verification improved by up to 18% on average for plans that fail due to unfilled skill preconditions

Why it matters

It provides a scalable, interpretable foundation for safer and more reliable robot planning by bridging real-world perception with simulation-based verification.

Abstract

Simulation environments are useful for both robot policy learning and planning verification and validation. Traditionally, the process of creating a simulation was onerous. Creating a bespoke simulation environment for each individual environment that a robot would operate in was simply infeasible. In this work, we introduce PerceptTwin, a fully automatic pipeline that constructs interactive simulations directly from semantic scene representations produced by a robot’s perception stack. PerceptTwin combines open-vocabulary object maps with 3D asset generation, affordance prediction, and commonsense condition checking. These interactive simulations can be used to validate and refine plans before they are executed on the robot hardware. Borrowing from the AI alignment literature, we also introduce an LLM judge that verifies plan correctness and alignment with human preferences. Experiments show that PerceptTwin feedback allows LLM planners to refine plans, enhance safety, and resist harmful black-box prompting attacks. In our suite of tasks, PerceptTwin improves plan success by an average of ≈39% for GPT5, GPT5Mini, and GPT5Nano planners. Additionally, PerceptTwin also improves human plan verification by up to 18% on average for plans that fail due to unfilled skill preconditions. Our results demonstrate the potential of open-vocabulary scene simulation from robot perception as a foundation for safer, more reliable robot planning.

Figure caption (beginning missing): ... reasoning and planning. PerceptTwin consumes such a world representation and generates a corresponding simulation environment. This simulation can then be used for auditing robot plans, counterfactual analysis, and has the benefit of being more interpretable.

We acknowledge the support of the Natural Sciences and Engineering Research Council of Canada (NSERC) [PGS D Scholarships for Charlie Gauthier and Sacha Morin, and funding reference number ALLRP 580895-2022], as well as the support of Denso International.

(Corresponding author: charlie.gauthier@mila.quebec.)
1 Department of Computer Science and Operations Research, Université de Montréal, Montréal, QC, Canada.
2 Mila - Quebec AI Institute, Montréal, QC, Canada.
3 CIFAR AI Chair.

Index terms

Semantic Scene Understanding, Task Planning, AI-Enabled Robotics