ICRA 2026

Physically-Based Lighting Generation for Robotic Manipulation

Shutong Jin, Lezhong Wang, Ben Temming, Florian T. Pokorny

AI summary

A novel framework generates physically accurate, temporally consistent lighting variations for real-world robotic demonstrations, boosting imitation learning policy performance by 38.75% under unseen lighting.

Inverse rendering, Lighting generation, Robotic manipulation, Imitation learning, Video diffusion, Data augmentation

Problem

Collecting diverse real-world robotic manipulation data is prohibitively costly, particularly when accounting for dynamic lighting variations that severely degrade policy robustness.

Approach

The method decomposes demonstration frames into geometric and material properties via inverse rendering, physically relights them, and propagates the changes across video sequences using a domain-adapted Stable Video Diffusion model.
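As a rough illustration of the relighting step, the sketch below shades decomposed per-pixel properties under a new directional light. It is a diffuse-only (Lambertian) simplification and not the paper's renderer: it uses only albedo and surface normals and ignores depth, roughness, and metallic. The function names (`relight_pixel`, `relight_frame`) and the `ambient` parameter are hypothetical, chosen here for illustration.

```python
import math


def normalize(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(sum(c * c for c in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return tuple(c / norm for c in v)


def relight_pixel(albedo, normal, light_dir,
                  light_rgb=(1.0, 1.0, 1.0), ambient=0.05):
    """Diffuse-only shading of one pixel (illustrative assumption).

    albedo    -- RGB reflectance, each channel in [0, 1]
    normal    -- surface normal (any length, normalized here)
    light_dir -- direction from the surface toward the light
    light_rgb -- RGB intensity of the new light
    ambient   -- constant fill term so unlit surfaces are not pure black
    """
    n = normalize(normal)
    l = normalize(light_dir)
    # Lambert's cosine law, clamped so surfaces facing away get no direct light.
    cos_theta = max(0.0, sum(a * b for a, b in zip(n, l)))
    return tuple(
        min(1.0, a * (ambient + c * cos_theta))
        for a, c in zip(albedo, light_rgb)
    )


def relight_frame(albedo_map, normal_map, light_dir, **kwargs):
    """Apply relight_pixel to every pixel of row-major albedo/normal maps."""
    return [
        [relight_pixel(a, n, light_dir, **kwargs) for a, n in zip(a_row, n_row)]
        for a_row, n_row in zip(albedo_map, normal_map)
    ]
```

Calling `relight_frame` with different `light_dir` or `light_rgb` values on the same albedo and normal maps yields differently lit versions of one frame. In the described method, a fine-tuned video model then propagates such a relit first frame through the rest of the sequence; that step is not sketched here.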

Key results

38.75% improvement in imitation learning success rates under six unseen lighting conditions
Physically accurate, temporally consistent relighting from single demonstration frames
Superior structural and temporal fidelity compared to text-prompt relighting baselines
Enables downstream environmental augmentations like background and texture generation

Why it matters

Drastically reduces the cost and effort of real-world data collection by enabling robots to learn robust manipulation skills across diverse lighting environments.

Abstract

We propose the first framework that leverages physically-based inverse rendering for novel lighting generation on existing real-world human demonstrations of robotic manipulation tasks. Specifically, inverse rendering decomposes the first frame in each demonstration into geometric (surface normal, depth) and material (albedo, roughness, metallic) properties, which are then used to render appearance changes under different lighting sources. To improve efficiency and maintain consistency across each generated sequence, we fine-tune Stable Video Diffusion on robot execution videos for temporal lighting propagation. We evaluate our framework by measuring the visual quality of the generated sequences, assessing its effectiveness in improving the imitation learning policy performance (38.75%) under six unseen real-world lighting conditions, and conducting ablation studies on individual modules of the proposed framework. We further showcase three downstream applications enabled by the proposed framework: background generation, object texture generation and distractor positioning.

Index terms

Imitation Learning, Learning from Demonstration, Deep Learning Methods