Physically-Based Lighting Generation for Robotic Manipulation
Shutong Jin, Lezhong Wang, Ben Temming, Florian T. Pokorny
AI summary
Problem
Collecting diverse real-world robotic manipulation data is prohibitively costly, particularly when accounting for dynamic lighting variations that severely degrade policy robustness.
Approach
The method decomposes demonstration frames into geometric and material properties via inverse rendering, physically relights them, and propagates the changes across video sequences using a domain-adapted Stable Video Diffusion model.
Key results
- 38.75% improvement in imitation learning success rates under six unseen lighting conditions
- Physically based, temporally consistent relighting propagated from the first frame of each demonstration
- Superior structural and temporal fidelity compared to text-prompt relighting baselines
- Enables three downstream applications: background generation, object texture generation, and distractor positioning
Why it matters
Drastically reduces the cost and effort of real-world data collection by enabling robots to learn robust manipulation skills across diverse lighting environments.
Abstract
We propose the first framework that leverages physically-based inverse rendering for novel lighting generation on existing real-world human demonstrations of robotic manipulation tasks. Specifically, inverse rendering decomposes the first frame in each demonstration into geometric (surface normal, depth) and material (albedo, roughness, metallic) properties, which are then used to render appearance changes under different lighting sources. To improve efficiency and maintain consistency across each generated sequence, we fine-tune Stable Video Diffusion on robot execution videos for temporal lighting propagation. We evaluate our framework by measuring the visual quality of the generated sequences, assessing its effectiveness in improving the imitation learning policy performance (38.75%) under six unseen real-world lighting conditions, and conducting ablation studies on individual modules of the proposed framework. We further showcase three downstream applications enabled by the proposed framework: background generation, object texture generation and distractor positioning.
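To make the relighting step concrete, the following is a minimal sketch of how a decomposed frame could be re-rendered under a new light. It is not the paper's renderer: it assumes a single directional light and a purely diffuse (Lambertian) shading term, and it ignores the roughness, metallic, and depth maps that the framework also estimates. The function name `relight_lambertian` and its parameters are hypothetical and chosen for illustration only.

```python
import numpy as np

def relight_lambertian(albedo, normals, light_dir,
                       light_color=(1.0, 1.0, 1.0), ambient=0.1):
    """Re-render an image from albedo and normal maps under one directional light.

    Assumption: diffuse-only shading, which is far simpler than a full
    physically-based renderer using roughness and metallic maps.
    albedo:  (H, W, 3) array in [0, 1]
    normals: (H, W, 3) array of surface normals (normalized here)
    light_dir: 3-vector pointing from the surface toward the light
    """
    albedo = np.asarray(albedo, dtype=np.float64)
    normals = np.asarray(normals, dtype=np.float64)

    # Normalize the light direction and the per-pixel normals.
    l = np.asarray(light_dir, dtype=np.float64)
    l = l / np.linalg.norm(l)
    norm = np.linalg.norm(normals, axis=-1, keepdims=True)
    n = normals / np.clip(norm, 1e-8, None)

    # Lambert's cosine law: intensity scales with max(0, n . l).
    n_dot_l = np.clip(n @ l, 0.0, None)                     # (H, W)
    shading = ambient + n_dot_l[..., None] * np.asarray(light_color)

    return np.clip(albedo * shading, 0.0, 1.0)

# Usage: a flat gray surface facing the camera, lit from two directions.
albedo = np.full((2, 2, 3), 0.5)
normals = np.zeros((2, 2, 3))
normals[..., 2] = 1.0
front_lit = relight_lambertian(albedo, normals, light_dir=(0.0, 0.0, 1.0))
side_lit = relight_lambertian(albedo, normals, light_dir=(1.0, 0.0, 0.0))
```

Changing `light_dir` or `light_color` yields a different appearance from the same decomposition. Under the assumptions above, this is the sense in which one decomposed first frame can seed many lighting variants before temporal propagation is applied.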