Direct Contact-Tolerant Motion Planning with Vision Language Models
He Li, Jian Sun, Chengyang Li, Guoliang Li, Qiyu Ruan, Shuai Wang, and Chengzhong Xu
AI summary
Problem
Existing contact-tolerant motion planning methods rely on indirect spatial representations like prebuilt maps, causing inaccuracies, poor adaptability to environmental changes, and high computational costs for real-time movability reasoning.
Approach
The system uses a vision-language model to identify movable obstacles in images and propagates these masks to lidar scans for real-time point cloud partitioning, then solves the resulting direct point-to-action optimization with a specialized deep neural network for fast control.
Key results
- Real-time VLM point cloud partitioner with memory-driven mask propagation
- Fast learned planner using a deep neural network to solve large-scale MPC constraints in microseconds
- Successful implementation in Isaac Sim and on a real car-like robot
- Superior navigation performance over baselines in cluttered scenarios with movable obstacles
Why it matters
Enables autonomous robots to safely and efficiently navigate highly cluttered, uncertain environments by directly reasoning about and interacting with movable objects in real time.
Abstract
Navigation in cluttered environments often requires robots to tolerate contact with movable or deformable objects to maintain efficiency. Existing contact-tolerant motion planning (CTMP) methods rely on indirect spatial representations (e.g., prebuilt maps, obstacle sets), resulting in inaccuracies and a lack of adaptiveness to environmental uncertainties. To address this issue, we propose a direct contact-tolerant (DCT) planner, which integrates vision–language models (VLMs) into direct point perception and navigation and comprises two key components. The first is the VLM point cloud partitioner (VPP), which performs contact-tolerance reasoning in image space using a VLM, caches inference masks, propagates them across frames using odometry, and projects them onto the current scan to generate a contact-aware point cloud. The second is VPP-guided navigation (VGN), which formulates CTMP as a perception-to-control optimization problem under direct contact-aware point cloud constraints and solves it with a specialized deep neural network (DNN). We implement DCT in Isaac Sim and on a real car-like robot, demonstrating that DCT achieves robust and efficient navigation in cluttered environments with movable obstacles, outperforming representative baselines across diverse metrics. The code is available at: https://github.com/ChrisLeeUM/DCT.
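To make the VPP step concrete, the following minimal sketch shows one plausible way to reuse a cached image mask on a newer lidar scan: each scan point is moved into the robot frame in which the mask was captured (using the two odometry poses), projected with a pinhole model, and labeled by the mask value at that pixel. This is an illustration only, not the implementation in the DCT repository. All function names, the planar (x, y, heading) pose format, the camera placement at the robot origin with its optical axis along +x, and the rule that points outside the cached view default to non-contactable are assumptions made here.

```python
import math

def to_world(pt, pose):
    """Transform a 2D point from the robot frame into the world frame."""
    x, y, th = pose
    c, s = math.cos(th), math.sin(th)
    return (x + c * pt[0] - s * pt[1], y + s * pt[0] + c * pt[1])

def to_robot(pt, pose):
    """Transform a 2D world point into the robot frame located at `pose`."""
    x, y, th = pose
    c, s = math.cos(th), math.sin(th)
    dx, dy = pt[0] - x, pt[1] - y
    return (c * dx + s * dy, -s * dx + c * dy)

def project(pt_robot, fx, fy, cx, cy, z=0.0):
    """Pinhole projection (assumed camera: at robot origin, looking along +x).

    `z` is the height of the lidar plane relative to the camera.
    Returns integer pixel (u, v), or None if the point is behind the camera.
    """
    depth = pt_robot[0]
    if depth <= 1e-6:
        return None
    u = fx * (-pt_robot[1]) / depth + cx  # robot +y (left) maps to image left
    v = fy * (-z) / depth + cy            # robot +z (up) maps to image up
    return (int(round(u)), int(round(v)))

def partition_scan(scan, pose_now, mask, pose_mask, intrinsics):
    """Split the current scan into (movable, rigid) points using a cached mask.

    scan:       list of (x, y) points in the current robot frame
    pose_now:   (x, y, heading) odometry pose of the current scan
    mask:       2D list of booleans, True where the VLM marked "movable"
    pose_mask:  odometry pose at which the mask image was captured
    intrinsics: (fx, fy, cx, cy) of the assumed pinhole camera
    """
    height, width = len(mask), len(mask[0])
    movable, rigid = [], []
    for pt in scan:
        # Odometry-based propagation: express the point in the cached frame.
        pt_old = to_robot(to_world(pt, pose_now), pose_mask)
        px = project(pt_old, *intrinsics)
        hit = (
            px is not None
            and 0 <= px[0] < width
            and 0 <= px[1] < height
            and bool(mask[px[1]][px[0]])
        )
        # Assumption: anything not covered by the mask is treated as rigid.
        (movable if hit else rigid).append(pt)
    return movable, rigid

# Toy example: a 5x5 mask whose two right-most columns are "movable".
mask = [[col >= 3 for col in range(5)] for _ in range(5)]
scan = [(1.0, -1.0), (1.0, 1.0), (-3.0, 0.0)]
movable, rigid = partition_scan(
    scan,
    pose_now=(1.0, 0.0, 0.0),   # robot has advanced 1 m since the mask
    mask=mask,
    pose_mask=(0.0, 0.0, 0.0),
    intrinsics=(2.0, 2.0, 2.0, 2.0),
)
```

In this toy setup the point to the robot's right falls on the masked image columns and is labeled movable, while the point on the left and the point behind the cached camera fall back to the rigid set. Defaulting unseen points to rigid is a conservative choice for this sketch; a planner would then apply contact-tolerant constraints only to the movable set.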