ICRA 2026

Direct Contact-Tolerant Motion Planning with Vision Language Models

He Li, Jian Sun, Chengyang Li, Guoliang Li, Qiyu Ruan, Shuai Wang, and Chengzhong Xu

AI summary

DCT leverages vision-language models to directly partition point clouds into contact-tolerant and intolerant regions, enabling real-time, robust navigation through cluttered environments with movable obstacles.

Contact-tolerant planning, Vision-language models, Point cloud partitioning, Real-time navigation, Deep learning control, Movable obstacles

Problem

Existing contact-tolerant motion planning methods rely on indirect spatial representations like prebuilt maps, causing inaccuracies, poor adaptability to environmental changes, and high computational costs for real-time movability reasoning.

Approach

The system uses a vision-language model to identify movable obstacles in images and propagates these masks to lidar scans for real-time point cloud partitioning, then solves the resulting direct point-to-action optimization with a specialized deep neural network for fast control.
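The mask-to-scan step described above can be illustrated with a minimal sketch. It assumes the lidar points have already been transformed into the camera frame, a pinhole camera with intrinsic matrix `K`, and a boolean mask marking pixels the VLM labeled as movable. The function name `partition_points` and all parameters are hypothetical and chosen for illustration; this is not the authors' implementation.

```python
import numpy as np

def partition_points(points_cam, mask, K):
    """Split points (N, 3), given in the camera frame, using a boolean image mask.

    mask[v, u] == True marks pixels labeled movable (contact-tolerant).
    Points behind the camera or outside the image are treated as intolerant
    (an assumption made here for safety, not taken from the paper).
    """
    h, w = mask.shape
    z = points_cam[:, 2]
    in_front = z > 1e-6
    # Avoid dividing by zero or negative depth; such points are filtered below.
    safe_z = np.where(in_front, z, 1.0)
    # Pinhole projection: u = fx * x / z + cx, v = fy * y / z + cy
    u = np.round(K[0, 0] * points_cam[:, 0] / safe_z + K[0, 2]).astype(int)
    v = np.round(K[1, 1] * points_cam[:, 1] / safe_z + K[1, 2]).astype(int)
    visible = in_front & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    tolerant = np.zeros(len(points_cam), dtype=bool)
    tolerant[visible] = mask[v[visible], u[visible]]
    return points_cam[tolerant], points_cam[~tolerant]
```

Treating unobserved points as contact-intolerant is a conservative default: a point is only allowed to be touched if a mask explicitly covers it.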

Key results

Real-time VLM point cloud partitioner with memory-driven mask propagation
Fast learned planner using a deep neural network to solve large-scale MPC constraints in microseconds
Successful implementation in Isaac Sim and on a real car-like robot
Superior navigation performance over baselines in cluttered scenarios with movable obstacles

Why it matters

Enables autonomous robots to safely and efficiently navigate highly cluttered, uncertain environments by directly reasoning about and interacting with movable objects in real time.

Abstract

Navigation in cluttered environments often requires robots to tolerate contact with movable or deformable objects to maintain efficiency. Existing contact-tolerant motion planning (CTMP) methods rely on indirect spatial representations (e.g., prebuilt maps, obstacle sets), resulting in inaccuracies and a lack of adaptiveness to environmental uncertainties. To address this issue, we propose a direct contact-tolerant (DCT) planner, which integrates vision–language models (VLMs) into direct point perception and navigation and comprises two key components. The first is the VLM point cloud partitioner (VPP), which performs contact-tolerance reasoning in image space using a VLM, caches inference masks, propagates them across frames using odometry, and projects them onto the current scan to generate a contact-aware point cloud. The second is VPP-guided navigation (VGN), which formulates CTMP as a perception-to-control optimization problem under direct contact-aware point cloud constraints and solves it with a specialized deep neural network (DNN). We implement DCT in Isaac Sim and on a real car-like robot, demonstrating that DCT achieves robust and efficient navigation in cluttered environments with movable obstacles, outperforming representative baselines across diverse metrics. The code is available at: https://github.com/ChrisLeeUM/DCT.
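The odometry-based propagation mentioned in the abstract can be sketched as a change of reference frame: points from the current scan are expressed in the robot frame at which a cached mask was captured, so the cached mask can be reused without a new VLM query. The sketch below assumes planar motion and poses given as `(x, y, yaw)` in a common odometry frame. The helper names `rot2d` and `to_cached_frame` are hypothetical, and the paper's actual propagation scheme may differ.

```python
import numpy as np

def rot2d(yaw):
    """2D rotation matrix for a heading angle in radians."""
    c, s = np.cos(yaw), np.sin(yaw)
    return np.array([[c, -s], [s, c]])

def to_cached_frame(points_now, pose_now, pose_cached):
    """Express 2D points (N, 2) from the current robot frame in the cached robot frame.

    pose_now and pose_cached are (x, y, yaw) in a shared odometry frame
    (an assumed convention for this illustration).
    """
    R_now, t_now = rot2d(pose_now[2]), np.asarray(pose_now[:2], dtype=float)
    R_c, t_c = rot2d(pose_cached[2]), np.asarray(pose_cached[:2], dtype=float)
    # Current frame -> odometry frame: p_w = R_now p + t_now (row-vector form).
    world = points_now @ R_now.T + t_now
    # Odometry frame -> cached frame: p_c = R_c^T (p_w - t_c) (row-vector form).
    return (world - t_c) @ R_c
```

Once the points are in the cached frame, they can be projected into the cached mask with the camera model that was valid when the mask was produced.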

Index terms

Autonomous Vehicle Navigation; Intelligent Transportation Systems