ICRA 2026

Decentralized Triangulation Formation without Communication: A Vision Transformer Based Learning Approach

Xinchi Huang, Guang YANG, Yi Guo

AI summary

A Vision Transformer-based policy enables scalable, communication-free multi-robot triangulation formation using only onboard LiDAR data.

Multi-robot formation, Vision Transformer, decentralized control, LiDAR perception, swarm robotics, learning-based control

Problem

Traditional multi-robot formation control relies on explicit communication or centralized coordination, which limits scalability and introduces processing delays. This paper addresses how to achieve robust, decentralized triangulation formation using only local sensor observations.

Approach

The method converts onboard LiDAR scans into occupancy maps, processes them as image patches via a Vision Transformer, and learns an end-to-end policy that outputs velocity commands for each robot independently.
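The first two steps of this pipeline (scan to occupancy map, map to patch sequence) can be illustrated with a minimal sketch. It is not the authors' implementation: the grid size, sensing range, patch size, and the function names `scan_to_occupancy` and `split_into_patches` are all assumptions made for illustration, and the Vision Transformer itself is omitted.

```python
import math

def scan_to_occupancy(ranges, angle_min, angle_increment,
                      max_range=5.0, grid_size=8):
    """Rasterize a planar LiDAR scan into a robot-centred binary grid.

    Assumed convention (not from the paper): the robot sits at the grid
    centre, x points right, y points up, and the grid spans
    [-max_range, max_range] on both axes.
    """
    grid = [[0] * grid_size for _ in range(grid_size)]
    cell = 2.0 * max_range / grid_size  # metres per cell
    for i, r in enumerate(ranges):
        # Skip invalid returns (inf, NaN, zero) and hits beyond the grid.
        if not (0.0 < r < max_range):
            continue
        theta = angle_min + i * angle_increment
        x = r * math.cos(theta)
        y = r * math.sin(theta)
        col = int((x + max_range) / cell)
        row = int((max_range - y) / cell)  # row 0 is the top of the map
        if 0 <= row < grid_size and 0 <= col < grid_size:
            grid[row][col] = 1
    return grid

def split_into_patches(grid, patch_size):
    """Cut a square grid into non-overlapping patches, each flattened
    row-major, returned in row-major patch order (a token sequence)."""
    n = len(grid)
    if n % patch_size != 0:
        raise ValueError("grid size must be a multiple of patch_size")
    patches = []
    for pr in range(0, n, patch_size):
        for pc in range(0, n, patch_size):
            patches.append([grid[r][c]
                            for r in range(pr, pr + patch_size)
                            for c in range(pc, pc + patch_size)])
    return patches

# Hypothetical four-beam scan: hits at 1 m ahead and 1 m behind.
scan = [1.0, float("inf"), 1.0, float("nan")]
grid = scan_to_occupancy(scan, 0.0, math.pi / 2, max_range=4.0, grid_size=8)
tokens = split_into_patches(grid, patch_size=4)
print(len(tokens), len(tokens[0]))
```

In a full model, each flattened patch would be linearly embedded and passed with positional encodings to the transformer encoder; here the patches are left as raw lists so that the sketch stays dependency-free.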

Key results

99–100% success rate across team sizes of 5 to 13 robots
Seamless scalability to arbitrary team sizes despite training on only 7 robots
Robust dynamic reconfiguration when robots are added or removed mid-operation
Validated in both Gazebo simulations and real-world RoboMaster experiments

Why it matters

Provides a scalable, communication-free blueprint for deploying large multi-robot swarms in dynamic, resource-constrained environments.

Abstract

Multi-robot cooperative control has traditionally relied on model-based distributed methods, but separating perception and control in such pipelines often introduces processing delays and error accumulation. This paper presents a novel decentralized control strategy for multi-robot triangulation formation using Vision Transformers (ViTs). Unlike existing methods that rely on robot-to-robot communication or centralized coordination, the proposed approach learns an end-to-end control policy that scales to an arbitrary number of robots, solely using on-board sensor data. By segmenting LiDAR-based occupancy maps into patches and processing them as sequences, the ViT encoder captures spatial relationships in the environment. A subsequent multi-layer perceptron outputs control commands that drive each robot to form a planar triangulation with prescribed inter-robot distances, all without explicit communication among robots. The learned policy is validated in both simulations and real-world experiments on a group of RoboMaster platforms. Experimental results demonstrate robust and scalable formation performance across diverse conditions.
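To make "prescribed inter-robot distances" concrete, the sketch below checks how far each robot's nearest neighbor is from a target spacing. This is an illustrative proxy only, not the success criterion used in the paper: the function names, the nearest-neighbor measure, and the 0.05 m tolerance are all assumptions.

```python
import math

def nearest_neighbor_errors(positions, desired_distance):
    """For each robot, return |distance to nearest neighbor - desired|.

    positions: list of (x, y) tuples, at least two robots.
    """
    errors = []
    for i, (xi, yi) in enumerate(positions):
        nearest = min(math.hypot(xi - xj, yi - yj)
                      for j, (xj, yj) in enumerate(positions) if j != i)
        errors.append(abs(nearest - desired_distance))
    return errors

def formation_reached(positions, desired_distance, tolerance=0.05):
    """Hypothetical check: every robot's nearest neighbor lies within
    `tolerance` metres of the desired spacing."""
    return max(nearest_neighbor_errors(positions, desired_distance)) <= tolerance

# Three robots on an equilateral triangle with 1 m sides.
triangle = [(0.0, 0.0), (1.0, 0.0), (0.5, math.sqrt(3) / 2)]
print(formation_reached(triangle, desired_distance=1.0))
```

A nearest-neighbor test is deliberately loose: it does not verify that the robots tile the plane with triangles, only that local spacing matches the target, so a stricter evaluation would also inspect the neighbor graph.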

Index terms

Machine Learning for Robot Control, Multi-Robot Systems, Motion and Path Planning