MASt3R-Nav: WayPixel Navigation in Relative 3D Maps
AI summary
Problem
Classical 3D maps require globally consistent geometry, while image- or object-relative topological graphs sacrifice geometric understanding, limiting navigation to teach-and-repeat or coarse control. There is a need for a representation that balances geometric precision with computational feasibility without requiring global registration.
Approach
The authors construct a pixel-level topological graph using relative 3D correspondences from the MASt3R model, compute shortest-path costs to generate a dense 'WayPixel Costmap,' and train a neural controller conditioned on this fine-grained costmap to predict trajectory rollouts.
Key results
- Proposes MASt3R-Nav, a topological navigation pipeline using pixel-relative connectivity
- Introduces the WayPixel Costmap as a dense planning-to-control interface
- Trains PixelReact, a controller that exploits fine-grained cost gradients for robust trajectory rollout
- Outperforms object- and image-level baselines with an SPL of 81.77 in simulator and real-world tests
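For readers unfamiliar with the metric, the sketch below assumes that SPL here means the standard "Success weighted by Path Length" measure from the embodied navigation literature, reported as a percentage. The function name and arguments are illustrative and are not taken from the paper's code.

```python
def spl(successes, shortest_lengths, path_lengths):
    """Success weighted by Path Length over a set of episodes.

    successes:        1 if the episode reached the goal, else 0
    shortest_lengths: shortest-path length from start to goal (assumed > 0)
    path_lengths:     length of the path the agent actually travelled
    """
    if not successes:
        raise ValueError("at least one episode is required")
    total = 0.0
    for s, l, p in zip(successes, shortest_lengths, path_lengths):
        # A successful episode scores l / max(p, l): 1.0 for an optimal
        # path, less for a detour. A failed episode scores 0.
        total += s * l / max(p, l)
    return total / len(successes)


# Three episodes: optimal success, failure, success with a 2x detour.
score = spl([1, 0, 1], [10.0, 10.0, 10.0], [10.0, 5.0, 20.0])
print(f"SPL = {100 * score:.2f}")
```

Under this definition, a score of 81.77 would mean that episodes are mostly successful and that the successful paths are close to the shortest possible length.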
Why it matters
Provides a computationally feasible, geometrically precise navigation framework that improves robotic control robustness without requiring globally consistent 3D maps.
Abstract
Visual navigation ability is strongly tied to its underlying representation of the world. Unlike classical 3D maps that require globally consistent geometry, image- or object-relative topological graphs almost entirely do away with geometric understanding. However, this comes at the cost of navigation capability, often limiting it to mere teach-and-repeat. In this work, we propose a novel map representation in the form of pixel-relative connectivity, which is geometrically accurate but does not require global geometric consistency. Inspired by recent progress in 3D-grounded image matching, we construct a map from an image sequence through inter-image connectivity based on pixel correspondences in the relative 3D coordinate systems of individual image pairs. We then use this pixel-level graph to perform global path planning by approximating and sparsifying intra-image pixel connectivity. Through this, we derive a “WayPixel Costmap” representation and train a controller conditioned on it to predict a trajectory rollout. We show that this dense pixel-level costmap based on relative geometry is a more accurate conditioning variable for control prediction than its image- and object-level counterparts. This enables a highly capable navigation system, as validated on four types of navigation tasks in the simulator and through real-world demonstrations.
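The planning step described above can be illustrated with a toy example. This is a minimal sketch, not the authors' implementation: it assumes nodes are (image, row, column) tuples on a coarse pixel grid, that intra-image edges link neighbouring pixels, that inter-image edges link corresponding pixels, and that all edge weights are given. The helper names (`add_edge`, `costs_to_goal`, `costmap_for_image`) and the weights are hypothetical. Plain Dijkstra from the goal pixel stands in for whatever planner the paper uses.

```python
import heapq
from collections import defaultdict


def add_edge(adj, u, v, weight):
    """Add an undirected weighted edge between two pixel nodes."""
    adj[u].append((v, weight))
    adj[v].append((u, weight))


def costs_to_goal(adj, goal):
    """Dijkstra from the goal: shortest-path cost for every reachable node."""
    dist = {goal: 0.0}
    heap = [(0.0, goal)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in adj[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def costmap_for_image(dist, image_id, height, width):
    """Dense per-pixel cost grid for one image; unreachable pixels get inf."""
    return [
        [dist.get((image_id, r, c), float("inf")) for c in range(width)]
        for r in range(height)
    ]


# Toy map: two 1x3 "images". Nodes are (image, row, col).
adj = defaultdict(list)
for img in (0, 1):
    for c in range(2):
        add_edge(adj, (img, 0, c), (img, 0, c + 1), 1.0)  # intra-image edges
add_edge(adj, (0, 0, 2), (1, 0, 0), 0.5)  # inter-image correspondence edge

dist = costs_to_goal(adj, goal=(1, 0, 2))
costmap = costmap_for_image(dist, image_id=0, height=1, width=3)
print(costmap)
```

In this toy setup the cost falls from left to right across image 0, toward the pixel that connects to the goal image. A controller conditioned on such a map could follow that gradient, which is the intuition behind using a dense costmap rather than a single image- or object-level target.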