Graphite: A GPU-Accelerated Mixed-Precision Graph Optimization Framework
Shishir Gopinath, Karthik Dantu, Steven Ko
AI summary
Problem
Existing GPU-accelerated optimizers struggle with complex, user-defined data types common in SLAM, require cumbersome language interoperation, or consume excessive GPU memory, hindering real-time deployment on resource-constrained devices.
Approach
Graphite introduces a CUDA C++ framework that uses a descriptor-based batching model to process identical graph elements in parallel, supporting mixed-precision solving and in-place optimization to minimize memory overhead and data transfer.
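The batching idea can be illustrated with a minimal host-side C++ sketch. It is not Graphite's API: the names `Descriptor`, `ReprojectionFactor`, `PriorFactor`, `batch_cost`, and `total_cost` are hypothetical, the residuals are scalar toys, and the CPU loops stand in for what would be one kernel launch per batch on a GPU. It only shows that grouping elements of the same type lets each batch follow one uniform code path with no per-element type check.

```cpp
#include <vector>

// Hypothetical constraint types for illustration only (not Graphite's API).
struct ReprojectionFactor { double predicted, observed; };
struct PriorFactor { double value, target, weight; };

// A "descriptor" in this sketch is a homogeneous batch: every element has the
// same type, so a single loop body can process all of them.
template <typename Factor>
struct Descriptor {
    std::vector<Factor> items;
};

// One residual function per constraint type, selected at compile time.
inline double residual(const ReprojectionFactor& f) {
    return f.predicted - f.observed;
}
inline double residual(const PriorFactor& f) {
    return f.weight * (f.value - f.target);
}

// Sum of squared residuals over one batch. In a GPU version, each iteration
// would correspond to one thread, and all threads would run the same code.
template <typename Factor>
double batch_cost(const Descriptor<Factor>& d) {
    double cost = 0.0;
    for (const Factor& f : d.items) {
        const double r = residual(f);
        cost += r * r;
    }
    return cost;
}

// Total objective: one uniform pass per descriptor, no runtime type dispatch.
double total_cost(const Descriptor<ReprojectionFactor>& a,
                  const Descriptor<PriorFactor>& b) {
    return batch_cost(a) + batch_cost(b);
}
```

Because the constraint type is fixed per batch, the choice of residual function is resolved at compile time rather than by a branch inside the loop, which is the property the summary attributes to descriptor batching.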
Key results
- General mixed-precision framework supporting 64-bit, 32-bit, and 16-bit floating-point types
- Descriptor batching model that eliminates GPU thread branching for identical vertices and constraints
- Up to 59× speedup over CPU baselines for global visual-inertial bundle adjustment in ORB-SLAM3
- Up to 78% reduction in GPU memory usage compared to specialized solvers like MegBA
Why it matters
Enables efficient, large-scale nonlinear optimization for real-time SLAM and robotics applications on both desktop and embedded hardware.
Abstract
We present Graphite, a GPU-accelerated nonlinear least squares graph optimization framework. It provides a CUDA C++ interface to enable the sharing of code between a real-time application, such as a SLAM system, and its optimization tasks. The framework supports techniques to reduce memory usage, including in-place optimization, support for multiple floating-point types and mixed-precision modes, and dynamically computed Jacobians. We evaluate Graphite on well-known bundle adjustment problems and find that it achieves similar performance to MegBA, a solver specialized for bundle adjustment, while maintaining generality and using less memory. We also apply Graphite to global visual-inertial bundle adjustment on maps generated from stereo-inertial SLAM datasets, and observe speed-ups of up to 59× compared to a CPU baseline. Our results indicate that our framework enables faster large-scale optimization on both desktop and resource-constrained devices.
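As a rough illustration of why a mixed-precision mode can be useful, the following host-side C++ sketch stores residuals in one floating-point type and accumulates their squared norm in another. This is one common form of mixed precision and an assumption here; the abstract does not say which quantities Graphite keeps in which precision, and the function name `squared_norm` is hypothetical.

```cpp
#include <vector>

// Sum of squared residuals, with the storage type (Store) chosen
// independently of the accumulator type (Accum).
// Hypothetical helper for illustration; not part of Graphite.
template <typename Store, typename Accum>
Accum squared_norm(const std::vector<Store>& residuals) {
    Accum sum = 0;
    for (const Store v : residuals) {
        // Promote each value to the accumulator type before multiplying,
        // so both the product and the running sum use Accum precision.
        sum += static_cast<Accum>(v) * static_cast<Accum>(v);
    }
    return sum;
}
```

For example, with the residuals `{4096.0f, 1.0f, 1.0f, 1.0f, 1.0f}` on a typical IEEE 754 platform, `squared_norm<float, float>` returns 16777216 because a 32-bit float cannot represent 2^24 + 1, so each later addition is rounded away, while `squared_norm<float, double>` returns the exact value 16777220. Keeping the stored data in 32 bits halves its memory footprint relative to 64-bit storage, and the wider accumulator recovers the lost accuracy.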