UniVideo: Universal Monocular Video Understanding
Yawen Lu, Zhiwen Cao, Wei-An Lin, Ratheesh Kalarot
AI summary
Problem
Training and running inference with separate models for video flow, depth, and segmentation is computationally expensive and prevents shared feature learning, causing geometric and temporal inconsistencies.
Approach
UniVideo reformulates these tasks as correspondence matching and tracking within a single ConvNeXt-based Transformer, using self-supervised DINOv2 priors and contrastive learning to enforce cross-task consistency.
Key results
- Surpasses FlowFormer on Sintel and KITTI optical flow benchmarks
- Improves depth estimation by 21.6% over DPT on KITTI
- Boosts panoptic segmentation by 21.3% over TarVIS on VIPSeg
- Enables zero-shot inference on unseen scenes with only 57.9M parameters
Why it matters
Offers robotic perception and computer vision researchers a highly efficient, unified alternative to costly multi-model pipelines while improving accuracy and generalization.
Abstract
Video flow, depth, and panoptic segmentation are fundamental to diverse robotic perception and computer vision applications. Despite recent advances in specialized approaches, several inherent limitations remain: first, training and running inference with three separate models is computationally costly; second, separate training prevents each model from learning underlying feature representations and knowledge from the other tasks. In this work, we address these challenges by reformulating video flow estimation, depth estimation, and panoptic segmentation as a sequence of feature correspondence matching, updating, and tracking problems. This approach allows these tasks to be addressed by a single architecture that compares feature similarities across frames. By incorporating a shared feature representation with distinct prediction heads, our model can simultaneously predict consistent and reliable optical flow, depth maps, and object masks for videos. We further demonstrate that this universal model maintains temporal consistency across tasks while requiring no task-specific re-training. Extensive experiments on the FlyingThings, Sintel, VKITTI, KITTI, and VIPSeg benchmarks demonstrate superior performance. Furthermore, the model generalizes zero-shot to unseen in-the-wild scenes.
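To make the idea of "comparing feature similarities across frames" concrete, the following is a minimal NumPy sketch of dense correspondence matching between two feature maps. It is an illustration only, not the UniVideo architecture: the function name `match_features`, the cosine-similarity measure, the softmax temperature, and the soft-argmax readout are all assumptions chosen for simplicity, and random arrays stand in for features from a learned backbone.

```python
import numpy as np


def match_features(feat_a, feat_b, temperature=0.1):
    """Soft dense matching from frame A to frame B (hypothetical helper).

    feat_a, feat_b: arrays of shape (H, W, C) holding per-pixel features.
    Returns a flow field of shape (H, W, 2) storing (dx, dy) per pixel.
    """
    h, w, c = feat_a.shape
    a = feat_a.reshape(h * w, c)
    b = feat_b.reshape(h * w, c)

    # L2-normalize so the dot product below is a cosine similarity.
    a = a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-8)
    b = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-8)

    # All-pairs similarity: row i compares pixel i of A with every pixel of B.
    sim = (a @ b.T) / temperature

    # Numerically stable softmax over candidate locations in B.
    sim = sim - sim.max(axis=1, keepdims=True)
    prob = np.exp(sim)
    prob = prob / prob.sum(axis=1, keepdims=True)

    # Pixel coordinates (x, y) of every location, in row-major order.
    ys, xs = np.mgrid[0:h, 0:w]
    grid = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(np.float64)

    # Expected matching coordinate in B, minus the source coordinate in A.
    matched = prob @ grid
    return (matched - grid).reshape(h, w, 2)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frame_a = rng.normal(size=(6, 8, 64))
    # Simulate a one-pixel shift to the right (wraps around at the border).
    frame_b = np.roll(frame_a, shift=1, axis=1)
    flow = match_features(frame_a, frame_b, temperature=0.01)
    print(flow[2, 3])  # horizontal component near 1, vertical near 0
```

In a unified model of the kind the abstract describes, a matching step like this would operate on features from the shared representation, with separate prediction heads reading out flow, depth, and masks; how UniVideo implements those components is not specified in this summary.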