ICRA 2026

UniVideo: Universal Monocular Video Understanding

Yawen Lu, Zhiwen Cao, Wei-An Lin, Ratheesh Kalarot

AI summary

A single unified model simultaneously predicts optical flow, depth, and panoptic segmentation with higher accuracy and temporal consistency than specialized approaches, and it supports zero-shot inference on unseen scenes.

Unified Video Understanding; Optical Flow; Monocular Depth; Panoptic Segmentation; Contrastive Learning; Self-Supervised Priors

Problem

Training and running separate models for video flow, depth, and segmentation is computationally expensive and prevents shared feature learning, which leads to geometric and temporal inconsistencies between the predictions.

Approach

UniVideo reformulates these tasks as correspondence matching and tracking within a single ConvNeXt-based Transformer, using self-supervised DINOv2 priors and contrastive learning to enforce cross-task consistency.
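
As an illustration of the contrastive idea mentioned above, here is a minimal NumPy sketch of an InfoNCE-style loss. The summary does not say which contrastive loss UniVideo uses or how pairs are formed, so everything here is an assumption: the function name `info_nce_loss`, the temperature value, and the choice to treat row i of `anchors` and row i of `positives` as embeddings of the same region from two different task branches.

```python
import numpy as np

def info_nce_loss(anchors, positives, temperature=0.07):
    """InfoNCE-style loss (hypothetical helper, not the paper's code).

    anchors, positives: (N, D) arrays. Row i of each is assumed to be a
    matching pair; all other rows in the batch act as negatives.
    """
    # L2-normalize so the dot product is a cosine similarity.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                 # (N, N) similarity matrix
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The matching pair sits on the diagonal; maximize its log-probability.
    return -np.mean(np.diag(log_prob))
```

The loss is small when each anchor is most similar to its own positive and grows when the pairs are mismatched, which is the sense in which such a term could encourage consistent features across tasks.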

Key results

Surpasses FlowFormer on Sintel and KITTI optical flow benchmarks
Improves depth estimation by 21.6% over DPT on KITTI
Boosts panoptic segmentation by 21.3% over TarVIS on VIPSeg
Enables zero-shot inference on unseen scenes with only 57.9M parameters

Why it matters

Offers robotic perception and computer vision researchers a highly efficient, unified alternative to costly multi-model pipelines while improving accuracy and generalization.

Abstract

Video flow, depth, and panoptic segmentation are fundamental to diverse robotic perception and computer vision applications. Despite recent advances in specialized approaches, several inherent limitations remain: first, training and running inference with three separate models is computationally costly; second, separate training prevents the models from learning shared feature representations and from using knowledge from the other tasks. In this work, we address these challenges by reformulating video flow estimation, depth estimation, and panoptic segmentation as a sequence of feature correspondence matching, updating, and tracking problems. This formulation allows the tasks to be addressed by a single architecture that compares feature similarities across frames. By incorporating a shared feature representation with distinct prediction heads, our model can simultaneously predict consistent and reliable optical flow, depth maps, and object masks for videos. We further demonstrate that this universal model maintains temporal consistency across tasks while requiring no task-specific re-training. Extensive experiments on the FlyingThings, Sintel, VKITTI, KITTI, and VIPSeg benchmarks demonstrate superior performance. Furthermore, the model exhibits zero-shot performance on unseen in-the-wild scenes.
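
To make "comparing feature similarities across frames" concrete, the following is a minimal NumPy sketch of dense correspondence matching used to read out optical flow. It is a generic softmax-matching scheme under stated assumptions, not the UniVideo architecture: the function name `global_matching_flow`, the `temperature` parameter, and the use of a full all-pairs correlation are illustrative choices that do not come from the abstract.

```python
import numpy as np

def global_matching_flow(feat1, feat2, temperature=1.0):
    """Estimate flow from frame 1 to frame 2 by feature matching.

    feat1, feat2: (H, W, C) feature maps (assumed to come from a shared
    backbone). Returns an (H, W, 2) array of (dx, dy) displacements.
    Hypothetical helper for illustration only.
    """
    h, w, c = feat1.shape
    f1 = feat1.reshape(h * w, c)
    f2 = feat2.reshape(h * w, c)
    # All-pairs similarity between frame-1 and frame-2 pixels.
    corr = f1 @ f2.T / (np.sqrt(c) * temperature)
    corr = corr - corr.max(axis=1, keepdims=True)  # numerical stability
    prob = np.exp(corr)
    prob = prob / prob.sum(axis=1, keepdims=True)  # softmax over frame 2
    # Pixel coordinates as (x, y), one row per pixel.
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    grid = np.stack([xs, ys], axis=-1).reshape(h * w, 2).astype(np.float64)
    # Expected matching position in frame 2, minus the source position.
    matched = prob @ grid
    return (matched - grid).reshape(h, w, 2)
```

In this sketch the same correlation matrix could in principle feed other heads (for example, propagating masks along the matches), which is the intuition behind sharing one matching backbone across tasks; how UniVideo actually does this is not specified in the abstract.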

Index terms

Deep Learning for Visual Perception; Visual Learning; Visual Tracking