Learning View-Invariant Sign Language Representations Via Dual-Stream Contrastive Learning
Yuting Peng, Yuecong Min, Xilin Chen
AI summary
Problem
Current sign language recognition systems degrade significantly under viewpoint variations, struggling with occlusions and visual noise in real-world settings where frontal camera data is unavailable.
Approach
ViSL processes accurate 2D keypoints and synthesized 3D-augmented keypoints through a shared feature extractor, aligning them via cross-view and cross-stream contrastive losses with learnable class proxies.
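One way to picture aligning two streams through learnable class proxies is a proxy-based softmax loss applied to both streams' embeddings. The sketch below is illustrative only and is not the paper's loss: the function names, the temperature value, and the choice to simply sum the two per-stream terms are all assumptions, and NumPy arrays stand in for parameters that would be learned in a real training framework.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def proxy_contrastive_loss(embeddings, labels, proxies, temperature=0.1):
    """Cross-entropy over cosine similarities between samples and class proxies.

    embeddings: (N, D) features from the shared extractor
    labels:     (N,) integer class ids
    proxies:    (C, D) one vector per class (learnable in a real model; assumed here)
    """
    z = l2_normalize(embeddings)
    p = l2_normalize(proxies)
    logits = z @ p.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(labels)), labels].mean()

def dual_stream_loss(z_primary, z_lifting, labels, proxies, temperature=0.1):
    """Pull both streams toward the same class proxies (hypothetical combination)."""
    return (proxy_contrastive_loss(z_primary, labels, proxies, temperature)
            + proxy_contrastive_loss(z_lifting, labels, proxies, temperature))
```

Because both streams are compared against the same proxies, embeddings of one sign from different views or inputs are drawn toward a common anchor, which is the intuition behind proxy-based alignment.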
Key results
- 67.14% Top-1 accuracy on cross-view MM-WLAuslan, surpassing prior SOTA by 9.2%
- Bridges viewpoint and estimation gaps via dual-stream contrastive learning with learnable proxies
- Maintains lightweight single-stream inference while training with noisy 3D-augmented data
- Demonstrates robust generalization across in-the-wild, synthetic, studio, and temporal disturbance test subsets
Why it matters
Enables robust, real-world sign language recognition for assistive technologies and human-robot interaction without requiring multi-camera setups or heavy inference costs.
Abstract
Viewpoint shifts significantly change how gestures and facial expressions appear and frequently cause occlusions, posing a critical challenge for robust Sign Language Recognition (SLR). To address this challenge, we exploit the spatial flexibility and computational efficiency of skeleton data and propose ViSL, a dual-stream contrastive learning framework to learn View-invariant representations for Sign Language understanding. Specifically, the primary and lifting streams share a common visual feature extractor with different types of input: the primary stream (P-Stream) directly processes frontal-view skeleton data, and the lifting stream (L-Stream) synthesizes skeleton data from arbitrary viewpoints based on 3D estimations. We further propose a view-invariant contrastive loss to align representations across both viewpoints and streams. Experimental results on the challenging cross-view setting of MM-WLAuslan demonstrate that ViSL achieves substantial performance improvements, highlighting its potential for robust real-world SLR applications.
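As a rough illustration of synthesizing skeleton data from another viewpoint, the sketch below rotates estimated 3D keypoints about the vertical axis and drops the depth coordinate. It is a minimal sketch under stated assumptions, not the paper's lifting stream: the function name `synthesize_view`, the `(frames, joints, 3)` layout, the y-up axis convention, and the orthographic projection are all choices made here for illustration.

```python
import numpy as np

def synthesize_view(keypoints_3d, yaw_deg):
    """Rotate 3D keypoints about the vertical (y) axis and project to 2D.

    keypoints_3d: array of shape (frames, joints, 3), assumed centred on the signer
    yaw_deg:      rotation angle in degrees (assumed camera yaw)
    Returns an array of shape (frames, joints, 2) via orthographic projection.
    """
    theta = np.deg2rad(yaw_deg)
    c, s = np.cos(theta), np.sin(theta)
    rot_y = np.array([[c, 0.0, s],
                      [0.0, 1.0, 0.0],
                      [-s, 0.0, c]])
    rotated = keypoints_3d @ rot_y.T   # apply the rotation to every joint
    return rotated[..., :2]            # keep x and y, discard depth

# Example: one clip of 4 frames with 5 joints, viewed from a random yaw.
rng = np.random.default_rng(0)
clip = rng.normal(size=(4, 5, 3))
side_view = synthesize_view(clip, yaw_deg=rng.uniform(-90, 90))
```

Sampling a different yaw for each training clip yields 2D skeletons that resemble off-frontal camera views. Errors in the 3D estimates carry over into the synthesized views, so this kind of augmented input is noisier than directly detected 2D keypoints.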