Multi-Modal Motion Prediction Using Temporal Ensembling with Learning-Based Aggregation
Kai-Yin Hong, Chieh-Chih Wang, Wen-Chieh Lin
Abstract
Recent years have seen a shift towards learning-based methods for trajectory prediction, with challenges remaining in addressing uncertainty and capturing multi-modal distributions. This paper introduces Temporal Ensembling with Learning-based Aggregation, a meta-algorithm designed to mitigate the issue of missing behaviors in trajectory prediction, which leads to inconsistent predictions across consecutive frames. Unlike conventional model ensembling, temporal ensembling leverages predictions from nearby frames to enhance spatial coverage and prediction diversity. By confirming predictions from multiple frames, temporal ensembling compensates for occasional errors in individual frame predictions. Furthermore, trajectory-level aggregation, often utilized in model ensembling, is insufficient for temporal ensembling due to a lack of consideration of traffic context and its tendency to assign candidate trajectories with incorrect driving behaviors to final predictions. We further emphasize the necessity of learning-based aggregation by utilizing mode queries within a DETR-like architecture for our temporal ensembling, leveraging the characteristics of predictions from nearby frames. Our method, validated on the Argoverse 2 dataset, shows notable improvements: a 4% reduction in minADE, a 5% decrease in minFDE, and a 1.16% reduction in the miss rate compared to the strongest baseline, QCNet, highlighting its efficacy and potential in autonomous driving.
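For readers unfamiliar with the reported metrics, the following is a minimal sketch of how minADE, minFDE, and miss rate are commonly computed for multi-modal trajectory prediction. It is not the authors' evaluation code. The function names, the 2D point representation, and the 2.0 m miss threshold are assumptions chosen for illustration and are not taken from the abstract.

```python
import math

# Illustrative sketch only: names and the 2.0 m threshold are assumptions.
# A trajectory is a list of (x, y) points, one per future timestep.

def displacement(p, q):
    """Euclidean distance between two 2D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def min_ade(candidates, ground_truth):
    """Lowest average displacement error over all candidate trajectories."""
    return min(
        sum(displacement(p, g) for p, g in zip(traj, ground_truth)) / len(ground_truth)
        for traj in candidates
    )

def min_fde(candidates, ground_truth):
    """Lowest final-point displacement error over all candidate trajectories."""
    return min(displacement(traj[-1], ground_truth[-1]) for traj in candidates)

def miss_rate(scenarios, threshold=2.0):
    """Fraction of scenarios in which no candidate endpoint falls within
    `threshold` metres of the ground-truth endpoint (assumed threshold)."""
    misses = sum(1 for cands, gt in scenarios if min_fde(cands, gt) > threshold)
    return misses / len(scenarios)

# Toy data: one near-correct candidate and one with the wrong behavior.
ground_truth = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
straight = [(1.0, 0.0), (2.0, 0.0), (3.0, 1.0)]
turning = [(1.0, 1.0), (2.0, 3.0), (3.0, 5.0)]

ade = min_ade([straight, turning], ground_truth)
fde = min_fde([straight, turning], ground_truth)
mr = miss_rate([([straight, turning], ground_truth), ([turning], ground_truth)])
```

In this toy case the second scenario contains only the wrong-behavior candidate, so it counts as a miss. This is the kind of missing-behavior failure that a more diverse candidate set is meant to reduce.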