DONUT: A Decoder-Only Model for Trajectory Prediction

ICCV 2025

¹RWTH Aachen University   ²Eindhoven University of Technology

Abstract

Predicting the motion of other agents in a scene is highly relevant for autonomous driving, as it allows a self-driving car to anticipate and react to their behavior. Inspired by the success of decoder-only models for language modeling, we propose DONUT, a Decoder-Only Network for Unrolling Trajectories. Different from existing encoder-decoder forecasting models, we encode historical trajectories and predict future trajectories with a single autoregressive model. This allows the model to make iterative predictions in a consistent manner and ensures that it is always provided with up-to-date information, enhancing performance. Furthermore, inspired by multi-token prediction for language modeling, we introduce an 'overprediction' strategy that gives the network the auxiliary task of predicting trajectories at longer temporal horizons. This allows the model to better anticipate the future and further improves performance. Our experiments demonstrate that the decoder-only approach outperforms the encoder-decoder baseline and achieves new state-of-the-art results on the Argoverse 2 single-agent motion forecasting benchmark.

Encoder-decoder vs. decoder-only methods for motion forecasting. In contrast to existing works, which use an encoder-decoder architecture, DONUT uses a single, unified autoregressive model to process agents' historical and future trajectories. This allows it to predict trajectories at different timesteps in a consistent manner and to receive up-to-date information about relevant scene elements, improving its performance.
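
To make the contrast concrete, here is a minimal sketch (not the authors' code) of decoder-only unrolling: a single model consumes the historical sub-trajectories and then feeds its own predictions back in as input, so every decoding step sees up-to-date context. The DecoderOnlyModel stub, the chunk sizes, and the tensor shapes are illustrative assumptions.

import torch
import torch.nn as nn

class DecoderOnlyModel(nn.Module):
    """Stand-in for a single autoregressive trajectory model (hypothetical)."""
    def __init__(self, steps_per_chunk=10, dim=2):
        super().__init__()
        self.out = nn.Linear(steps_per_chunk * dim, steps_per_chunk * dim)

    def forward(self, chunk):
        # chunk: (batch, steps_per_chunk, dim) -> next sub-trajectory, same shape
        b, t, d = chunk.shape
        return self.out(chunk.reshape(b, t * d)).reshape(b, t, d)

model = DecoderOnlyModel()
history = [torch.randn(4, 10, 2) for _ in range(5)]   # observed past, split into sub-trajectories

# Decoder-only unrolling: the same model that processed the history keeps
# feeding its own predictions back in, so each step uses up-to-date context.
chunks = list(history)
for _ in range(3):                    # predict three future sub-trajectories
    chunks.append(model(chunks[-1]))  # in DONUT, attention also covers earlier chunks, the map, and other agents
future = torch.cat(chunks[len(history):], dim=1)      # (4, 30, 2)

In an encoder-decoder forecaster, by contrast, the history would be encoded once and all future timesteps decoded from that fixed representation.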

Method

DONUT architecture overview. Previously predicted sub-trajectories are fed through the proposer module to produce a proposal prediction and an overprediction. The reference point for all relative encodings is then moved to the endpoint of the proposed trajectory. Next, the refiner predicts offsets, which are added to the proposed trajectories to obtain the final predicted sub-trajectory and overprediction. Before the predicted sub-trajectory is used as input to the next decoder step, the reference point is updated once more, to the refined endpoint.
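
As a reading aid, the sketch below spells out one decoder step in the order described by the caption; the proposer/refiner call signatures, the context argument, and the tensor shapes are assumptions for illustration, not the paper's implementation.

import numpy as np

def decoder_step(proposer, refiner, prev_subtraj, ref_point, context):
    # prev_subtraj: (B, T, 2); ref_point: (B, 1, 2); proposer/refiner are callables.
    # 1) The proposer predicts a proposal sub-trajectory and an overprediction
    #    (a longer-horizon auxiliary trajectory), expressed relative to ref_point.
    proposal, over = proposer(prev_subtraj - ref_point, context)
    proposal, over = proposal + ref_point, over + ref_point

    # 2) Move the reference point to the endpoint of the proposed sub-trajectory.
    ref_point = proposal[:, -1:, :]

    # 3) The refiner predicts offsets that are added to the proposals.
    d_prop, d_over = refiner(proposal - ref_point, context)
    refined, refined_over = proposal + d_prop, over + d_over

    # 4) Update the reference point to the refined endpoint for the next step.
    ref_point = refined[:, -1:, :]
    return refined, refined_over, ref_point

# toy usage with identity/zero stand-ins for the actual networks
prop = lambda x, ctx: (x, x)            # pretend proposer: copies its input
refi = lambda x, ctx: (0 * x, 0 * x)    # pretend refiner: predicts zero offsets
traj = np.zeros((4, 10, 2))
out, over, ref = decoder_step(prop, refi, traj, traj[:, -1:, :], context=None)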

Proposer architecture. The input sub-trajectory is first tokenized relative to the reference point. Then, the tokens attend to (1) sub-trajectory tokens from previous decoder steps, (2) map tokens, (3) nearby agents, and (4) other modes of the same agent. All attention operations use relative positional encodings based on the current reference point. Finally, a detokenizer outputs the next sub-trajectory and an overprediction. The refiner has the exact same architecture; only its inputs and outputs differ.
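
Below is a hedged sketch of this four-stage attention pattern built from standard PyTorch multi-head attention. The tokenizer and detokenizer are reduced to linear layers, the relative positional encodings are omitted, and all sizes are assumptions, so this only illustrates the information flow rather than the paper's exact design.

import torch
import torch.nn as nn

class ProposerSketch(nn.Module):
    def __init__(self, steps=10, dim=2, d_model=128, heads=8):
        super().__init__()
        self.tokenize = nn.Linear(steps * dim, d_model)        # sub-trajectory -> token
        attn = lambda: nn.MultiheadAttention(d_model, heads, batch_first=True)
        self.temporal, self.map, self.agent, self.mode = attn(), attn(), attn(), attn()
        self.detokenize = nn.Linear(d_model, 2 * steps * dim)  # next sub-trajectory + overprediction

    def forward(self, subtraj, prev_tokens, map_tokens, agent_tokens, mode_tokens):
        # subtraj: (B, steps, dim), already expressed relative to the reference point
        B, _, d = subtraj.shape
        x = self.tokenize(subtraj.flatten(1)).unsqueeze(1)      # (B, 1, d_model)
        x = x + self.temporal(x, prev_tokens, prev_tokens)[0]   # (1) previous decoder steps
        x = x + self.map(x, map_tokens, map_tokens)[0]          # (2) map elements
        x = x + self.agent(x, agent_tokens, agent_tokens)[0]    # (3) nearby agents
        x = x + self.mode(x, mode_tokens, mode_tokens)[0]       # (4) other modes of this agent
        out = self.detokenize(x.squeeze(1))                     # (B, 2 * steps * dim)
        next_traj, overpred = out.chunk(2, dim=-1)
        return next_traj.view(B, -1, d), overpred.view(B, -1, d)

# toy usage: one agent token attending to made-up context token sets
m = ProposerSketch()
ctx = lambda n: torch.randn(2, n, 128)
next_traj, overpred = m(torch.randn(2, 10, 2), ctx(4), ctx(30), ctx(8), ctx(5))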

Quantitative Results

Ablation study. We demonstrate the effectiveness of (1) our decoder-only approach instead of the encoder-decoder baseline, (2) the overprediction objective, and (3) the refinement module. Evaluation on the Argoverse 2 validation set.
Performance at different prediction horizons. Compared to the encoder-decoder baseline, our decoder-only approach makes more accurate predictions at longer prediction horizons.
Comparison to the state of the art. We compare against published methods on the test set of the Argoverse 2 leaderboard and demonstrate that DONUT achieves state-of-the-art performance on the main b-minFDE6 metric. * denotes the use of model ensembling.
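
For reference, b-minFDE_K is commonly defined on Argoverse 2 as the minimum final displacement error over the K predicted modes plus a Brier penalty (1 - p)^2, where p is the probability assigned to the mode with the smallest endpoint error. The sketch below follows that common definition with illustrative shapes; it is not the official benchmark code.

import numpy as np

def brier_min_fde(pred_endpoints, probs, gt_endpoint):
    """pred_endpoints: (K, 2), probs: (K,) summing to 1, gt_endpoint: (2,)."""
    fde = np.linalg.norm(pred_endpoints - gt_endpoint, axis=-1)  # endpoint error per mode
    best = int(np.argmin(fde))                                   # mode with the smallest FDE
    return fde[best] + (1.0 - probs[best]) ** 2                  # Brier-weighted minFDE

# toy example with K = 6 modes and uniform mode probabilities
preds = np.random.randn(6, 2) + np.array([50.0, 10.0])
probs = np.full(6, 1 / 6)
print(brier_min_fde(preds, probs, np.array([50.0, 10.0])))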

BibTeX

@article{knoche2025donut,
  title   = {{DONUT: A Decoder-Only Model for Trajectory Prediction}},
  author  = {Knoche, Markus and de Geus, Daan and Leibe, Bastian},
  journal = {arXiv preprint arXiv:2506.06854},
  year    = {2025}
}