STARCaster: Spatio-Temporal AutoRegressive Video Diffusion for Identity- and View-Aware Talking Portraits

1Imperial College London, UK    2FAU Erlangen-Nürnberg, Germany

Abstract

This paper presents STARCaster, an identity-aware spatio-temporal video diffusion model that addresses both speech-driven portrait animation and free-viewpoint talking portrait synthesis, given an identity embedding or a reference image, within a unified framework. Existing 2D speech-to-video diffusion models depend heavily on reference guidance, which limits motion diversity. At the same time, 3D-aware animation typically relies on inversion through pre-trained tri-plane generators, which often leads to imperfect reconstructions and identity drift. We rethink reference- and geometry-based paradigms in two ways. First, we deviate from strict reference conditioning at pre-training by introducing softer identity constraints. Second, we address 3D awareness implicitly within the 2D video domain by leveraging the inherent multi-view nature of video data. STARCaster adopts a compositional approach that progresses from ID-aware motion modeling, to audio-visual synchronization via lip-reading-based supervision, and finally to novel-view animation through temporal-to-spatial adaptation. To overcome the scarcity of 4D audio-visual data, we propose a decoupled learning approach in which view consistency and temporal coherence are trained independently. A self-forcing training scheme enables the model to learn from longer temporal contexts than those generated at inference, mitigating the overly static animations common in existing autoregressive approaches. Comprehensive evaluations demonstrate that STARCaster generalizes effectively across tasks and identities, consistently surpassing prior approaches across diverse benchmarks.

Overview

Given a driving audio signal, STARCaster can animate real portrait images (first row) or synthesize entirely novel, identity-consistent talking sequences using only an identity embedding of the subject (second row). Thanks to its unified spatio-temporal design, which leverages the implicit multi-view structure in video data, our model generalizes to continuous and smooth viewpoint manipulation without relying on explicit 3D representations.

Method

Building on a pre-trained identity-aware image backbone, we construct an autoregressive video diffusion architecture that unifies identity- and audio-driven animation, reference-based synthesis, and viewpoint control. Specifically, we extend the core attention block of the 2D UNet with a decoupled multi-source cross-attention mechanism for integrating independent conditioning streams (identity, audio, and camera), and an extended self-attention mechanism for injecting appearance features from the input or past frames via a reference encoder.
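To make this concrete, below is a minimal PyTorch sketch of how a decoupled multi-source cross-attention block of this kind can be structured. The module and tensor names (DecoupledCrossAttention, id_tokens, audio_tokens, cam_tokens) are illustrative assumptions rather than the authors' implementation, and the extended self-attention over reference features is omitted for brevity.

import torch
import torch.nn as nn

class DecoupledCrossAttention(nn.Module):
    """Attend over identity, audio, and camera streams with separate attention
    layers, then add each stream's output as a residual to the UNet tokens."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn_id = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_audio = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_cam = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, hidden, id_tokens, audio_tokens, cam_tokens):
        # hidden:       (B, N, dim)    spatial tokens of the current frames
        # id_tokens:    (B, L_id, dim) identity embedding tokens
        # audio_tokens: (B, L_a, dim)  per-frame audio features
        # cam_tokens:   (B, L_c, dim)  camera / viewpoint tokens
        q = self.norm(hidden)
        out_id, _ = self.attn_id(q, id_tokens, id_tokens)
        out_audio, _ = self.attn_audio(q, audio_tokens, audio_tokens)
        out_cam, _ = self.attn_cam(q, cam_tokens, cam_tokens)
        # Each conditioning stream contributes an independent residual update,
        # keeping the three sources decoupled from one another.
        return hidden + out_id + out_audio + out_cam

Keeping a separate attention layer per conditioning source means any stream can be dropped (for example, camera tokens during purely audio-driven animation) without disturbing the others.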

[Figure: method overview]

Our progressive training strategy departs from existing talking portrait approaches in three ways:

  1. Before addressing reference-based animation, we pre-train the video model using only identity and audio conditioning, enabling free-motion synthesis under softer identity constraints than those imposed by repeated portrait conditioning. We also introduce a perceptual lip-reading loss that enhances audiovisual alignment.
  2. Then, we propose a self-forcing strategy that performs autoregressive generation during training using the model's own previous predictions (see the sketch after this list). This teaches the model to correct its outputs over longer temporal contexts, resulting in more natural motion, greater expressiveness, and improved identity consistency.
  3. Finally, a lightweight adaptation stage optimizes the model for viewpoint manipulation using a synthetic dataset of 3D heads rendered along smooth camera trajectories, enabling spatio-temporal generation at inference time.
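As a concrete illustration of step 2, the following is a minimal sketch of one self-forcing training step, assuming a hypothetical noise-predicting video diffusion model called as model(noisy, t, audio, identity, context) and a standard DDPM-style noise schedule. It is a simplified picture of training on the model's own rollouts, not the exact procedure used in the paper.

import torch
import torch.nn.functional as F

T_STEPS = 1000
betas = torch.linspace(1e-4, 0.02, T_STEPS)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def self_forcing_step(model, video_windows, audio_windows, id_embed):
    """video_windows: list of (B, C, T, H, W) tensors over consecutive temporal
    windows of one clip; audio_windows: the matching audio features."""
    context = None                      # model-generated past frames
    total_loss = 0.0
    for x0, audio in zip(video_windows, audio_windows):
        b = x0.shape[0]
        t = torch.randint(0, T_STEPS, (b,), device=x0.device)
        a = alphas_cumprod.to(x0.device)[t].view(b, 1, 1, 1, 1)
        noise = torch.randn_like(x0)
        xt = a.sqrt() * x0 + (1.0 - a).sqrt() * noise      # forward diffusion
        # The model sees its OWN previous predictions as context, not
        # ground-truth frames, mirroring the autoregressive inference setup.
        eps_pred = model(xt, t, audio=audio, identity=id_embed, context=context)
        total_loss = total_loss + F.mse_loss(eps_pred, noise)
        with torch.no_grad():
            # One-step x0 estimate from the predicted noise becomes the context
            # for the next window, so accumulated errors can be corrected.
            x0_hat = (xt - (1.0 - a).sqrt() * eps_pred) / a.sqrt()
            context = x0_hat.detach()
    return total_loss / len(video_windows)

Because the context is detached between windows, each window is supervised independently while still being exposed to the kind of imperfect history the model will produce at inference time.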

Results

ID-Driven Animations

Comparison to Audio-Driven Baselines

Effect of Lip-Reading Loss

During audiovisual training, we incorporate a pre-trained lip-reading network to supervise the correspondence between generated and ground-truth mouth movements. As can be seen, while the model trained without lip-reading supervision still produces reasonable audiovisual alignment, it often fails to precisely generate the appropriate mouth shape.
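The sketch below shows one way such lip-reading supervision can be written as a loss, again in PyTorch. Here lip_reader and crop_mouth are hypothetical stand-ins for a frozen pre-trained lip-reading network and a mouth-region cropping routine; the actual network, crop, and feature distance used in the paper may differ.

import torch
import torch.nn.functional as F

def lip_reading_loss(lip_reader, crop_mouth, generated, ground_truth):
    """generated / ground_truth: (B, C, T, H, W) video tensors in the same range."""
    lip_reader.eval()   # the lip reader is frozen; its weights are not optimized
    with torch.no_grad():
        target_feats = lip_reader(crop_mouth(ground_truth))   # reference lip features
    # Gradients flow through the generated frames back into the video model.
    pred_feats = lip_reader(crop_mouth(generated))
    return F.l1_loss(pred_feats, target_feats)

Added to the diffusion objective with a small weight, a term of this kind penalizes mouth shapes that look plausible but do not match the spoken phonemes, which a pixel-level loss alone tends to miss.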

BibTeX

@article{paraperas2025starcaster,
      title={STARCaster: Spatio-Temporal AutoRegressive Video Diffusion for Identity- and View-Aware Talking Portraits},
      author={Paraperas Papantoniou, Foivos and Galanakis, Stathis and Potamias, Rolandos Alexandros and Kainz, Bernhard and Zafeiriou, Stefanos},
      journal={arXiv preprint arXiv:2512.13247},
      year={2025}
}