SARA: Semantically Adaptive Relational Alignment for Video Diffusion Models

Overview

Abstract

Recent video diffusion models (VDMs) synthesize visually convincing clips, yet still drop entities, mis-bind attributes, and weaken the interactions specified in the prompt. Representation-alignment objectives such as VideoREPA and MoAlign improve fine-grained text following by distilling spatio-temporal token relations from a frozen visual foundation model, but their pairwise supervision budget is allocated by visual or motion cues rather than by how relevant each pair is to the prompt.

We present SARA, Semantically Adaptive Relational Alignment, which keeps token-relation distillation (TRD) on a frozen VFM target and adds a text-conditioned saliency that decides which token pairs carry supervision. A lightweight Stage 1 aligner is trained with per-entity SAM 3.1 mask supervision and an InfoNCE regulariser, and its continuous saliency is fused into TRD through a pair-routing operator that assigns each token pair a weight whenever either of its two endpoints is salient. In the Wan2.2 continual-training setting, SARA improves both text alignment and motion quality over SFT, VideoREPA, and MoAlign on a 13-dimension VLM rubric, on the public VBench benchmarks, and in a blind user study.

TL;DR

SARA recasts representation alignment for video diffusion as a pair-routing problem. A frozen text-conditioned saliency aligner (trained from per-entity SAM 3.1 masks + InfoNCE) tells token-relation distillation which token pairs to supervise, focusing the loss on subject–subject and subject–background relations rather than background filler.

What's new

Key Contributions

Pair-routing view

We recast semantic adaptation for VDMs as a pair-routing problem on top of TRD, and formalise a family of pair-routing operators (AND / OR / XOR) that decide which token pairs carry supervision.

Text-conditioned saliency

A lightweight aligner is trained from per-entity SAM 3.1 masks, per-entity captions, and an InfoNCE regulariser, producing a calibrated continuous saliency that is then fused into TRD.

Consistent gains

Under matched Wan2.2 high-noise continual training, SARA outperforms SFT, VideoREPA, and a MoAlign reproduction on a 13-dimension VLM rubric, on VBench-1.0 / 2.0, and in a blind user study.

How it works

Pipeline

SARA decouples where relational alignment should be applied from how it is computed. Stage 1 trains a saliency aligner with per-entity supervision; Stage 2 freezes the aligner and uses its prediction to route a masked token-relation distillation loss during continual training of the VDM.

Stage 1 · Train

Text-conditioned saliency aligner

V-JEPA tokens fused with the caption via cross-attention, supervised by per-entity SAM 3.1 masks (BCE) and an InfoNCE regulariser.

→

Stage 2 · Freeze

Continuous saliency M_p

The frozen aligner is queried with the full caption and emits a per-patch saliency on the V-JEPA grid.

→

Stage 2 · Train VDM

OR-routed masked TRD

Pair weight W_ij^∨ = w_i+w_j−w_iw_j routes TRD onto subject–subject and subject–background pairs, dropping background–background filler.

Inside the model

Method Visualizations

What does the saliency aligner actually learn? The figures below trace the supervision signal end-to-end: from the per-entity SAM 3.1 masks that train Stage 1, to the four query types that the aligner routes during training, to the continuous saliency that Stage 2 emits on the full prompt at inference time, and finally to how the OR pair-routing operator reshapes the TRD budget.

Per-entity SAM 3.1 supervision (Stage 1 targets)

SAM 3.1 per-entity decomposition — **SAM 3.1 entity decomposition used as Stage 1 supervision.** Top row: input frames. Next three rows: per-entity masks for `OBJECT_1` (red), `PERSON_1` (green), `PERSON_2` (blue). Last two rows: complement `BACKGROUND` (M_bg = 1−M_fg, yellow) and foreground union `ALL Entities` (M_fg, magenta). All five masks supervise the saliency head jointly via K+2 forwards.

Stage 1 saliency on the four supervision-time query types

Same clip, four different captions fed to the aligner. Each panel shows the input frames, the SAM 3.1 reference mask M_y, a PCA visualization of the text-conditioned features V'_y, and the predicted saliency M_p (jet colormap, redder = higher).

Stage 1 saliency: PERSON_1 query — c₁ = PERSON_1 The aligner concentrates almost all its mass on the targeted person and leaves the rest of the scene close to zero.

Stage 1 saliency: PERSON_2 query — c₂ = PERSON_2 Swapping the entity caption shifts the highlight cleanly to the second subject without touching the first.

Stage 1 saliency: combined-entity query — c_fg = [c₁; c₂; …] Concatenating per-entity captions yields a calibrated foreground union that matches the SAM M_fg reference.

Stage 1 saliency: background query — c_bg = background A pure scene caption pushes saliency onto the cafeteria walls and floor while suppressing the people — exactly the inverse of the union mask.

Stage 2 saliency on the full MTSS caption (inference)

Evidence

Quantitative Results

Under matched Wan2.2 high-noise continual training, SARA wins every aggregate score across three independent protocols. The 13-dimension VLM rubric is averaged over three judges (Qwen3.5-27B, Qwen3.6-35B-A3B, Gemma-4-31B-it).

VLM Rubric (mean over 3 judges)

Method	TA mean	TA vote	MQ mean	MQ vote
Real video (oracle)	4.586	4.648	4.431	4.581
Pretrained Wan2.2	3.919	3.926	3.818	3.877
SFT	4.121	4.139	3.784	3.851
VideoREPA	4.125	4.154	3.802	3.865
MoAlign	4.127	4.154	3.802	3.871
SARA (ours)	4.154	4.167	3.852	3.919

Public VBench (%)

Method	VB-1.0 Sem.	VB-2.0 Final
Pretrained Wan2.2	72.74	55.00
VideoREPA	72.99	55.24
MoAlign	72.95	55.81
SARA (ours)	73.89	56.19

SARA: Semantically Adaptive Relational Alignment for Video Diffusion Models

Abstract