Tencent Hunyuan

SARA: Semantically Adaptive Relational Alignment for Video Diffusion Models

Jiesong Lian1,2†  ·  Zixiang Zhou2  ·  Ruizhe Zhong3  ·  Yuan Zhou2‡  ·  Qinglin Lu2
Rui Wang  ·  Long Hu1  ·  Yixue Hao1  ·  Baoru Huang4

1Huazhong University of Science and Technology  ·  2Tencent Hunyuan  ·  3Shanghai Jiao Tong University  ·  4University of Liverpool

†Work done during internship at Tencent Hunyuan.   ‡Project leader.   §Corresponding author.

[Figure: SARA teaser, pair-routing overview]
SARA routes representation-alignment supervision by the prompt, not by pixels. (1) Token-relation distillation matches the pairwise token relations of the DiT features $V_p$ to those of a frozen VFM $V_y$, weighting all $O(N^2)$ pairs equally. In this example FG–FG and FG–BG pairs account for only 23% and 50% of all pairs, so 27% of the budget is spent on background–background pairs that seldom carry the subject interactions the prompt describes. (2) Semantically Adaptive Relational Distillation feeds the prompt entities into a frozen text-conditioned Semantic Aligner $\Phi$ that predicts a per-token saliency; an OR pair-routing operator $W_{ij} = w_i + w_j - w_i w_j$ then keeps adaptive weights on FG–FG and FG–BG pairs while dropping BG–BG.
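The 23% / 50% / 27% split quoted in the teaser is what a simple pair count gives when a little under half of the tokens are foreground. The sketch below is a back-of-the-envelope check rather than anything taken from the authors' code; the foreground fraction of 0.48 is an assumption inferred from the quoted percentages, and `pair_budget` is a hypothetical helper name.

```python
def pair_budget(fg_fraction: float) -> dict:
    """Share of all ordered token pairs (i, j) falling in each category,
    given the fraction of tokens that are foreground."""
    f = fg_fraction
    b = 1.0 - f
    return {
        "fg_fg": f * f,        # both endpoints are foreground
        "fg_bg": 2.0 * f * b,  # exactly one endpoint is foreground
        "bg_bg": b * b,        # both endpoints are background
    }

# Assumed foreground fraction; the source only reports the pair shares.
shares = pair_budget(0.48)
for name, share in shares.items():
    print(f"{name}: {share:.0%}")
```

With this assumed fraction the three shares round to 23%, 50%, and 27%, which matches the figures in the caption. Uniform weighting therefore spends roughly a quarter of the pairwise budget on background–background pairs.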
Overview

Abstract

Recent video diffusion models (VDMs) synthesize visually convincing clips, yet still drop entities, mis-bind attributes, and weaken the interactions specified in the prompt. Representation-alignment objectives such as VideoREPA and MoAlign improve fine-grained text following by distilling spatio-temporal token relations from a frozen visual foundation model, but their pairwise supervision budget is allocated by visual or motion cues rather than by how relevant each pair is to the prompt.

We present SARA, Semantically Adaptive Relational Alignment, which keeps token-relation distillation (TRD) on a frozen VFM target and adds a text-conditioned saliency that decides which token pairs carry supervision. A lightweight Stage 1 aligner is trained with per-entity SAM 3.1 mask supervision and an InfoNCE regulariser, and its continuous saliency is fused into TRD through a pair-routing operator that assigns each token pair a weight whenever either of its two endpoints is salient. In the Wan2.2 continual-training setting, SARA improves both text alignment and motion quality over SFT, VideoREPA, and MoAlign on a 13-dimension VLM rubric, on the public VBench benchmarks, and in a blind user study.

TL;DR

SARA recasts representation alignment for video diffusion as a pair-routing problem. A frozen text-conditioned saliency aligner (trained from per-entity SAM 3.1 masks + InfoNCE) tells token-relation distillation which token pairs to supervise, focusing the loss on subject–subject and subject–background relations rather than background filler.

What's new

Key Contributions

Pair-routing view

We recast semantic adaptation for VDMs as a pair-routing problem on top of TRD, and formalise a family of pair-routing operators (AND / OR / XOR) that decide which token pairs carry supervision.
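One way to picture the operator family is as soft logic gates applied to the two endpoint saliencies. The sketch below assumes the OR operator is the probabilistic OR, $w_i + w_j - w_i w_j$; the AND and XOR forms are the standard soft-logic counterparts and are assumptions here, since this page does not spell them out. All function names are hypothetical.

```python
def route_and(wi: float, wj: float) -> float:
    """Soft AND: large only when both endpoints are salient."""
    return wi * wj

def route_or(wi: float, wj: float) -> float:
    """Soft OR: large when at least one endpoint is salient."""
    return wi + wj - wi * wj

def route_xor(wi: float, wj: float) -> float:
    """Soft XOR: large when exactly one endpoint is salient."""
    return wi + wj - 2.0 * wi * wj

def pair_weights(saliency, op):
    """Build the N x N pair-weight matrix for a per-token saliency list."""
    n = len(saliency)
    return [[op(saliency[i], saliency[j]) for j in range(n)] for i in range(n)]

# Binary saliency for readability: two foreground tokens, two background.
saliency = [1.0, 1.0, 0.0, 0.0]
W_and = pair_weights(saliency, route_and)
W_or = pair_weights(saliency, route_or)
W_xor = pair_weights(saliency, route_xor)
```

On binary saliency, AND keeps only FG–FG pairs, XOR keeps only FG–BG pairs, and OR keeps both while zeroing BG–BG, which is the behaviour described for SARA.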

Text-conditioned saliency

A lightweight aligner is trained from per-entity SAM 3.1 masks, per-entity captions, and an InfoNCE regulariser, producing a calibrated continuous saliency that is then fused into TRD.

Consistent gains

Under matched Wan2.2 high-noise continual training, SARA outperforms SFT, VideoREPA, and a MoAlign reproduction on a 13-dimension VLM rubric, on VBench-1.0 / 2.0, and in a blind user study.

How it works

Pipeline

SARA decouples where relational alignment should be applied from how it is computed. Stage 1 trains a saliency aligner with per-entity supervision; Stage 2 freezes the aligner and uses its prediction to route a masked token-relation distillation loss during continual training of the VDM.

[Figure: SARA pipeline overview]
Overview of SARA. Stage 1 (top): a lightweight aligner on top of frozen V-JEPA, SAM 3.1, and Qwen3-VL-Embedding backbones learns, for any (video, caption) pair, a text-conditioned per-patch saliency $M_p$, supervised jointly by per-entity, combined-entity, and background SAM masks (BCE) and calibrated by a caption-level InfoNCE. Stage 2 (bottom): the frozen aligner is queried with the full caption, and its saliency is turned into pair weights that route a masked token-relation distillation loss, which is added to the diffusion loss of a trainable DiT.
Stage 1 · Train

Text-conditioned saliency aligner

V-JEPA tokens fused with the caption via cross-attention, supervised by per-entity SAM 3.1 masks (BCE) and an InfoNCE regulariser.
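The Stage 1 objective described above combines a per-patch BCE term with an InfoNCE term. The following is a minimal pure-Python sketch of that combination under stated assumptions: the loss weight `nce_weight`, the `temperature`, and the one-directional form of InfoNCE are illustrative choices, not values from the paper, and the real aligner operates on backbone features rather than toy lists.

```python
import math

def bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy between predicted saliency and a mask."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / len(pred)

def info_nce(sim, temperature=0.07):
    """InfoNCE over a square similarity matrix whose diagonal holds
    the matched (video, caption) pairs."""
    loss = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        top = max(logits)  # subtract the max for numerical stability
        log_z = top + math.log(sum(math.exp(l - top) for l in logits))
        loss += log_z - logits[i]
    return loss / len(sim)

def stage1_loss(pred, mask, sim, nce_weight=0.1):
    """BCE on the saliency map plus a weighted InfoNCE regulariser."""
    return bce(pred, mask) + nce_weight * info_nce(sim)

# Toy example: four patches and a batch of two (video, caption) pairs.
pred = [0.9, 0.8, 0.1, 0.2]
mask = [1.0, 1.0, 0.0, 0.0]
sim = [[0.9, 0.1], [0.2, 0.8]]
loss = stage1_loss(pred, mask, sim)
```

In the setting described on this page the BCE term would be evaluated once per query type (each entity, the combined entities, and the background), with the InfoNCE term computed at the caption level.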

Stage 2 · Freeze

Continuous saliency Mp

The frozen aligner is queried with the full caption and emits a per-patch saliency on the V-JEPA grid.

Stage 2 · Train VDM

OR-routed masked TRD

Pair weight $W_{ij} = w_i + w_j - w_i w_j$ routes TRD onto subject–subject and subject–background pairs, dropping background–background filler.
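To make the routing concrete, the sketch below weights a token-relation distillation loss by the OR pair weight $w_i + w_j - w_i w_j$. It is a toy illustration, not the authors' implementation: using cosine similarity for the relations, an L1 gap, and normalising by the total weight are all assumptions, and student and teacher are assumed to already share the same token grid.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)

def relation_matrix(tokens):
    """N x N matrix of pairwise token similarities."""
    return [[cosine(a, b) for b in tokens] for a in tokens]

def or_routed_trd(student, teacher, saliency):
    """Weighted L1 gap between student and teacher relations,
    using OR pair weights W_ij = w_i + w_j - w_i * w_j."""
    rel_s = relation_matrix(student)
    rel_t = relation_matrix(teacher)
    weighted_gap = 0.0
    total_weight = 0.0
    n = len(saliency)
    for i in range(n):
        for j in range(n):
            w_ij = saliency[i] + saliency[j] - saliency[i] * saliency[j]
            weighted_gap += w_ij * abs(rel_s[i][j] - rel_t[i][j])
            total_weight += w_ij
    return weighted_gap / (total_weight + 1e-8)

# Toy inputs: three tokens; the first is salient, the others are background.
teacher = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
student = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
saliency = [1.0, 0.0, 0.0]

loss = or_routed_trd(student, teacher, saliency)
```

In this toy case the relation between tokens 1 and 2 (both background) receives zero weight, so only mismatches involving the salient token contribute to the loss. In Stage 2 this term would be added to the diffusion loss with some weighting factor, which the page does not specify.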

Inside the model

Method Visualizations

What does the saliency aligner actually learn? The figures below trace the supervision signal end-to-end: from the per-entity SAM 3.1 masks that train Stage 1, to the four query types that the aligner routes during training, to the continuous saliency that Stage 2 emits on the full prompt at inference time, and finally to how the OR pair-routing operator reshapes the TRD budget.

Per-entity SAM 3.1 supervision (Stage 1 targets)

[Figure: SAM 3.1 per-entity decomposition]
SAM 3.1 entity decomposition used as Stage 1 supervision. Top row: input frames. Next three rows: per-entity masks for OBJECT_1 (red), PERSON_1 (green), PERSON_2 (blue). Last two rows: complement BACKGROUND ($M_{bg} = 1 - M_{fg}$, yellow) and foreground union ALL Entities ($M_{fg}$, magenta). All five masks supervise the saliency head jointly via $K+2$ forward passes (here $K = 3$ entities).
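The K+2 supervision targets can be assembled from the per-entity masks alone. The sketch below assumes binary masks flattened to one value per patch, takes the foreground as the element-wise union of the entity masks, and takes the background as its complement; `build_targets` is a hypothetical helper, not a function from the SARA code.

```python
def build_targets(entity_masks):
    """Return the K per-entity masks followed by the foreground union
    and the background complement, i.e. K + 2 targets in total."""
    num_patches = len(entity_masks[0])
    # Foreground union: a patch is foreground if any entity covers it.
    fg = [max(mask[p] for mask in entity_masks) for p in range(num_patches)]
    # Background is the complement of the foreground union.
    bg = [1 - value for value in fg]
    return [list(mask) for mask in entity_masks] + [fg, bg]

# Toy masks over six patches for K = 3 entities.
masks = [
    [1, 0, 0, 0, 0, 0],  # OBJECT_1
    [0, 1, 1, 0, 0, 0],  # PERSON_1
    [0, 0, 0, 1, 0, 0],  # PERSON_2
]
targets = build_targets(masks)
print(len(targets))  # 5 targets: three entities, union, background
```

Each target would be paired with its own caption (an entity caption, the concatenated entity captions, or a background caption), giving one forward pass of the aligner per target.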

Stage 1 saliency on the four supervision-time query types

Same clip, four different captions fed to the aligner. Each panel shows the input frames, the SAM 3.1 reference mask $M_y$, a PCA visualization of the text-conditioned features $V'_y$, and the predicted saliency $M_p$ (jet colormap, redder = higher).

PERSON_1 query ($c_1$ = PERSON_1): The aligner concentrates almost all its mass on the targeted person and leaves the rest of the scene close to zero.
PERSON_2 query ($c_2$ = PERSON_2): Swapping the entity caption shifts the highlight cleanly to the second subject without touching the first.
Combined-entity query ($c_{fg} = [c_1; c_2; …]$): Concatenating per-entity captions yields a calibrated foreground union that matches the SAM $M_{fg}$ reference.
Background query ($c_{bg}$ = background): A pure scene caption pushes saliency onto the cafeteria walls and floor while suppressing the people, exactly the inverse of the union mask.

Stage 2 saliency on the full MTSS caption (inference)

Stage 2 saliency on the full MTSS caption. The aligner is now queried with the same long structured caption that the VDM consumes at TRD time. Crucially, $M_p$ is not a hard binary mask: it places the highest values on the named subjects, fades through intermediate values on nearby background, and drops to low values on far-field background. This is what the OR pair weight $W$ needs in order to grade, rather than just gate, each subject–background pair.
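A small numeric example shows the difference between grading and gating. The saliency values below are illustrative assumptions, not measurements from the model, and the 0.5 threshold used for the binary comparison is likewise hypothetical; the OR weight is taken to be $w_i + w_j - w_i w_j$.

```python
def or_weight(wi: float, wj: float) -> float:
    """OR pair weight for two endpoint saliencies."""
    return wi + wj - wi * wj

# Assumed continuous saliency for three kinds of token.
soft = {"subject": 0.9, "near_bg": 0.5, "far_bg": 0.1}
# The same tokens after hard thresholding at 0.5 (hypothetical gate).
hard = {name: 1.0 if value >= 0.5 else 0.0 for name, value in soft.items()}

pairs = [
    ("subject", "near_bg"),
    ("subject", "far_bg"),
    ("near_bg", "far_bg"),
    ("far_bg", "far_bg"),
]
graded = {pair: or_weight(soft[pair[0]], soft[pair[1]]) for pair in pairs}
gated = {pair: or_weight(hard[pair[0]], hard[pair[1]]) for pair in pairs}

for pair in pairs:
    print(f"{pair[0]} / {pair[1]}: graded={graded[pair]:.2f} gated={gated[pair]:.0f}")
```

With these assumed values the graded weights are 0.95, 0.91, 0.55, and 0.19, so pairs are ordered by how close they sit to the subject. The gated version can only assign 1 or 0, giving the first three pairs the same weight and discarding the last entirely.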
Evidence

Quantitative Results

Under matched Wan2.2 high-noise continual training, SARA achieves the best aggregate score among the compared methods on all three evaluation protocols. The 13-dimension VLM rubric is averaged over three judges (Qwen3.5-27B, Qwen3.6-35B-A3B, Gemma-4-31B-it). In the first table, TA denotes text alignment and MQ denotes motion quality.

VLM Rubric (mean over 3 judges)

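| Method | TA mean | TA vote | MQ mean | MQ vote |
| --- | --- | --- | --- | --- |
| Real video (oracle) | 4.586 | 4.648 | 4.431 | 4.581 |
| Pretrained Wan2.2 | 3.919 | 3.926 | 3.818 | 3.877 |
| SFT | 4.121 | 4.139 | 3.784 | 3.851 |
| VideoREPA | 4.125 | 4.154 | 3.802 | 3.865 |
| MoAlign | 4.127 | 4.154 | 3.802 | 3.871 |
| SARA (ours) | 4.154 | 4.167 | 3.852 | 3.919 |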

Public VBench (%)

| Method | VB-1.0 Sem. | VB-2.0 Final |
| --- | --- | --- |
| Pretrained Wan2.2 | 72.74 | 55.00 |
| VideoREPA | 72.99 | 55.24 |
| MoAlign | 72.95 | 55.81 |
| SARA (ours) | 73.89 | 56.19 |

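To read the VBench numbers as gains over the pretrained model, the short script below subtracts the Pretrained Wan2.2 row from each method. It only re-expresses the reported scores; the dictionary layout and variable names are illustrative.

```python
# (VB-1.0 Sem., VB-2.0 Final) in percent, as reported in the table above.
vbench = {
    "Pretrained Wan2.2": (72.74, 55.00),
    "VideoREPA": (72.99, 55.24),
    "MoAlign": (72.95, 55.81),
    "SARA (ours)": (73.89, 56.19),
}

base_sem, base_final = vbench["Pretrained Wan2.2"]
gains = {
    method: (round(sem - base_sem, 2), round(final - base_final, 2))
    for method, (sem, final) in vbench.items()
}

for method, (d_sem, d_final) in gains.items():
    print(f"{method:18s} {d_sem:+.2f} {d_final:+.2f}")
```

On these figures SARA gains 1.15 points on VB-1.0 Sem. and 1.19 points on VB-2.0 Final over the pretrained model, compared with at most 0.25 and 0.81 points for the two baselines.
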
See for yourself

Qualitative Comparisons

Side-by-side renderings on 18 multi-subject test prompts. All methods share the same Wan2.2 high-noise backbone, the same continual-training schedule, and the same V-JEPA target; only the auxiliary objective changes.

Columns compared: Pretrained Wan2.2 · SFT (CT_Baseline) · VideoREPA (CT-REPA) · MoAlign · SARA (ours)
Cite

BibTeX

If you find this work useful, please consider citing:

@article{lian2026sara,
  title   = {SARA: Semantically Adaptive Relational Alignment for Video Diffusion Models},
  author  = {Lian, Jiesong and Zhou, Zixiang and Zhong, Ruizhe and Zhou, Yuan and
             Lu, Qinglin and Wang, Rui and Hu, Long and Hao, Yixue and Huang, Baoru},
  journal = {arXiv preprint arXiv:2605.07800},
  year    = {2026}
}