Video Anomaly Understanding (VAU) is vital for real-world applications such as smart cities, security surveillance, and disaster response. Unlike conventional video analysis tasks, VAU demands fine-grained inspection and multi-step reasoning in complex, dynamic, and often unpredictable environments. While early research focused primarily on binary classification (normal vs. abnormal) and temporal localization, such methods offer limited interpretability and provide little understanding of the underlying causes of anomalies.
Recent advances in Multimodal Large Language Models (MLLMs) have improved transparency by generating textual descriptions of anomalous events. However, several major challenges remain.
To move beyond shallow detection and enable deeper, interpretable anomaly understanding, we ask a central question: What types of tasks (or reasoning pathways) can help enhance the understanding of anomalies? To explore this, we introduce a framework consisting of three parallel training tasks. These tasks are trained either individually or jointly, and their effectiveness is evaluated through a dedicated anomaly analysis task.
This decomposition enables models to acquire diverse reasoning skills, build deeper semantic understanding, and generate more interpretable outputs aligned with each task, ultimately supporting more robust and explainable video anomaly analysis.
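To make the co-training setup concrete, the sketch below shows how verifiable, task-specific rewards could be combined during reinforcement fine-tuning (RFT): an exact-match reward for QA and classification, a temporal-IoU reward for grounding, and a shared format reward. The function names and the 0.5 weighting are illustrative assumptions, not the exact implementation.

```python
# Illustrative sketch of task-specific rewards for reinforcement fine-tuning (RFT).
# The reward functions and the 0.5 weighting below are assumptions for illustration,
# not the exact VAU-R1 implementation.

def format_reward(response: str) -> float:
    """Reward outputs that wrap their reasoning in <think></think> tags."""
    return 1.0 if "<think>" in response and "</think>" in response else 0.0

def choice_reward(pred: str, gold: str) -> float:
    """Exact-match reward for multiple-choice QA and anomaly classification."""
    return 1.0 if pred.strip().lower() == gold.strip().lower() else 0.0

def interval_reward(pred: tuple[float, float], gold: tuple[float, float]) -> float:
    """Temporal IoU reward for anomaly grounding (predicted vs. ground-truth interval)."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0

def total_reward(task: str, response: str, pred, gold) -> float:
    """Combine the task-specific reward with a shared format reward."""
    task_r = interval_reward(pred, gold) if task == "TAG" else choice_reward(pred, gold)
    return task_r + 0.5 * format_reward(response)
```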
Statistics of our VAU-Bench. (a) Distribution of main anomaly types. (b) Distribution of video durations (top) and the proportion of anomalous segments within each video (bottom). (c) The evaluation criteria for four VAU tasks.
We evaluate VAU-R1 on the VAU-Bench dataset, which encompasses diverse real-world scenarios from MSAD, UCF-Crime, and ECVA. The evaluation focuses on four key tasks: multiple-choice question answering (QA), temporal anomaly grounding, anomaly classification, and anomaly analysis.
Base models often score lower with `<think>` prompts, indicating that unstructured reasoning introduces noise. In contrast, RFT improves both Acc (w/o think) and Acc (w/ think) over the corresponding base models, while maintaining output structure.

| Dataset | Model | Acc (w/o think) | Acc (w/ think) | CLS | KM | FLU | INF | FAC | Total |
|---|---|---|---|---|---|---|---|---|---|
| MSAD | InternVL2.5-2B | 76.67 | 72.08 | 6.84 | 6.23 | 8.55 | 6.64 | 6.64 | 34.90 |
| | Qwen2.5-VL-7B | 84.58 | 83.33 | 6.75 | 6.41 | 9.27 | 7.74 | 6.92 | 37.08 |
| | InternVL2.5-8B-MPO | 82.50 | 84.17 | 6.83 | 6.33 | 8.32 | 6.37 | 6.86 | 34.72 |
| | Qwen2-VL-2B | 77.08 | 72.50 | 5.94 | 5.43 | 8.77 | 6.29 | 5.90 | 32.25 |
| | + SFT | 82.92 | 85.83 | 6.04 | 5.43 | 8.89 | 6.55 | 5.93 | 32.84 |
| | + RFT | 82.92 ↑5.84 | 83.75 ↑11.25 | 6.05 ↑ | 5.49 ↑ | 8.89 | 6.50 ↑ | 6.05 ↑ | 32.98 ↑ |
| | Qwen2.5-VL-3B | 85.83 | 82.50 | 5.77 | 5.24 | 9.02 | 6.74 | 5.70 | 32.47 |
| | + SFT | 86.25 | 84.58 | 2.89 | 2.22 | 4.89 | 3.52 | 2.44 | 15.96 |
| | + RFT | 88.33 ↑2.50 | 87.08 ↑4.58 | 5.97 ↑ | 5.49 ↑ | 9.05 ↑ | 6.84 ↑ | 6.03 ↑ | 33.38 ↑ |
| UCF-Crime | InternVL2.5-2B | 84.86 | 68.13 | 4.40 | 3.08 | 8.09 | 5.69 | 3.47 | 24.74 |
| | Qwen2.5-VL-7B | 92.03 | 89.64 | 4.80 | 3.73 | 8.95 | 7.05 | 4.25 | 28.78 |
| | InternVL2.5-8B-MPO | 89.64 | 90.44 | 3.79 | 3.20 | 8.23 | 5.77 | 3.48 | 24.47 |
| | Qwen2-VL-2B | 87.25 | 83.67 | 3.47 | 2.48 | 7.75 | 4.49 | 2.82 | 21.02 |
| | + SFT | 83.67 | 86.06 | 3.61 | 2.26 | 7.30 | 4.79 | 2.70 | 20.66 |
| | + RFT | 88.45 ↑1.20 | 88.05 ↑4.38 | 4.04 ↑ | 2.75 ↑ | 7.72 ↑ | 4.89 ↑ | 3.11 ↑ | 22.52 ↑ |
| | Qwen2.5-VL-3B | 91.63 | 83.27 | 4.31 | 2.88 | 8.70 | 5.95 | 3.27 | 25.10 |
| | + SFT | 90.84 | 90.44 | 1.80 | 1.01 | 4.15 | 2.82 | 1.11 | 10.89 |
| | + RFT | 92.03 ↑0.40 | 91.63 ↑8.36 | 4.42 ↑ | 2.98 ↑ | 8.71 ↑ | 5.98 ↑ | 3.39 ↑ | 25.49 ↑ |
Comparison of performance on the MSAD and UCF-Crime datasets for the multiple-choice QA task and the anomaly analysis task. QA accuracy is reported for inference with and without Chain-of-Thought ("think") prompts; CLS, KM, FLU, INF, and FAC are VAU-Eval dimension scores (0-10), with Total as their sum. Arrows (↑) mark improvements of RFT over the corresponding base model.
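For reference, the "w/o think" and "w/ think" settings differ only in the inference prompt. The snippet below is a minimal sketch of the two modes, assuming an R1-style template that places reasoning in `<think>` tags and the final answer in `<answer>` tags; the exact instruction wording is an assumption.

```python
import re

# Minimal sketch of the two inference modes compared above. Only the
# <think>/<answer> structure is assumed; the instruction wording is illustrative.

def build_prompt(question: str, with_think: bool) -> str:
    if with_think:
        return (
            f"{question}\n"
            "First reason about the video inside <think></think> tags, "
            "then give the final answer inside <answer></answer> tags."
        )
    return f"{question}\nAnswer with the option letter only."

def parse_answer(response: str) -> str:
    """Extract the final answer, stripping the reasoning block if present."""
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    return match.group(1).strip() if match else response.strip()
```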
| Dataset | Model | mIoU (w/o think) | R@0.3 (w/o think) | R@0.5 (w/o think) | R@0.7 (w/o think) | mIoU (w/ think) | R@0.3 (w/ think) | R@0.5 (w/ think) | R@0.7 (w/ think) |
|---|---|---|---|---|---|---|---|---|---|
| MSAD | Qwen2-VL-2B | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| | Qwen2.5-VL-7B | 45.90 | 70.83 | 45.83 | 21.67 | 17.57 | 26.67 | 11.67 | 3.33 |
| | Qwen2.5-VL-3B | 21.27 | 30.00 | 10.83 | 4.17 | 13.00 | 16.67 | 5.83 | 1.67 |
| | + SFT | 30.65 | 47.50 | 30.00 | 9.17 | 35.17 | 50.83 | 34.17 | 15.00 |
| | + RFT | 35.77 ↑14.50 | 53.33 | 34.17 | 15.83 | 30.70 ↑17.70 | 48.33 | 29.17 | 12.50 |
| ECVA | Qwen2-VL-2B | 0.00 | 0.00 | 0.00 | 0.00 | 0.17 | 0.30 | 0.00 | 0.00 |
| | Qwen2.5-VL-7B | 19.85 | 25.87 | 15.17 | 9.70 | 5.71 | 7.96 | 4.73 | 2.99 |
| | Qwen2.5-VL-3B | 14.21 | 17.16 | 6.47 | 3.23 | 6.35 | 7.21 | 1.99 | 0.50 |
| | + SFT | 45.30 | 66.67 | 49.75 | 24.13 | 45.96 | 65.67 | 51.00 | 26.12 |
| | + RFT | 35.09 ↑20.88 | 49.00 | 28.86 | 19.40 | 33.25 ↑26.90 | 48.51 | 30.60 | 18.41 |
| UCF-Crime (OOD) | Qwen2-VL-2B | 2.74 | 4.84 | 0.00 | 0.00 | 0.12 | 0.00 | 0.00 | 0.00 |
| | Qwen2.5-VL-7B | 22.72 | 33.87 | 16.13 | 8.06 | 4.89 | 8.06 | 1.61 | 0.00 |
| | Qwen2.5-VL-3B | 10.91 | 15.32 | 6.45 | 3.23 | 7.68 | 10.48 | 4.84 | 1.61 |
| | + SFT | 4.98 | 3.23 | 0.81 | 0.00 | 5.76 | 5.65 | 0.81 | 0.81 |
| | + RFT | 16.80 ↑5.89 | 23.39 | 8.06 | 4.03 | 9.21 ↑1.53 | 9.68 | 4.03 | 1.61 |
Comparison of temporal anomaly grounding performance on three datasets: MSAD, ECVA, and UCF-Crime. All models are trained only on MSAD and ECVA, while UCF-Crime serves as an out-of-distribution (OOD) test set to assess cross-dataset generalization. Arrows (↑) mark mIoU gains of RFT over the base Qwen2.5-VL-3B.
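The grounding metrics above follow the standard temporal-localization definitions: the temporal IoU between the predicted and ground-truth anomalous interval, averaged over videos (mIoU) or thresholded at 0.3/0.5/0.7 (R@k). A minimal sketch:

```python
# Standard temporal grounding metrics used above: mean temporal IoU (mIoU) and
# Recall@k, the fraction of videos whose predicted interval reaches IoU >= k.

def temporal_iou(pred: tuple[float, float], gold: tuple[float, float]) -> float:
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0

def grounding_metrics(preds, golds, thresholds=(0.3, 0.5, 0.7)):
    ious = [temporal_iou(p, g) for p, g in zip(preds, golds)]
    miou = 100.0 * sum(ious) / len(ious)
    recalls = {f"R@{t}": 100.0 * sum(iou >= t for iou in ious) / len(ious) for t in thresholds}
    return miou, recalls
```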
Co-training with the temporal anomaly grounding task (TAG) yields the strongest classification gains among the RFT variants, particularly w/ think, underscoring the importance of temporal context.

| Model | Bin. Acc. (w/o think) | Multi Acc. (w/o think) | Bin. Acc. (w/ think) | Multi Acc. (w/ think) |
|---|---|---|---|---|
| Baseline (Qwen2.5-VL-3B-Instruct) | 62.77 | 47.96 | 59.33 | 39.06 |
| + SFT w/ CLS | 81.12 | 29.08 | 83.37 | 32.19 |
| + RFT w/ CLS | 60.30 | 46.14 | 59.01 | 42.27 |
| + RFT w/ QA | 59.01 | 46.14 | 58.91 | 41.95 |
| + RFT w/ TAG | 67.81 | 49.46 | 74.14 | 46.14 |
| + RFT w/ QA-TAG | 65.77 | 47.53 | 67.60 | 45.06 |
| + RFT w/ QA-TAG-CLS | 64.70 | 48.61 | 65.02 | 45.60 |
Ablation study of task co-training for anomaly classification. Bin. Acc. = binary accuracy (normal vs. abnormal); Multi Acc. = multi-class accuracy across 19 anomaly types plus the normal class.
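As a rough illustration, the two accuracies above can be computed from the same set of predictions by collapsing the 19 anomaly types into a single "abnormal" label for the binary score; treating "normal" as the literal class name is an assumption.

```python
# Sketch of the two metrics in the ablation: multi-class accuracy over 19 anomaly
# types plus "normal", and binary accuracy after collapsing anomalies to "abnormal".
# Assumes the normal class is literally labeled "normal".

def classification_accuracies(preds: list[str], golds: list[str]) -> tuple[float, float]:
    def to_binary(label: str) -> str:
        return "normal" if label.strip().lower() == "normal" else "abnormal"

    n = len(golds)
    multi = 100.0 * sum(p.strip().lower() == g.strip().lower() for p, g in zip(preds, golds)) / n
    binary = 100.0 * sum(to_binary(p) == to_binary(g) for p, g in zip(preds, golds)) / n
    return binary, multi
```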
Qualitative case of the QA task. The correct answer is highlighted in orange. RFT yields more precise, interpretable QA choices, while SFT's output is less informative.
Qualitative case of the TAG task. The ground truth is highlighted in orange. RFT yields more precise anomaly intervals, while SFT's output is inaccurate.
Qualitative case of the Anomaly Analysis task. Correct descriptions and analyses are highlighted in orange. VAU-R1 identifies the anomaly with high fluency, though it omits reasoning about the core event. SFT's output is less accurate and tends to repeat itself.
An explosion case in an outdoor backyard, highlighting complex anomaly detection and dynamic scene understanding. The clip is labeled with a question-answer pair, key visual evidence, anomaly type, and a multi-part reasoning chain covering location, cause-effect, and a high-level conclusion.
An example of a stealing incident, demonstrating capabilities in human activity recognition and intent analysis.
A normal scene, used to evaluate model robustness against false positives and to enhance dataset diversity.
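Putting the examples above together, each VAU-Bench clip carries a question-answer pair, key visual evidence, an anomaly type, a temporal interval, and a multi-part reasoning chain. The dataclass below is an illustrative sketch of such an annotation record; the field names are assumptions, not the released file schema.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative sketch of a VAU-Bench-style annotation record based on the examples
# above. Field names are assumptions, not the released file schema.

@dataclass
class VAUAnnotation:
    video_id: str
    anomaly_type: str                                # one of 19 anomaly types, or "normal"
    anomaly_interval: Optional[tuple[float, float]]  # start/end in seconds; None for normal clips
    question: str                                    # multiple-choice question
    options: list[str]
    answer: str                                      # correct option
    key_evidence: str                                # key visual evidence
    reasoning: dict[str, str]                        # e.g. location, cause-effect, conclusion
```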
@misc{zhu2025vaur1,
title={VAU-R1: Advancing Video Anomaly Understanding via Reinforcement Fine-Tuning},
author={Liyun Zhu and Qixiang Chen and Xi Shen and Xiaodong Cun},
year={2025},
eprint={2505.23504},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2505.23504},
}