VAU-R1: Advancing Video Anomaly Understanding via
Reinforcement Fine-Tuning

1Australian National University     2GVC Lab, Great Bay University     3Intellindust
VAU-R1 Teaser Image

💡 Motivation

Video Anomaly Understanding (VAU) is vital for real-world applications such as smart cities, security surveillance, and disaster response. Unlike conventional video analysis tasks, VAU demands fine-grained inspection and multi-step reasoning in complex, dynamic, and often unpredictable environments. While early research focused primarily on binary classification (normal vs. abnormal) and temporal localization, such methods offer limited interpretability and provide little understanding of the underlying causes of anomalies.

Recent advances in Multimodal Large Language Models (MLLMs) have improved transparency by generating textual descriptions of anomalous events. However, four major challenges remain:

  1. Lack of coherent, multi-step reasoning chains to explain anomalies
  2. Unclear strategies for decomposing complex VAU tasks into effective subtasks for model supervision
  3. Absence of rich, annotated benchmarks that support structured causal reasoning
  4. Insufficient evaluation protocols to assess the quality of visual reasoning

To move beyond shallow detection and enable deeper, interpretable anomaly understanding, we ask a central question: What types of tasks (or reasoning pathways) can help enhance the understanding of anomalies? To explore this, we introduce a framework built around three parallel training tasks, which can be trained individually or jointly, and evaluate their effectiveness through a dedicated anomaly analysis task.

  • Perception: Identify the scene and relevant entities through guided multiple-choice questions
  • Grounding: Accurately localize the temporal segment where the anomaly occurs
  • Classification: Categorize the anomaly based on observed evidence (e.g., fighting, robbery)
  • Analysis: Evaluate how well the model can explain why the anomaly occurred through structured causal reasoning

This decomposition enables models to acquire diverse reasoning skills, build deeper semantic understanding, and generate more interpretable outputs aligned with each task, ultimately supporting more robust and explainable video anomaly analysis.
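As a concrete illustration of this decomposition, the sketch below shows how a single annotated video could be turned into four task-specific prompts. The schema, field names, and prompt wording are illustrative assumptions, not the released VAU-Bench format.

```python
# Illustrative decomposition of one annotated video into the four VAU tasks.
# The schema and prompt wording are assumptions, not the released VAU-Bench format.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class VAUSample:
    video_path: str
    question: str                       # perception: guided multiple-choice question
    options: List[str]                  # candidate answers
    answer: str                         # correct option letter, e.g. "B"
    anomaly_span: Tuple[float, float]   # grounding: (start, end) in seconds
    anomaly_class: str                  # classification: e.g. "Fighting", "Robbery"
    rationale: str                      # analysis: step-by-step causal explanation


def build_task_prompts(sample: VAUSample) -> dict:
    """Turn one annotation into four task-specific prompts."""
    mcq = "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(sample.options))
    return {
        "perception": f"{sample.question}\n{mcq}\nAnswer with the option letter only.",
        "grounding": "Report the start and end time (in seconds) of the anomalous segment.",
        "classification": "Which anomaly category best describes this video? Answer with the class name.",
        "analysis": "Explain step by step what happens in the video and why it is anomalous.",
    }
```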

πŸ” Key Contributions

  • VAU-R1: A data-efficient Reinforcement Fine-Tuning (RFT) framework that enhances the multi-step reasoning capabilities of Multimodal Large Language Models (MLLMs) for video anomaly understanding. It is optimized using task-specific reward signals and consistently outperforms supervised fine-tuning across VAU tasks.
  • VAU-Bench: The first benchmark tailored for Chain-of-Thought video anomaly reasoning. It includes 4,600+ videos with diverse anomaly types and provides rich annotations such as multiple-choice QA, temporal intervals, anomaly classes, and step-by-step reasoning rationales.
  • Unified Evaluation Protocol: A comprehensive evaluation framework that assesses models along four key dimensions: perception accuracy, temporal grounding precision, classification correctness, and reasoning quality, capturing both interpretability and analytical depth.
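As a rough sketch of what the task-specific reward signals in RFT could look like, the snippet below combines an assumed <think>/<answer> format check with an exact-match reward for QA and classification and a temporal-IoU reward for grounding. The template, weighting, and function names are our assumptions, not the exact rewards used in VAU-R1.

```python
import re


def format_reward(response: str) -> float:
    """1.0 if the response follows a <think>...</think><answer>...</answer> template (assumed), else 0.0."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.search(pattern, response, flags=re.DOTALL) else 0.0


def accuracy_reward(pred: str, gold: str) -> float:
    """Exact-match reward for multiple-choice QA and anomaly classification."""
    return 1.0 if pred.strip().lower() == gold.strip().lower() else 0.0


def iou_reward(pred_span, gold_span) -> float:
    """Temporal IoU reward for anomaly grounding; spans are (start, end) in seconds."""
    (ps, pe), (gs, ge) = pred_span, gold_span
    intersection = max(0.0, min(pe, ge) - max(ps, gs))
    union = max(pe, ge) - min(ps, gs)
    return intersection / union if union > 0 else 0.0


def total_reward(task: str, response: str, pred, gold, format_weight: float = 0.5) -> float:
    """Combine a format term with a task-specific correctness term (weighting is an assumption)."""
    task_term = iou_reward(pred, gold) if task == "grounding" else accuracy_reward(pred, gold)
    return format_weight * format_reward(response) + task_term
```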

Statistics of our VAU-Bench. (a) Distribution of main anomaly types. (b) Distribution of video durations (top) and the proportion of anomalous segments within each video (bottom). (c) The evaluation criteria for four VAU tasks.

📊 Results

We evaluate VAU-R1 on the VAU-Bench dataset, which encompasses diverse real-world scenarios from MSAD, UCF-Crime, and ECVA. The evaluation focuses on four key tasks: multiple-choice question answering (QA), temporal anomaly grounding, anomaly classification, and anomaly analysis.

🧠 Multiple-Choice QA

🧩 Anomaly Analysis

| Dataset | Model | Acc (w/o think) | Acc (w/ think) | CLS | KM | FLU | INF | FAC | Total |
|---|---|---|---|---|---|---|---|---|---|
| MSAD | InternVL2.5-2B | 76.67 | 72.08 | 6.84 | 6.23 | 8.55 | 6.64 | 6.64 | 34.90 |
| MSAD | Qwen2.5-VL-7B | 84.58 | 83.33 | 6.75 | 6.41 | 9.27 | 7.74 | 6.92 | 37.08 |
| MSAD | InternVL2.5-8B-MPO | 82.50 | 84.17 | 6.83 | 6.33 | 8.32 | 6.37 | 6.86 | 34.72 |
| MSAD | Qwen2-VL-2B | 77.08 | 72.50 | 5.94 | 5.43 | 8.77 | 6.29 | 5.90 | 32.25 |
| MSAD | + SFT | 82.92 | 85.83 | 6.04 | 5.43 | 8.89 | 6.55 | 5.93 | 32.84 |
| MSAD | + RFT | 82.92 (↑5.84) | 83.75 (↑11.25) | 6.05 ↑ | 5.49 ↑ | 8.89 | 6.50 ↑ | 6.05 ↑ | 32.98 ↑ |
| MSAD | Qwen2.5-VL-3B | 85.83 | 82.50 | 5.77 | 5.24 | 9.02 | 6.74 | 5.70 | 32.47 |
| MSAD | + SFT | 86.25 | 84.58 | 2.89 | 2.22 | 4.89 | 3.52 | 2.44 | 15.96 |
| MSAD | + RFT | 88.33 (↑2.50) | 87.08 (↑4.58) | 5.97 ↑ | 5.49 ↑ | 9.05 ↑ | 6.84 ↑ | 6.03 ↑ | 33.38 ↑ |
| UCF-Crime | InternVL2.5-2B | 84.86 | 68.13 | 4.40 | 3.08 | 8.09 | 5.69 | 3.47 | 24.74 |
| UCF-Crime | Qwen2.5-VL-7B | 92.03 | 89.64 | 4.80 | 3.73 | 8.95 | 7.05 | 4.25 | 28.78 |
| UCF-Crime | InternVL2.5-8B-MPO | 89.64 | 90.44 | 3.79 | 3.20 | 8.23 | 5.77 | 3.48 | 24.47 |
| UCF-Crime | Qwen2-VL-2B | 87.25 | 83.67 | 3.47 | 2.48 | 7.75 | 4.49 | 2.82 | 21.02 |
| UCF-Crime | + SFT | 83.67 | 86.06 | 3.61 | 2.26 | 7.30 | 4.79 | 2.70 | 20.66 |
| UCF-Crime | + RFT | 88.45 (↑1.20) | 88.05 (↑4.38) | 4.04 ↑ | 2.75 ↑ | 7.72 ↓ | 4.89 ↑ | 3.11 ↑ | 22.52 ↑ |
| UCF-Crime | Qwen2.5-VL-3B | 91.63 | 83.27 | 4.31 | 2.88 | 8.70 | 5.95 | 3.27 | 25.10 |
| UCF-Crime | + SFT | 90.84 | 90.44 | 1.80 | 1.01 | 4.15 | 2.82 | 1.11 | 10.89 |
| UCF-Crime | + RFT | 92.03 (↑0.40) | 91.63 (↑8.36) | 4.42 ↑ | 2.98 ↑ | 8.71 ↑ | 5.98 ↑ | 3.39 ↑ | 25.49 ↑ |

Comparison of performance on the MSAD and UCF-Crime datasets for the multiple-choice QA and anomaly analysis tasks. The first two result columns report QA accuracy (%) for inference without and with Chain-of-Thought ("think") prompting; CLS, KM, FLU, INF, and FAC are VAU-Eval dimensions scored from 0 to 10, and Total is their sum. Arrows mark changes of the fine-tuned models relative to their corresponding base model.
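For context, "w/ think" inference asks the model to reason step by step before committing to an answer, whereas "w/o think" requests the answer directly. A minimal sketch of the two prompting modes and answer extraction follows; the wording is an illustrative assumption rather than the exact prompts used in the experiments.

```python
# Minimal sketch of "w/ think" vs. "w/o think" prompting; wording is an assumption.
import re


def build_qa_prompt(question: str, think: bool) -> str:
    """Build a multiple-choice QA prompt with or without a Chain-of-Thought instruction."""
    if think:
        return (
            f"{question}\n"
            "First reason step by step inside <think></think>, "
            "then give only the option letter inside <answer></answer>."
        )
    return f"{question}\nAnswer directly with only the option letter."


def extract_answer(response: str, think: bool) -> str:
    """Pull the final option letter out of the model response."""
    if think:
        match = re.search(r"<answer>\s*([A-D])", response)
        return match.group(1) if match else ""
    match = re.search(r"[A-D]", response)
    return match.group(0) if match else ""
```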

πŸ“ Temporal Anomaly Grounding

| Dataset | Model | w/o think: mIoU | R@0.3 | R@0.5 | R@0.7 | w/ think: mIoU | R@0.3 | R@0.5 | R@0.7 |
|---|---|---|---|---|---|---|---|---|---|
| MSAD | Qwen2-VL-2B | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| MSAD | Qwen2.5-VL-7B | 45.90 | 70.83 | 45.83 | 21.67 | 17.57 | 26.67 | 11.67 | 3.33 |
| MSAD | Qwen2.5-VL-3B | 21.27 | 30.00 | 10.83 | 4.17 | 13.00 | 16.67 | 5.83 | 1.67 |
| MSAD | + SFT | 30.65 | 47.50 | 30.00 | 9.17 | 35.17 | 50.83 | 34.17 | 15.00 |
| MSAD | + RFT | 35.77 (↑14.50) | 53.33 | 34.17 | 15.83 | 30.70 (↑17.70) | 48.33 | 29.17 | 12.50 |
| ECVA | Qwen2-VL-2B | 0.00 | 0.00 | 0.00 | 0.00 | 0.17 | 0.30 | 0.00 | 0.00 |
| ECVA | Qwen2.5-VL-7B | 19.85 | 25.87 | 15.17 | 9.70 | 5.71 | 7.96 | 4.73 | 2.99 |
| ECVA | Qwen2.5-VL-3B | 14.21 | 17.16 | 6.47 | 3.23 | 6.35 | 7.21 | 1.99 | 0.50 |
| ECVA | + SFT | 45.30 | 66.67 | 49.75 | 24.13 | 45.96 | 65.67 | 51.00 | 26.12 |
| ECVA | + RFT | 35.09 (↑20.88) | 49.00 | 28.86 | 19.40 | 33.25 (↑26.90) | 48.51 | 30.60 | 18.41 |
| UCF-Crime (OOD) | Qwen2-VL-2B | 2.74 | 4.84 | 0.00 | 0.00 | 0.12 | 0.00 | 0.00 | 0.00 |
| UCF-Crime (OOD) | Qwen2.5-VL-7B | 22.72 | 33.87 | 16.13 | 8.06 | 4.89 | 8.06 | 1.61 | 0.00 |
| UCF-Crime (OOD) | Qwen2.5-VL-3B | 10.91 | 15.32 | 6.45 | 3.23 | 7.68 | 10.48 | 4.84 | 1.61 |
| UCF-Crime (OOD) | + SFT | 4.98 | 3.23 | 0.81 | 0.00 | 5.76 | 5.65 | 0.81 | 0.81 |
| UCF-Crime (OOD) | + RFT | 16.80 (↑5.89) | 23.39 | 8.06 | 4.03 | 9.21 (↑1.53) | 9.68 | 4.03 | 1.61 |

Comparison of temporal anomaly grounding performance on three datasets: MSAD, ECVA, and UCF-Crime. The left block of columns reports results without "think" prompting and the right block with it. All models are trained only on MSAD and ECVA, while UCF-Crime is treated as an out-of-distribution (OOD) test set to assess cross-dataset generalization. Arrows mark mIoU changes of the fine-tuned models relative to their corresponding base model.
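The grounding metrics are standard: mIoU is the mean temporal IoU between predicted and ground-truth intervals over all test videos, and R@k is the fraction of videos whose prediction reaches an IoU of at least k. A minimal sketch of how they can be computed:

```python
# Minimal sketch of the grounding metrics (mIoU and Recall@k), reported as percentages.
from typing import Dict, List, Sequence, Tuple

Span = Tuple[float, float]  # (start, end) in seconds


def temporal_iou(pred: Span, gold: Span) -> float:
    """IoU between a predicted and a ground-truth temporal interval."""
    intersection = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = max(pred[1], gold[1]) - min(pred[0], gold[0])
    return intersection / union if union > 0 else 0.0


def grounding_metrics(preds: Sequence[Span], golds: Sequence[Span],
                      thresholds: Sequence[float] = (0.3, 0.5, 0.7)) -> Dict[str, float]:
    """Return mIoU and R@k over paired predictions and ground truths."""
    ious: List[float] = [temporal_iou(p, g) for p, g in zip(preds, golds)]
    metrics = {"mIoU": 100.0 * sum(ious) / len(ious)}
    for t in thresholds:
        metrics[f"R@{t}"] = 100.0 * sum(iou >= t for iou in ious) / len(ious)
    return metrics
```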

🧪 Task Co-Training for Anomaly Classification

| Model | Bin. Acc. (w/o think) | Multi Acc. (w/o think) | Bin. Acc. (w/ think) | Multi Acc. (w/ think) |
|---|---|---|---|---|
| Baseline (Qwen2.5-VL-3B-Instruct) | 62.77 | 47.96 | 59.33 | 39.06 |
| + SFT w/ CLS | 81.12 | 29.08 | 83.37 | 32.19 |
| + RFT w/ CLS | 60.30 | 46.14 | 59.01 | 42.27 |
| + RFT w/ QA | 59.01 | 46.14 | 58.91 | 41.95 |
| + RFT w/ TAG | 67.81 | 49.46 | 74.14 | 46.14 |
| + RFT w/ QA-TAG | 65.77 | 47.53 | 67.60 | 45.06 |
| + RFT w/ QA-TAG-CLS | 64.70 | 48.61 | 65.02 | 45.60 |

Ablation study of task co-training for anomaly classification. Bin. Acc. = binary accuracy (normal vs. abnormal); Multi Acc. = multi-class accuracy across 19 anomaly types plus the normal class. QA, TAG, and CLS denote the multiple-choice QA, temporal anomaly grounding, and anomaly classification training tasks, respectively.
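For clarity, binary accuracy only checks agreement on normal vs. abnormal (any of the 19 anomaly types counts as abnormal), while multi-class accuracy requires the exact class label. A minimal sketch, assuming string labels with "normal" marking non-anomalous videos:

```python
# Minimal sketch of the two accuracy metrics, assuming string labels with "normal"
# marking non-anomalous videos and 19 anomaly class names otherwise.
from typing import Sequence, Tuple


def classification_accuracies(preds: Sequence[str], golds: Sequence[str],
                              normal_label: str = "normal") -> Tuple[float, float]:
    """Return (binary accuracy, multi-class accuracy) in percent."""
    n = len(golds)
    binary_hits = sum((p == normal_label) == (g == normal_label) for p, g in zip(preds, golds))
    multi_hits = sum(p == g for p, g in zip(preds, golds))
    return 100.0 * binary_hits / n, 100.0 * multi_hits / n
```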

🔑 Key Insights

✏️ Case Study

Case studies are shown for three tasks: multiple-choice QA, temporal anomaly grounding, and anomaly analysis.

📚 Dataset Examples

Example videos: Explosion, Stealing, and Normal.

BibTeX

@misc{zhu2025vaur1,
      title={VAU-R1: Advancing Video Anomaly Understanding via Reinforcement Fine-Tuning}, 
      author={Liyun Zhu and Qixiang Chen and Xi Shen and Xiaodong Cun},
      year={2025},
      eprint={2505.23504},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.23504}, 
}