VAU-R1: Advancing Video Anomaly Understanding via
Reinforcement Fine-Tuning

¹Australian National University · ²GVC Lab, Great Bay University · ³Intellindust
VAU-R1 Teaser Image

💡 Motivation

Video Anomaly Understanding (VAU) involves detecting and interpreting irregular events, such as fighting or theft, in unstructured, real-world video. Early methods treat the task as binary classification (normal vs. abnormal), which offers limited interpretability and cannot explain why an anomaly occurs. Recent progress in Multimodal Large Language Models (MLLMs) has improved transparency by generating textual descriptions, yet current approaches still face key challenges:

  1. No coherent, multi-step reasoning to explain anomalies
  2. Lack of benchmarks with rich annotations for causal reasoning
  3. Underdeveloped evaluation protocols for reasoning quality

To move beyond shallow binary classification and enable deeper understanding, we decompose Video Anomaly Understanding (VAU) into four progressive reasoning stages:

  1. Perception: Identify the scene and relevant objects via free-text or multiple-choice questions.
  2. Grounding: Localize the temporal segment where the anomaly occurs.
  3. Reasoning: Explain the event based on causal relationships, temporal context, and scene dynamics.
  4. Conclusion: Make a final decision, e.g., classify the anomaly type as fighting or robbery.
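
To make the decomposition concrete, here is a minimal sketch of how one annotated sample could be represented across the four stages. The field names (e.g., `anomaly_span`, `rationale`) are illustrative assumptions, not the exact VAU-Bench schema.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class VAUSample:
    """One annotated video, organized by the four reasoning stages.
    Field names are illustrative, not the exact VAU-Bench schema."""
    video_id: str
    # Perception: scene/object question (free-text or multiple-choice)
    question: str
    choices: List[str]
    answer: str
    # Grounding: anomalous segment in seconds (None for normal videos)
    anomaly_span: Optional[Tuple[float, float]]
    # Reasoning: chain-of-thought rationale (cause, context, dynamics)
    rationale: str
    # Conclusion: final label, e.g. "fighting", "robbery", or "normal"
    anomaly_type: str
```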
Results Overview

Effectiveness of Reinforcement Fine-Tuning. We compare QA accuracy and temporal anomaly grounding performance across different models. VAU-R1, trained via Reinforcement Fine-Tuning (RFT), consistently outperforms its Supervised Fine-Tuning (SFT) counterpart. This demonstrates that RFT enhances both reasoning and temporal localization capabilities in VAU tasks.
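
To illustrate what Reinforcement Fine-Tuning means here: rather than imitating reference rationales token by token as in SFT, the policy is optimized against verifiable rewards on its own sampled responses. Below is a minimal, hypothetical reward of the kind used in GRPO-style fine-tuning (an answer-correctness term plus a format term); the exact reward design and response template used by VAU-R1 may differ.

```python
import re

def format_reward(response: str) -> float:
    """1.0 if the response follows a <think>...</think><answer>...</answer>
    template, else 0.0. (Hypothetical template; actual training tags may differ.)"""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.fullmatch(pattern, response.strip(), flags=re.DOTALL) else 0.0

def accuracy_reward(response: str, gold_choice: str) -> float:
    """1.0 if the option letter inside <answer>...</answer> matches the gold choice."""
    match = re.search(r"<answer>\s*([A-D])", response)
    return 1.0 if match and match.group(1) == gold_choice.upper() else 0.0

def qa_reward(response: str, gold_choice: str) -> float:
    """Verifiable reward for the multiple-choice QA task:
    correctness plus a smaller weight on output format."""
    return accuracy_reward(response, gold_choice) + 0.5 * format_reward(response)
```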

πŸ” Key Contributions

  • VAU-R1: a data-efficient Reinforcement Fine-Tuning framework that improves the reasoning ability of MLLMs for video anomaly understanding. It outperforms standard supervised fine-tuning on reasoning-intensive tasks.
  • VAU-Bench: The first large-scale benchmark with Chain-of-Thought annotations designed for video anomaly reasoning. It contains a diverse collection of videos, QA pairs, temporal labels, and detailed rationales spanning a wide range of real-world scenarios.
  • Unified Evaluation: A structured protocol that measures model performance across four reasoning stages, jointly considering reasoning quality, answer correctness, and temporal localization to capture both interpretability and detection precision.

Statistics of our VAU-Bench. (a) Distribution of main anomaly types. (b) Distribution of video durations (top) and the proportion of anomalous segments within each video (bottom). (c) The evaluation criteria for four VAU tasks.

📊 Results

We evaluate VAU-R1 on the VAU-Bench dataset, which encompasses diverse real-world scenarios from MSAD, UCF-Crime, and ECVA. The evaluation focuses on four key tasks: multiple-choice question answering (QA), temporal anomaly grounding, anomaly reasoning, and anomaly classification.

🧠 Multi-choice QA & QA-Guided Reasoning

| Dataset | Model | QA Acc. (w/o think) | QA Acc. (w/ think) | CLS | KM | FLU | INF | FAC | Total |
|---|---|---|---|---|---|---|---|---|---|
| MSAD | InternVL2.5-2B | 76.67 | 72.08 | 6.84 | 6.23 | 8.55 | 6.64 | 6.64 | 34.90 |
| | Qwen2.5-VL-7B | 84.58 | 83.33 | 6.75 | 6.41 | 9.27 | 7.74 | 6.92 | 37.08 |
| | InternVL2.5-8B-MPO | 82.50 | 84.17 | 6.83 | 6.33 | 8.32 | 6.37 | 6.86 | 34.72 |
| | Qwen2-VL-2B | 77.08 | 72.50 | 5.94 | 5.43 | 8.77 | 6.29 | 5.90 | 32.25 |
| | + SFT | 82.92 | 85.83 | 6.04 | 5.43 | 8.89 | 6.55 | 5.93 | 32.84 |
| | + RFT | 82.92 (↑5.84) | 83.75 (↑11.25) | 6.05 ↑ | 5.49 ↑ | 8.89 | 6.50 ↑ | 6.05 ↑ | 32.98 ↑ |
| | Qwen2.5-VL-3B | 85.83 | 82.50 | 5.77 | 5.24 | 9.02 | 6.74 | 5.70 | 32.47 |
| | + SFT | 86.25 | 84.58 | 2.89 | 2.22 | 4.89 | 3.52 | 2.44 | 15.96 |
| | + RFT | 88.33 (↑2.50) | 87.08 (↑4.58) | 5.97 ↑ | 5.49 ↑ | 9.05 ↑ | 6.84 ↑ | 6.03 ↑ | 33.38 ↑ |
| UCF-Crime | InternVL2.5-2B | 84.86 | 68.13 | 4.40 | 3.08 | 8.09 | 5.69 | 3.47 | 24.74 |
| | Qwen2.5-VL-7B | 92.03 | 89.64 | 4.80 | 3.73 | 8.95 | 7.05 | 4.25 | 28.78 |
| | InternVL2.5-8B-MPO | 89.64 | 90.44 | 3.79 | 3.20 | 8.23 | 5.77 | 3.48 | 24.47 |
| | Qwen2-VL-2B | 87.25 | 83.67 | 3.47 | 2.48 | 7.75 | 4.49 | 2.82 | 21.02 |
| | + SFT | 83.67 | 86.06 | 3.61 | 2.26 | 7.30 | 4.79 | 2.70 | 20.66 |
| | + RFT | 88.45 (↑1.20) | 88.05 (↑4.38) | 4.04 ↑ | 2.75 ↑ | 7.72 ↓ | 4.89 ↑ | 3.11 ↑ | 22.52 ↑ |
| | Qwen2.5-VL-3B | 91.63 | 83.27 | 4.31 | 2.88 | 8.70 | 5.95 | 3.27 | 25.10 |
| | + SFT | 90.84 | 90.44 | 1.80 | 1.01 | 4.15 | 2.82 | 1.11 | 10.89 |
| | + RFT | 92.03 (↑0.40) | 91.63 (↑8.36) | 4.42 ↑ | 2.98 ↑ | 8.71 ↑ | 5.98 ↑ | 3.39 ↑ | 25.49 ↑ |

Comparison on the MSAD and UCF-Crime datasets for the multiple-choice QA and anomaly reasoning tasks. QA accuracy (%) is reported for inference with and without Chain-of-Thought ("think") prompting. CLS, KM, FLU, INF, and FAC are VAU-Eval reasoning scores on a 0–10 scale; Total is their sum. Arrows mark changes relative to the corresponding base model.
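
For reference, the "w/o think" and "w/ think" settings differ only in whether the prompt asks the model to produce an explicit reasoning trace before its answer. The sketch below shows one plausible way to build the two prompt variants; the exact wording and tags are assumptions, not the prompts used in the paper.

```python
def build_prompt(question: str, choices: list[str], think: bool) -> str:
    """Build a multiple-choice prompt in either inference mode."""
    options = "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(choices))
    if think:
        # "w/ think": ask for explicit reasoning before the final answer
        suffix = ("First reason step by step inside <think>...</think>, "
                  "then give only the option letter inside <answer>...</answer>.")
    else:
        # "w/o think": ask for the option letter only
        suffix = "Answer with only the option letter."
    return f"{question}\n{options}\n{suffix}"
```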

πŸ“ Temporal Anomaly Grounding

| Dataset | Model | mIoU (w/o think) | R@0.3 | R@0.5 | R@0.7 | mIoU (w/ think) | R@0.3 | R@0.5 | R@0.7 |
|---|---|---|---|---|---|---|---|---|---|
| MSAD | Qwen2-VL-2B | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| | Qwen2.5-VL-7B | 45.90 | 70.83 | 45.83 | 21.67 | 17.57 | 26.67 | 11.67 | 3.33 |
| | Qwen2.5-VL-3B | 21.27 | 30.00 | 10.83 | 4.17 | 13.00 | 16.67 | 5.83 | 1.67 |
| | + SFT | 30.65 | 47.50 | 30.00 | 9.17 | 35.17 | 50.83 | 34.17 | 15.00 |
| | + RFT | 35.77 (↑14.50) | 53.33 | 34.17 | 15.83 | 30.70 (↑17.70) | 48.33 | 29.17 | 12.50 |
| ECVA | Qwen2-VL-2B | 0.00 | 0.00 | 0.00 | 0.00 | 0.17 | 0.30 | 0.00 | 0.00 |
| | Qwen2.5-VL-7B | 19.85 | 25.87 | 15.17 | 9.70 | 5.71 | 7.96 | 4.73 | 2.99 |
| | Qwen2.5-VL-3B | 14.21 | 17.16 | 6.47 | 3.23 | 6.35 | 7.21 | 1.99 | 0.50 |
| | + SFT | 45.30 | 66.67 | 49.75 | 24.13 | 45.96 | 65.67 | 51.00 | 26.12 |
| | + RFT | 35.09 (↑20.88) | 49.00 | 28.86 | 19.40 | 33.25 (↑26.90) | 48.51 | 30.60 | 18.41 |
| UCF-Crime (OOD) | Qwen2-VL-2B | 2.74 | 4.84 | 0.00 | 0.00 | 0.12 | 0.00 | 0.00 | 0.00 |
| | Qwen2.5-VL-7B | 22.72 | 33.87 | 16.13 | 8.06 | 4.89 | 8.06 | 1.61 | 0.00 |
| | Qwen2.5-VL-3B | 10.91 | 15.32 | 6.45 | 3.23 | 7.68 | 10.48 | 4.84 | 1.61 |
| | + SFT | 4.98 | 3.23 | 0.81 | 0.00 | 5.76 | 5.65 | 0.81 | 0.81 |
| | + RFT | 16.80 (↑5.89) | 23.39 | 8.06 | 4.03 | 9.21 (↑1.53) | 9.68 | 4.03 | 1.61 |

Comparison of temporal anomaly grounding performance on three datasets: MSAD, ECVA, and UCF-Crime. The first four metric columns are reported without "think" prompting and the last four with it. All models are trained only on MSAD and ECVA, while UCF-Crime serves as an out-of-distribution (OOD) test set to assess cross-dataset generalization.
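
The grounding metrics above are standard: mIoU averages the temporal intersection-over-union between each predicted segment and its ground-truth segment, and R@K is the fraction of videos whose IoU reaches the threshold K. A minimal sketch of both computations, assuming one predicted (start, end) pair per video and values reported as percentages to match the table:

```python
def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """IoU between a predicted and a ground-truth temporal segment (seconds)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def grounding_metrics(preds, gts, thresholds=(0.3, 0.5, 0.7)):
    """mIoU and Recall@K (in %) over paired (start, end) predictions and labels."""
    ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    miou = 100 * sum(ious) / len(ious)
    recall = {k: 100 * sum(iou >= k for iou in ious) / len(ious) for k in thresholds}
    return miou, recall
```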

🧪 Task Co-Training for Anomaly Classification

| Model | Bin. Acc. (w/o think) | Multi Acc. (w/o think) | Bin. Acc. (w/ think) | Multi Acc. (w/ think) |
|---|---|---|---|---|
| Baseline (Qwen2.5-VL-3B-Instruct) | 62.77 | 47.96 | 59.33 | 39.06 |
| + SFT w/ CLS | 81.12 | 29.08 | 83.37 | 32.19 |
| + RFT w/ CLS | 60.30 | 46.14 | 59.01 | 42.27 |
| + RFT w/ QA | 59.01 | 46.14 | 58.91 | 41.95 |
| + RFT w/ TAG | 67.81 | 49.46 | 74.14 | 46.14 |
| + RFT w/ QA-TAG | 65.77 | 47.53 | 67.60 | 45.06 |
| + RFT w/ QA-TAG-CLS | 64.70 | 48.61 | 65.02 | 45.60 |

Ablation study of task co-training for anomaly classification. Bin. Acc. = binary accuracy (normal vs. abnormal); Multi Acc. = multi-class accuracy across 19 anomaly types plus the normal class. QA = multiple-choice question answering, TAG = temporal anomaly grounding, CLS = anomaly classification.
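
As a small clarification of the two accuracy columns: binary accuracy collapses every anomaly type into a single "abnormal" label before comparing with the ground truth, while multi-class accuracy requires the exact class. A minimal sketch (label strings are illustrative):

```python
def classification_accuracy(preds: list[str], labels: list[str]) -> tuple[float, float]:
    """Binary and multi-class accuracy (in %) for anomaly classification.
    Labels are class names such as "fighting", "robbery", or "normal"."""
    def to_binary(c: str) -> str:
        return "normal" if c == "normal" else "abnormal"
    n = len(labels)
    multi = 100 * sum(p == l for p, l in zip(preds, labels)) / n
    binary = 100 * sum(to_binary(p) == to_binary(l) for p, l in zip(preds, labels)) / n
    return binary, multi
```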

🔑 Key Insights

✏️ Case Study

Case studies: Multi-choice QA · Temporal Anomaly Grounding · Anomaly Reasoning

📚 Dataset Examples

Example categories: Explosion · Stealing · Normal

BibTeX

@misc{zhu2025vaur1,
      title={VAU-R1: Advancing Video Anomaly Understanding via Reinforcement Fine-Tuning}, 
      author={Liyun Zhu and Qixiang Chen and Xi Shen and Xiaodong Cun},
      year={2025},
      eprint={2505.23504},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.23504}, 
}