Video Anomaly Understanding (VAU) is vital for real-world applications such as smart cities, security surveillance, and disaster response. Unlike conventional video analysis tasks, VAU demands fine-grained inspection and multi-step reasoning in complex, dynamic, and often unpredictable environments. While early research focused primarily on binary classification (normal vs. abnormal) and temporal localization, such methods offer limited interpretability and provide little understanding of the underlying causes of anomalies.
Recent advances in Multimodal Large Language Models (MLLMs) have improved transparency by generating textual descriptions of anomalous events. However, several major challenges remain.
To move beyond shallow detection and enable deeper, interpretable anomaly understanding, we ask a central question: What types of tasks (or reasoning pathways) can help enhance the understanding of anomalies? To explore this, we introduce a framework consisting of three parallel training tasks. These tasks are trained either individually or jointly, and their effectiveness is evaluated through a dedicated anomaly analysis task.
This decomposition enables models to acquire diverse reasoning skills, build deeper semantic understanding, and generate more interpretable outputs aligned with each task, ultimately supporting more robust and explainable video anomaly analysis.
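To make the co-training setup concrete, the sketch below shows how verifiable, task-specific rewards could be combined during reinforcement fine-tuning (RFT): an exact-match reward for QA and classification, a temporal-IoU reward for grounding, and a shared format reward. The function names and the 0.5 weighting are illustrative assumptions, not the exact implementation.

```python
# Illustrative sketch of task-specific rewards for reinforcement fine-tuning (RFT).
# The reward functions and the 0.5 weighting below are assumptions for illustration,
# not the exact VAU-R1 implementation.

def format_reward(response: str) -> float:
    """Reward outputs that wrap their reasoning in <think></think> tags."""
    return 1.0 if "<think>" in response and "</think>" in response else 0.0

def choice_reward(pred: str, gold: str) -> float:
    """Exact-match reward for multiple-choice QA and anomaly classification."""
    return 1.0 if pred.strip().lower() == gold.strip().lower() else 0.0

def interval_reward(pred: tuple[float, float], gold: tuple[float, float]) -> float:
    """Temporal IoU reward for anomaly grounding (predicted vs. ground-truth interval)."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0

def total_reward(task: str, response: str, pred, gold) -> float:
    """Combine the task-specific reward with a shared format reward."""
    task_r = interval_reward(pred, gold) if task == "TAG" else choice_reward(pred, gold)
    return task_r + 0.5 * format_reward(response)
```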
Statistics of our VAU-Bench. (a) Distribution of main anomaly types. (b) Distribution of video durations (top) and the proportion of anomalous segments within each video (bottom). (c) The evaluation criteria for four VAU tasks.
We evaluate VAU-R1 on the VAU-Bench dataset, which encompasses diverse real-world scenarios from MSAD, UCF-Crime, and ECVA. The evaluation focuses on four key tasks: multiple-choice question answering (QA), temporal anomaly grounding, anomaly classification, and anomaly analysis.
Base models often score lower with `<think>` prompts, indicating that unstructured reasoning introduces noise. In contrast, RFT improves both Acc (w/o think) and Acc (w/ think) over the corresponding base models, while maintaining output structure.

| Dataset | Model | Acc (w/o think) | Acc (w/ think) | CLS | KM | FLU | INF | FAC | Total |
|---|---|---|---|---|---|---|---|---|---|
| MSAD | InternVL2.5-2B | 76.67 | 72.08 | 6.84 | 6.23 | 8.55 | 6.64 | 6.64 | 34.90 |
| | Qwen2.5-VL-7B | 84.58 | 83.33 | 6.75 | 6.41 | 9.27 | 7.74 | 6.92 | 37.08 |
| | InternVL2.5-8B-MPO | 82.50 | 84.17 | 6.83 | 6.33 | 8.32 | 6.37 | 6.86 | 34.72 |
| | Qwen2-VL-2B | 77.08 | 72.50 | 5.94 | 5.43 | 8.77 | 6.29 | 5.90 | 32.25 |
| | + SFT | 82.92 | 85.83 | 6.04 | 5.43 | 8.89 | 6.55 | 5.93 | 32.84 |
| | + RFT | 82.92 ↑5.84 | 83.75 ↑11.25 | 6.05 ↑ | 5.49 ↑ | 8.89 | 6.50 ↑ | 6.05 ↑ | 32.98 ↑ |
| | Qwen2.5-VL-3B | 85.83 | 82.50 | 5.77 | 5.24 | 9.02 | 6.74 | 5.70 | 32.47 |
| | + SFT | 86.25 | 84.58 | 2.89 | 2.22 | 4.89 | 3.52 | 2.44 | 15.96 |
| | + RFT | 88.33 ↑2.50 | 87.08 ↑4.58 | 5.97 ↑ | 5.49 ↑ | 9.05 ↑ | 6.84 ↑ | 6.03 ↑ | 33.38 ↑ |
| UCF-Crime | InternVL2.5-2B | 84.86 | 68.13 | 4.40 | 3.08 | 8.09 | 5.69 | 3.47 | 24.74 |
| | Qwen2.5-VL-7B | 92.03 | 89.64 | 4.80 | 3.73 | 8.95 | 7.05 | 4.25 | 28.78 |
| | InternVL2.5-8B-MPO | 89.64 | 90.44 | 3.79 | 3.20 | 8.23 | 5.77 | 3.48 | 24.47 |
| | Qwen2-VL-2B | 87.25 | 83.67 | 3.47 | 2.48 | 7.75 | 4.49 | 2.82 | 21.02 |
| | + SFT | 83.67 | 86.06 | 3.61 | 2.26 | 7.30 | 4.79 | 2.70 | 20.66 |
| | + RFT | 88.45 ↑1.20 | 88.05 ↑4.38 | 4.04 ↑ | 2.75 ↑ | 7.72 ↑ | 4.89 ↑ | 3.11 ↑ | 22.52 ↑ |
| | Qwen2.5-VL-3B | 91.63 | 83.27 | 4.31 | 2.88 | 8.70 | 5.95 | 3.27 | 25.10 |
| | + SFT | 90.84 | 90.44 | 1.80 | 1.01 | 4.15 | 2.82 | 1.11 | 10.89 |
| | + RFT | 92.03 ↑0.40 | 91.63 ↑8.36 | 4.42 ↑ | 2.98 ↑ | 8.71 ↑ | 5.98 ↑ | 3.39 ↑ | 25.49 ↑ |
Comparison of performance on the MSAD and UCF-Crime datasets for the multiple-choice QA task and the anomaly analysis task. QA accuracy is reported for inference with and without Chain-of-Thought ("think") prompts; CLS, KM, FLU, INF, and FAC are VAU-Eval dimension scores (0-10), with Total as their sum. Arrows (↑) mark improvements of RFT over the corresponding base model.
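For reference, the "w/o think" and "w/ think" settings differ only in the inference prompt. The snippet below is a minimal sketch of the two modes, assuming an R1-style template that places reasoning in `<think>` tags and the final answer in `<answer>` tags; the exact instruction wording is an assumption.

```python
import re

# Minimal sketch of the two inference modes compared above. Only the
# <think>/<answer> structure is assumed; the instruction wording is illustrative.

def build_prompt(question: str, with_think: bool) -> str:
    if with_think:
        return (
            f"{question}\n"
            "First reason about the video inside <think></think> tags, "
            "then give the final answer inside <answer></answer> tags."
        )
    return f"{question}\nAnswer with the option letter only."

def parse_answer(response: str) -> str:
    """Extract the final answer, stripping the reasoning block if present."""
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    return match.group(1).strip() if match else response.strip()
```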
| Dataset | Model | mIoU (w/o think) | R@0.3 (w/o think) | R@0.5 (w/o think) | R@0.7 (w/o think) | mIoU (w/ think) | R@0.3 (w/ think) | R@0.5 (w/ think) | R@0.7 (w/ think) |
|---|---|---|---|---|---|---|---|---|---|
| MSAD | Qwen2-VL-2B | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| | Qwen2.5-VL-7B | 45.90 | 70.83 | 45.83 | 21.67 | 17.57 | 26.67 | 11.67 | 3.33 |
| | Qwen2.5-VL-3B | 21.27 | 30.00 | 10.83 | 4.17 | 13.00 | 16.67 | 5.83 | 1.67 |
| | + SFT | 30.65 | 47.50 | 30.00 | 9.17 | 35.17 | 50.83 | 34.17 | 15.00 |
| | + RFT | 35.77 ↑14.50 | 53.33 | 34.17 | 15.83 | 30.70 ↑17.70 | 48.33 | 29.17 | 12.50 |
| ECVA | Qwen2-VL-2B | 0.00 | 0.00 | 0.00 | 0.00 | 0.17 | 0.30 | 0.00 | 0.00 |
| | Qwen2.5-VL-7B | 19.85 | 25.87 | 15.17 | 9.70 | 5.71 | 7.96 | 4.73 | 2.99 |
| | Qwen2.5-VL-3B | 14.21 | 17.16 | 6.47 | 3.23 | 6.35 | 7.21 | 1.99 | 0.50 |
| | + SFT | 45.30 | 66.67 | 49.75 | 24.13 | 45.96 | 65.67 | 51.00 | 26.12 |
| | + RFT | 35.09 ↑20.88 | 49.00 | 28.86 | 19.40 | 33.25 ↑26.90 | 48.51 | 30.60 | 18.41 |
| UCF-Crime (OOD) | Qwen2-VL-2B | 2.74 | 4.84 | 0.00 | 0.00 | 0.12 | 0.00 | 0.00 | 0.00 |
| | Qwen2.5-VL-7B | 22.72 | 33.87 | 16.13 | 8.06 | 4.89 | 8.06 | 1.61 | 0.00 |
| | Qwen2.5-VL-3B | 10.91 | 15.32 | 6.45 | 3.23 | 7.68 | 10.48 | 4.84 | 1.61 |
| | + SFT | 4.98 | 3.23 | 0.81 | 0.00 | 5.76 | 5.65 | 0.81 | 0.81 |
| | + RFT | 16.80 ↑5.89 | 23.39 | 8.06 | 4.03 | 9.21 ↑1.53 | 9.68 | 4.03 | 1.61 |
Comparison of temporal anomaly grounding performance on three datasets: MSAD, ECVA, and UCF-Crime. All models are trained only on MSAD and ECVA, while UCF-Crime serves as an out-of-distribution (OOD) test set to assess cross-dataset generalization. Arrows (↑) mark mIoU gains of RFT over the base Qwen2.5-VL-3B.
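The grounding metrics above follow the standard temporal-localization definitions: the temporal IoU between the predicted and ground-truth anomalous interval, averaged over videos (mIoU) or thresholded at 0.3/0.5/0.7 (R@k). A minimal sketch:

```python
# Standard temporal grounding metrics used above: mean temporal IoU (mIoU) and
# Recall@k, the fraction of videos whose predicted interval reaches IoU >= k.

def temporal_iou(pred: tuple[float, float], gold: tuple[float, float]) -> float:
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0

def grounding_metrics(preds, golds, thresholds=(0.3, 0.5, 0.7)):
    ious = [temporal_iou(p, g) for p, g in zip(preds, golds)]
    miou = 100.0 * sum(ious) / len(ious)
    recalls = {f"R@{t}": 100.0 * sum(iou >= t for iou in ious) / len(ious) for t in thresholds}
    return miou, recalls
```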
Co-training with the temporal anomaly grounding task (TAG) yields the strongest classification gains among the RFT variants, particularly w/ think, underscoring the importance of temporal context.

| Model | Bin. Acc. (w/o think) | Multi Acc. (w/o think) | Bin. Acc. (w/ think) | Multi Acc. (w/ think) |
|---|---|---|---|---|
| Baseline (Qwen2.5-VL-3B-Instruct) | 62.77 | 47.96 | 59.33 | 39.06 |
| + SFT w/ CLS | 81.12 | 29.08 | 83.37 | 32.19 |
| + RFT w/ CLS | 60.30 | 46.14 | 59.01 | 42.27 |
| + RFT w/ QA | 59.01 | 46.14 | 58.91 | 41.95 |
| + RFT w/ TAG | 67.81 | 49.46 | 74.14 | 46.14 |
| + RFT w/ QA-TAG | 65.77 | 47.53 | 67.60 | 45.06 |
| + RFT w/ QA-TAG-CLS | 64.70 | 48.61 | 65.02 | 45.60 |
Ablation study of task co-training for anomaly classification. Bin. Acc. = binary accuracy (normal vs. abnormal); Multi Acc. = multi-class accuracy across 19 anomaly types plus the normal class.
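As a rough illustration, the two accuracies above can be computed from the same set of predictions by collapsing the 19 anomaly types into a single "abnormal" label for the binary score; treating "normal" as the literal class name is an assumption.

```python
# Sketch of the two metrics in the ablation: multi-class accuracy over 19 anomaly
# types plus "normal", and binary accuracy after collapsing anomalies to "abnormal".
# Assumes the normal class is literally labeled "normal".

def classification_accuracies(preds: list[str], golds: list[str]) -> tuple[float, float]:
    def to_binary(label: str) -> str:
        return "normal" if label.strip().lower() == "normal" else "abnormal"

    n = len(golds)
    multi = 100.0 * sum(p.strip().lower() == g.strip().lower() for p, g in zip(preds, golds)) / n
    binary = 100.0 * sum(to_binary(p) == to_binary(g) for p, g in zip(preds, golds)) / n
    return binary, multi
```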
Qualitative case of the QA task. The correct answer is highlighted in orange. RFT yields more precise, interpretable QA choices, while SFT's output is less informative.
Qualitative case of the TAG task. The ground truth is highlighted in orange. RFT yields more precise anomaly intervals, while SFT's output is inaccurate.
Qualitative case of the Anomaly Analysis task. Correct descriptions and analyses are highlighted in orange. VAU-R1 identifies the anomaly with high fluency, though it omits reasoning about the core event. SFT's output is less accurate and tends to repeat itself.
An explosion case in an outdoor backyard, highlighting complex anomaly detection and dynamic scene understanding. The clip is labeled with a question-answer pair, key visual evidence, anomaly type, and a multi-part reasoning chain covering location, cause-effect, and a high-level conclusion.
An example of a stealing incident, demonstrating capabilities in human activity recognition and intent analysis.
A normal scene, used to evaluate model robustness against false positives and to enhance dataset diversity.
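Putting the examples above together, each VAU-Bench clip carries a question-answer pair, key visual evidence, an anomaly type, a temporal interval, and a multi-part reasoning chain. The dataclass below is an illustrative sketch of such an annotation record; the field names are assumptions, not the released file schema.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative sketch of a VAU-Bench-style annotation record based on the examples
# above. Field names are assumptions, not the released file schema.

@dataclass
class VAUAnnotation:
    video_id: str
    anomaly_type: str                                # one of 19 anomaly types, or "normal"
    anomaly_interval: Optional[tuple[float, float]]  # start/end in seconds; None for normal clips
    question: str                                    # multiple-choice question
    options: list[str]
    answer: str                                      # correct option
    key_evidence: str                                # key visual evidence
    reasoning: dict[str, str]                        # e.g. location, cause-effect, conclusion
```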
@misc{zhu2025vaur1,
title={VAU-R1: Advancing Video Anomaly Understanding via Reinforcement Fine-Tuning},
author={Liyun Zhu and Qixiang Chen and Xi Shen and Xiaodong Cun},
year={2025},
eprint={2505.23504},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2505.23504},
}