Video Anomaly Understanding (VAU) involves detecting and interpreting irregular events, such as fighting or theft, in unstructured, real-world video. While early methods treat this as a binary classification task (normal vs. abnormal), they offer limited interpretability and fail to explain why anomalies occur. Recent progress in Multimodal Large Language Models (MLLMs) has improved transparency by generating textual descriptions. However, current approaches still face key challenges in reasoning quality and evaluation.
To move beyond shallow binary classification and enable deeper understanding, we decompose Video Anomaly Understanding (VAU) into four progressive reasoning stages: multiple-choice question answering (QA), temporal anomaly grounding (TAG), anomaly reasoning, and anomaly classification (CLS).
Effectiveness of Reinforcement Fine-Tuning. We compare QA accuracy and temporal anomaly grounding performance across different models. VAU-R1, trained via Reinforcement Fine-Tuning (RFT), consistently outperforms its Supervised Fine-Tuning (SFT) counterpart. This demonstrates that RFT enhances both reasoning and temporal localization capabilities in VAU tasks.
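Unlike SFT, which imitates reference answers, RFT optimizes rule-based, verifiable rewards over sampled completions. The snippet below is a minimal, hypothetical sketch of such a reward for the multiple-choice QA task, assuming an R1-style output template with `<think>`/`<answer>` tags; the function names and the 0.5 format weight are illustrative assumptions, not the paper's exact implementation.

```python
import re

def format_reward(completion: str) -> float:
    """1.0 if the completion follows the <think>...</think><answer>...</answer> template."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.fullmatch(pattern, completion.strip(), re.DOTALL) else 0.0

def accuracy_reward(completion: str, gold_choice: str) -> float:
    """1.0 if the option letter extracted from <answer> matches the ground truth."""
    match = re.search(r"<answer>\s*\(?([A-D])\)?", completion)
    return 1.0 if match and match.group(1) == gold_choice else 0.0

def qa_reward(completion: str, gold_choice: str) -> float:
    # Scalar reward the policy is trained to maximize, e.g. with a
    # GRPO-style group-relative advantage over sampled completions.
    return accuracy_reward(completion, gold_choice) + 0.5 * format_reward(completion)
```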
Statistics of our VAU-Bench. (a) Distribution of main anomaly types. (b) Distribution of video durations (top) and the proportion of anomalous segments within each video (bottom). (c) The evaluation criteria for four VAU tasks.
We evaluate VAU-R1 on the VAU-Bench dataset, which encompasses diverse real-world scenarios from MSAD, UCF-Crime, and ECVA. The evaluation focuses on four key tasks: multiple-choice question answering (QA), temporal anomaly grounding, anomaly reasoning, and anomaly classification.
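Throughout the tables, "w/o think" and "w/ think" refer to prompting the model to answer directly versus asking it to reason before answering. A rough sketch of the two prompt styles is shown below; the exact wording is an assumption and may differ from the paper's prompts.

```python
def build_prompt(question: str, think: bool) -> str:
    """Build a direct-answer or reason-then-answer prompt (illustrative wording)."""
    if think:
        return (
            f"{question}\n"
            "First reason step by step inside <think></think>, then give "
            "your final answer inside <answer></answer>."
        )
    return f"{question}\nAnswer with the option letter directly."

print(build_prompt("Which anomaly occurs in the video?", think=True))
```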
Several base models achieve lower accuracy with thinking (Acc w/ think) than without (Acc w/o think), suggesting that unguided Chain-of-Thought generation may introduce hallucinations.

| Dataset | Model | QA Acc. w/o think (%) | QA Acc. w/ think (%) | CLS | KM | FLU | INF | FAC | Total |
|---|---|---|---|---|---|---|---|---|---|
| MSAD | InternVL2.5-2B | 76.67 | 72.08 | 6.84 | 6.23 | 8.55 | 6.64 | 6.64 | 34.90 |
| | Qwen2.5-VL-7B | 84.58 | 83.33 | 6.75 | 6.41 | 9.27 | 7.74 | 6.92 | 37.08 |
| | InternVL2.5-8B-MPO | 82.50 | 84.17 | 6.83 | 6.33 | 8.32 | 6.37 | 6.86 | 34.72 |
| | Qwen2-VL-2B | 77.08 | 72.50 | 5.94 | 5.43 | 8.77 | 6.29 | 5.90 | 32.25 |
| | + SFT | 82.92 | 85.83 | 6.04 | 5.43 | 8.89 | 6.55 | 5.93 | 32.84 |
| | + RFT | 82.92 ↑5.84 | 83.75 ↑11.25 | 6.05 ↑ | 5.49 ↑ | 8.89 | 6.50 | 6.05 ↑ | 32.98 ↑ |
| | Qwen2.5-VL-3B | 85.83 | 82.50 | 5.77 | 5.24 | 9.02 | 6.74 | 5.70 | 32.47 |
| | + SFT | 86.25 | 84.58 | 2.89 | 2.22 | 4.89 | 3.52 | 2.44 | 15.96 |
| | + RFT | 88.33 ↑2.50 | 87.08 ↑4.58 | 5.97 ↑ | 5.49 ↑ | 9.05 ↑ | 6.84 ↑ | 6.03 ↑ | 33.38 ↑ |
| UCF-Crime | InternVL2.5-2B | 84.86 | 68.13 | 4.40 | 3.08 | 8.09 | 5.69 | 3.47 | 24.74 |
| | Qwen2.5-VL-7B | 92.03 | 89.64 | 4.80 | 3.73 | 8.95 | 7.05 | 4.25 | 28.78 |
| | InternVL2.5-8B-MPO | 89.64 | 90.44 | 3.79 | 3.20 | 8.23 | 5.77 | 3.48 | 24.47 |
| | Qwen2-VL-2B | 87.25 | 83.67 | 3.47 | 2.48 | 7.75 | 4.49 | 2.82 | 21.02 |
| | + SFT | 83.67 | 86.06 | 3.61 | 2.26 | 7.30 | 4.79 | 2.70 | 20.66 |
| | + RFT | 88.45 ↑1.20 | 88.05 ↑4.38 | 4.04 ↑ | 2.75 ↑ | 7.72 | 4.89 ↑ | 3.11 ↑ | 22.52 ↑ |
| | Qwen2.5-VL-3B | 91.63 | 83.27 | 4.31 | 2.88 | 8.70 | 5.95 | 3.27 | 25.10 |
| | + SFT | 90.84 | 90.44 | 1.80 | 1.01 | 4.15 | 2.82 | 1.11 | 10.89 |
| | + RFT | 92.03 ↑0.40 | 91.63 ↑8.36 | 4.42 ↑ | 2.98 ↑ | 8.71 ↑ | 5.98 ↑ | 3.39 ↑ | 25.49 ↑ |
Comparison on the MSAD and UCF-Crime datasets for the multiple-choice QA and anomaly reasoning tasks. Results are reported for inference with and without Chain-of-Thought ("think") prompts. CLS, KM, FLU, INF, and FAC are VAU-Eval dimension scores (0-10 each); Total is their sum. ↑ marks gains from RFT; accuracy deltas are relative to the corresponding base model.
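As a concrete check of how the VAU-Eval columns relate, the sketch below aggregates the five dimension scores into the Total column, using the Qwen2.5-VL-3B + RFT row on MSAD. The dataclass itself is an illustrative assumption; the actual judging protocol is defined by VAU-Bench.

```python
from dataclasses import dataclass

@dataclass
class VAUEvalScore:
    """Five VAU-Eval dimension scores, each on a 0-10 scale, named as in the table."""
    cls_score: float  # CLS
    km: float         # KM
    flu: float        # FLU
    inf: float        # INF
    fac: float        # FAC

    @property
    def total(self) -> float:
        # Total is the sum of the five dimensions (max 50).
        return self.cls_score + self.km + self.flu + self.inf + self.fac

# The Qwen2.5-VL-3B + RFT row on MSAD from the table above.
row = VAUEvalScore(5.97, 5.49, 9.05, 6.84, 6.03)
assert abs(row.total - 33.38) < 1e-6
```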
| Dataset | Model | mIoU (w/o think) | R@0.3 | R@0.5 | R@0.7 | mIoU (w/ think) | R@0.3 | R@0.5 | R@0.7 |
|---|---|---|---|---|---|---|---|---|---|
| MSAD | Qwen2-VL-2B | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| | Qwen2.5-VL-7B | 45.90 | 70.83 | 45.83 | 21.67 | 17.57 | 26.67 | 11.67 | 3.33 |
| | Qwen2.5-VL-3B | 21.27 | 30.00 | 10.83 | 4.17 | 13.00 | 16.67 | 5.83 | 1.67 |
| | + SFT | 30.65 | 47.50 | 30.00 | 9.17 | 35.17 | 50.83 | 34.17 | 15.00 |
| | + RFT | 35.77 ↑14.50 | 53.33 | 34.17 | 15.83 | 30.70 ↑17.70 | 48.33 | 29.17 | 12.50 |
| ECVA | Qwen2-VL-2B | 0.00 | 0.00 | 0.00 | 0.00 | 0.17 | 0.30 | 0.00 | 0.00 |
| | Qwen2.5-VL-7B | 19.85 | 25.87 | 15.17 | 9.70 | 5.71 | 7.96 | 4.73 | 2.99 |
| | Qwen2.5-VL-3B | 14.21 | 17.16 | 6.47 | 3.23 | 6.35 | 7.21 | 1.99 | 0.50 |
| | + SFT | 45.30 | 66.67 | 49.75 | 24.13 | 45.96 | 65.67 | 51.00 | 26.12 |
| | + RFT | 35.09 ↑20.88 | 49.00 | 28.86 | 19.40 | 33.25 ↑26.90 | 48.51 | 30.60 | 18.41 |
| UCF-Crime (OOD) | Qwen2-VL-2B | 2.74 | 4.84 | 0.00 | 0.00 | 0.12 | 0.00 | 0.00 | 0.00 |
| | Qwen2.5-VL-7B | 22.72 | 33.87 | 16.13 | 8.06 | 4.89 | 8.06 | 1.61 | 0.00 |
| | Qwen2.5-VL-3B | 10.91 | 15.32 | 6.45 | 3.23 | 7.68 | 10.48 | 4.84 | 1.61 |
| | + SFT | 4.98 | 3.23 | 0.81 | 0.00 | 5.76 | 5.65 | 0.81 | 0.81 |
| | + RFT | 16.80 ↑5.89 | 23.39 | 8.06 | 4.03 | 9.21 ↑1.53 | 9.68 | 4.03 | 1.61 |
Comparison of temporal anomaly grounding performance on MSAD, ECVA, and UCF-Crime. All models are trained only on MSAD and ECVA, while UCF-Crime serves as an out-of-distribution (OOD) test set to assess cross-dataset generalization. ↑ deltas are relative to the corresponding base model.
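For reference, the grounding metrics above can be computed as follows. This is a minimal sketch: mean IoU between predicted and ground-truth anomaly intervals, plus R@k, the fraction of videos whose IoU reaches a threshold. The `(start_sec, end_sec)` interval format is an assumption for illustration.

```python
def interval_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """1-D IoU between two time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def grounding_metrics(preds, gts, thresholds=(0.3, 0.5, 0.7)):
    """Return mIoU and R@k (both in percent) over paired interval lists."""
    ious = [interval_iou(p, g) for p, g in zip(preds, gts)]
    miou = 100 * sum(ious) / len(ious)
    recall = {t: 100 * sum(iou >= t for iou in ious) / len(ious) for t in thresholds}
    return miou, recall

# Example: one partially localized and one entirely missed anomaly interval.
miou, recall = grounding_metrics([(2.0, 8.0), (0.0, 3.0)],
                                 [(4.0, 10.0), (10.0, 15.0)])
# miou == 25.0, recall == {0.3: 50.0, 0.5: 50.0, 0.7: 0.0}
```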
Among the RFT variants, co-training with temporal anomaly grounding (RFT w/ TAG) achieves the best classification accuracy, especially w/ think, underscoring the importance of temporal context.

| Model | Bin. Acc. (w/o think) | Multi Acc. (w/o think) | Bin. Acc. (w/ think) | Multi Acc. (w/ think) |
|---|---|---|---|---|
| Baseline (Qwen2.5-VL-3B-Instruct) | 62.77 | 47.96 | 59.33 | 39.06 |
| + SFT w/ CLS | 81.12 | 29.08 | 83.37 | 32.19 |
| + RFT w/ CLS | 60.30 | 46.14 | 59.01 | 42.27 |
| + RFT w/ QA | 59.01 | 46.14 | 58.91 | 41.95 |
| + RFT w/ TAG | 67.81 | 49.46 | 74.14 | 46.14 |
| + RFT w/ QA-TAG | 65.77 | 47.53 | 67.60 | 45.06 |
| + RFT w/ QA-TAG-CLS | 64.70 | 48.61 | 65.02 | 45.60 |
Ablation study of task co-training for anomaly classification. Bin. Acc. = binary accuracy (normal vs. abnormal); Multi Acc. = multi-class accuracy across 19 anomaly types plus the normal class.
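The two accuracy metrics relate as sketched below: multi-class accuracy scores exact matches over the 19 anomaly types plus "normal", while binary accuracy collapses every anomaly type to "abnormal" first. Label strings and function names here are illustrative assumptions.

```python
def classification_accuracies(preds: list[str], gts: list[str]) -> tuple[float, float]:
    """Return (binary accuracy, multi-class accuracy) over paired label lists."""
    multi = sum(p == g for p, g in zip(preds, gts)) / len(gts)
    to_bin = lambda label: "normal" if label == "normal" else "abnormal"
    binary = sum(to_bin(p) == to_bin(g) for p, g in zip(preds, gts)) / len(gts)
    return binary, multi

# Example: the "Fighting" prediction is wrong at the multi-class level
# but still counts as a correct binary (abnormal) detection.
binary, multi = classification_accuracies(
    ["Fighting", "normal", "Explosion"],
    ["Robbery",  "normal", "Explosion"],
)
assert (binary, multi) == (1.0, 2 / 3)
```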
Qualitative case of the QA task. The correct answer is highlighted in orange. RFT yields more precise, interpretable QA choices, while SFT's output is less informative.
Qualitative case of the TAG task. The ground-truth is highlighted in orange. RFT yields more precise anomaly intervals, while SFT's output is inaccurate.
Qualitative case of the Anomaly Reasoning task. Correct descriptions and analyses are highlighted in orange. VAU-R1 identifies the anomaly with high fluency, though it omits reasoning about the core event. SFT's output is less accurate and tends to repeat itself.
An explosion case in an outdoor backyard, highlighting complex anomaly detection and dynamic scene understanding. The clip is labeled with a question-answer pair, key visual evidence, anomaly type, and a multi-part reasoning chain covering location, cause-effect, and a high-level conclusion.
An example of a stealing incident, demonstrating capabilities in human activity recognition and intent analysis.
A normal scene, used to evaluate model robustness against false positives and to enhance dataset diversity.
@misc{zhu2025vaur1,
title={VAU-R1: Advancing Video Anomaly Understanding via Reinforcement Fine-Tuning},
author={Liyun Zhu and Qixiang Chen and Xi Shen and Xiaodong Cun},
year={2025},
eprint={2505.23504},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2505.23504},
}