Overview of the motion prompt layer. A learnable Power Normalization (PN) function \( f(\cdot) \) modulates each frame differencing map \( \mathbf{D} \), controlling how motion is enhanced or dampened to highlight relevant movements. The resulting attention maps are multiplied element-wise (\(\odot\)) with the original video frames to produce video motion prompts. We introduce a temporal attention variation regularization term that yields smoother attention maps and thus better motion prompts. The layer can be inserted between the video input and a backbone such as TimeSformer, serving as an adapter. Training optimizes both the motion prompt layer and the backbone with a generic loss function, e.g., cross-entropy, plus the new regularization term.
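A minimal PyTorch sketch of this pipeline, assuming a sigmoid-style PN with learnable slope and shift and a squared-difference form of the temporal attention variation regularizer; the names `MotionPromptLayer` and `temporal_attention_variation` and all default hyperparameters are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class MotionPromptLayer(nn.Module):
    """Sketch of a motion prompt layer: frame differencing -> learnable PN
    -> attention maps -> element-wise modulation of the input frames."""

    def __init__(self, init_slope: float = 1.0, init_shift: float = 0.0):
        super().__init__()
        # Learnable slope and shift of the PN function (assumed sigmoid form).
        self.slope = nn.Parameter(torch.tensor(init_slope))
        self.shift = nn.Parameter(torch.tensor(init_shift))

    def forward(self, video: torch.Tensor):
        # video: (B, T, C, H, W), values assumed normalized.
        gray = video.mean(dim=2)                    # (B, T, H, W) grayscale proxy
        diff = gray[:, 1:] - gray[:, :-1]           # frame differencing maps D
        attn = torch.sigmoid(self.slope * (diff - self.shift))  # f(D)
        # Element-wise modulation of the original frames -> motion prompts.
        prompts = video[:, 1:] * attn.unsqueeze(2)  # (B, T-1, C, H, W)
        return prompts, attn

def temporal_attention_variation(attn: torch.Tensor) -> torch.Tensor:
    # Penalize abrupt changes between consecutive attention maps
    # (a plausible form of the temporal attention variation regularizer).
    return (attn[:, 1:] - attn[:, :-1]).pow(2).mean()

# Usage (backbone, labels, and lam are placeholders):
#   prompts, attn = layer(video)
#   loss = cross_entropy(backbone(prompts), labels) \
#          + lam * temporal_attention_variation(attn)
```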
We compare existing PN functions with our PN for motion modulation. The first two columns show consecutive video frames, and the third column displays 3D surface plots of the corresponding frame differencing maps. The fourth to seventh columns show the output attentions, as both attention maps and 3D surface plots, for Gamma, MaxExp, SigmE, and AsinhE. The last column shows the output of our PN function, which adapts to different motions across video types, such as human actions, fine-grained actions, static and moving cameras, and anomaly detection. For UCF-Crime, we use the slope and shift learned on MPII Cooking 2, as both datasets are captured by static cameras.
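For reference, a hedged sketch of the four baseline PN functions, assuming their standard element-wise forms; the hyperparameter values below are illustrative, not the ones used in the figure.

```python
import math
import torch

def gamma_pn(p: torch.Tensor, gamma: float = 0.5) -> torch.Tensor:
    # Gamma: sign-preserving power function, sign(p) * |p|^gamma.
    return p.sign() * p.abs().pow(gamma)

def maxexp_pn(p: torch.Tensor, eta: float = 20.0) -> torch.Tensor:
    # MaxExp: 1 - (1 - p)^eta, for p clamped to [0, 1].
    return 1.0 - (1.0 - p.clamp(0.0, 1.0)).pow(eta)

def sigme_pn(p: torch.Tensor, eta: float = 20.0) -> torch.Tensor:
    # SigmE: zero-centered sigmoid, 2 / (1 + e^(-eta * p)) - 1.
    return 2.0 / (1.0 + torch.exp(-eta * p)) - 1.0

def asinhe_pn(p: torch.Tensor, gamma: float = 10.0) -> torch.Tensor:
    # AsinhE: asinh(gamma * p) / asinh(gamma).
    return torch.asinh(gamma * p) / math.asinh(gamma)
```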
Visualizations include the original consecutive frames (first two columns), frame differencing maps (third column), and pairs of attention maps and motion prompts for Gamma, MaxExp, SigmE, and AsinhE (fourth to eleventh columns). The last two columns display our attention maps and motion prompts. Our attention maps (i) depict clear motion regions and (ii) highlight motions of interest and/or the contextual environments relevant to those motions, and our motion prompts capture rich motion patterns. Existing PN functions focus only on motion, often capturing noisy patterns without emphasizing context.
@inproceedings{chen2024motion,
  title={Motion meets Attention: Video Motion Prompts},
  author={Qixiang Chen and Lei Wang and Piotr Koniusz and Tom Gedeon},
  booktitle={The 16th Asian Conference on Machine Learning (Conference Track)},
  year={2024},
  url={https://openreview.net/forum?id=nIDAT99Vhb}
}