Abstract
This work proposes UIL-AQA, an uncertainty-aware and clip-level interpretable framework for long-term Action Quality Assessment (AQA). AQA evaluates the execution quality of actions in videos; however, the complexity and diversity of actions, especially in long videos, make the task considerably harder. Existing AQA methods typically sidestep this difficulty by restricting themselves to short-term videos. These approaches lack detailed semantic interpretation for individual clips and fail to account for the impact of human bias and subjectivity in the data during model training. Moreover, although query-based Transformer networks demonstrate strong capabilities in long-term modelling, their interpretability in AQA remains insufficient. This is primarily due to a phenomenon we identify, termed Temporal Skipping, in which the model skips self-attention layers to prevent output degradation. We introduce an Attention Loss function and a Query Initialization Module to enhance the modelling capability of query-based Transformer networks. Additionally, we incorporate a Gaussian Noise Injection Module that simulates biases in human scoring, mitigating the influence of uncertainty and improving model reliability. Furthermore, we propose a Difficulty-Quality Regression Module, which decomposes each clip's action score into independent difficulty and quality components, enabling a more fine-grained and interpretable evaluation. Extensive quantitative and qualitative analysis demonstrates that the proposed method achieves state-of-the-art performance on three long-term real-world AQA datasets. Our code is available at: GitHub Repository.
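
As a rough illustration of the noise-injection idea summarized above, the following PyTorch-style sketch perturbs target scores with zero-mean Gaussian noise at training time to mimic judge bias. The module name, the `sigma` hyperparameter, and the training-only gating are our own assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

class GaussianNoiseInjection(nn.Module):
    """Hypothetical sketch: perturb target scores with zero-mean Gaussian
    noise during training to simulate judge bias and subjectivity.
    `sigma` is an assumed hyperparameter, not taken from the paper."""

    def __init__(self, sigma: float = 0.1):
        super().__init__()
        self.sigma = sigma

    def forward(self, scores: torch.Tensor) -> torch.Tensor:
        if self.training:
            # Add i.i.d. Gaussian noise only at training time.
            return scores + torch.randn_like(scores) * self.sigma
        # Leave scores untouched at evaluation time.
        return scores
```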
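
Likewise, a minimal sketch of the difficulty-quality decomposition, assuming (as in judged sports such as diving) that a clip's score is the product of a predicted difficulty and a predicted execution quality. The two-head layout, the multiplicative combination, and the feature dimension are illustrative assumptions; the abstract only states that the score is decomposed into independent difficulty and quality components.

```python
import torch
import torch.nn as nn

class DifficultyQualityHead(nn.Module):
    """Hypothetical sketch: regress each clip's score as
    difficulty * quality so both factors remain inspectable."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.difficulty = nn.Linear(dim, 1)  # assumed per-clip difficulty head
        self.quality = nn.Linear(dim, 1)     # assumed per-clip quality head

    def forward(self, clip_feats: torch.Tensor):
        # clip_feats: (batch, num_clips, dim) per-clip query features.
        d = self.difficulty(clip_feats)
        q = torch.sigmoid(self.quality(clip_feats))  # quality bounded in [0, 1]
        return d * q, d, q  # clip score plus its interpretable components
```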