Abstract
Accurately recognising and localising complex actions in untrimmed, real-world videos remains a fundamental challenge in video understanding. It demands modelling fine-grained motion dynamics, capturing long-range temporal dependencies, and interpreting high-level semantics across multiple modalities. Despite remarkable progress in visual representation learning, existing methods often struggle to reason over complex motion, align visual features with linguistic meaning, and integrate complementary audio cues for context-aware understanding. This thesis addresses these challenges through a unified investigation of motion reasoning, language-grounded semantics, and multimodal integration, aiming to enhance the interpretability and precision of video understanding systems.
First, we propose MOFO (MOtion FOcused Self-Supervision for Video Understanding), a framework that explicitly models motion dynamics during self-supervised pretraining and finetuning for action recognition. MOFO automatically detects motion-sensitive regions using a motion map derived from optical flow derivatives, highlighting motion boundaries while reducing the impact of camera movement and background noise. A motion-guided masking strategy focuses learning on dynamic, action-relevant areas, encouraging the model to capture motion cues rather than static appearance. During finetuning, a multi-cross attention mechanism fuses embeddings from inside and outside the detected motion regions, improving temporal reasoning and contextual understanding. By integrating motion guidance across both pretraining and finetuning, MOFO produces interpretable, motion-aware representations and significantly enhances self-supervised action recognition performance.
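The motion map and motion-guided masking described above can be illustrated with a minimal NumPy sketch. It is not the MOFO implementation: the function names, the patch size, the masking ratio, and the rule of sampling masked patches in proportion to their motion score are all assumptions made for illustration. The sketch does show why flow derivatives suppress camera movement: a uniform pan has zero spatial gradient, whereas the boundary of a moving object does not.

```python
import numpy as np

def motion_map(flow):
    """Motion-boundary strength for a dense flow field of shape (H, W, 2).

    Uses the magnitude of the spatial derivatives of both flow components,
    so constant (camera-like) motion yields zero response.
    """
    u, v = flow[..., 0], flow[..., 1]
    du_dy, du_dx = np.gradient(u)  # derivatives along rows, then columns
    dv_dy, dv_dx = np.gradient(v)
    return np.sqrt(du_dx**2 + du_dy**2 + dv_dx**2 + dv_dy**2)

def motion_guided_mask(mmap, patch=4, mask_ratio=0.75, eps=1e-3, seed=0):
    """Pick patches to mask, favouring patches with strong motion boundaries.

    Hypothetical sampling rule: probability proportional to the mean motion
    score of each patch (plus eps so static patches can still be chosen).
    """
    h, w = mmap.shape
    gh, gw = h // patch, w // patch
    scores = mmap[:gh * patch, :gw * patch]
    scores = scores.reshape(gh, patch, gw, patch).mean(axis=(1, 3))
    probs = (scores + eps).ravel()
    probs /= probs.sum()
    n_masked = int(round(mask_ratio * gh * gw))
    rng = np.random.default_rng(seed)
    chosen = rng.choice(gh * gw, size=n_masked, replace=False, p=probs)
    mask = np.zeros(gh * gw, dtype=bool)
    mask[chosen] = True
    return mask.reshape(gh, gw)  # True marks a masked patch

# Toy example: a global pan plus a square object moving faster than the pan.
flow = np.zeros((16, 16, 2))
flow[..., 0] = 3.0
flow[4:12, 4:12, 0] += 2.0
mask = motion_guided_mask(motion_map(flow))
```

In this toy example the motion map is non-zero only around the edges of the square, so patches near the moving object are more likely to be masked than static corner patches.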
Next, we introduce FILS (Self-Supervised Video Feature Prediction in Semantic Language Space), a framework that enhances video representation learning by predicting features within a language-aligned semantic space. FILS first constructs this semantic space through contrastive learning between motion-relevant video patches and automatically generated textual descriptions, aligning dynamic visual regions with their corresponding language embeddings. It then performs feature prediction for the masked video patches within the learnt language space, allowing the model to capture high-level semantic meaning without pixel reconstruction. By combining motion-aware contrastive learning with language-guided feature prediction, FILS learns interpretable and transferable video representations that bridge visual motion dynamics with linguistic understanding.
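The two objectives described above can be sketched in NumPy as follows. This is an illustrative approximation rather than the FILS implementation: the symmetric InfoNCE form, the temperature value, the cosine-distance prediction loss, and all function names are assumptions, since the abstract does not specify them.

```python
import numpy as np

def l2_normalise(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def info_nce(video, text, temperature=0.07):
    """Symmetric contrastive loss; row i of `video` is paired with row i of `text`."""
    logits = l2_normalise(video) @ l2_normalise(text).T / temperature
    idx = np.arange(len(video))
    video_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    text_to_video = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (video_to_text + text_to_video)

def masked_feature_prediction_loss(pred, target, mask):
    """Mean cosine distance between predicted and target features of masked patches."""
    cosine = (l2_normalise(pred) * l2_normalise(target)).sum(axis=-1)
    return (1.0 - cosine)[mask].mean()

# Toy usage with random embeddings standing in for patch and text features.
rng = np.random.default_rng(0)
patch_emb = rng.normal(size=(8, 16))
text_emb = rng.normal(size=(8, 16))
contrastive = info_nce(patch_emb, text_emb)
```

The contrastive term pulls each motion-relevant patch embedding towards its paired text embedding and away from the others in the batch. The prediction term is evaluated only on masked patches, in feature space rather than pixel space.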
Finally, we present DEL (Dense Event Localisation for Multimodal Audio-Visual Understanding), a multimodal framework for precise and fine-grained event detection in long, untrimmed videos where events of different durations may overlap and occur asynchronously. Unlike MOFO and FILS, which focus on self-supervised representation learning, DEL employs supervised multimodal learning to model dense temporal structures and cross-modal dependencies in realistic video environments. It integrates visual and audio modalities through an adaptive cross-modal attention mechanism that aligns asynchronous cues and preserves temporal coherence. To improve feature discrimination and robustness, DEL introduces a score-based dual contrastive learning strategy that enhances intra- and inter-modal consistency. A hierarchical temporal fusion module further aggregates information across multiple temporal scales, enabling the model to capture both short-term motion details and long-range contextual dependencies. In combination, these components allow DEL to accurately detect overlapping and asynchronous audio–visual events while remaining computationally efficient.
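As a rough illustration of the cross-modal attention idea described above, the sketch below lets each visual time step attend over all audio time steps, so the two streams need not share a length or be synchronised. It is a generic single-head scaled dot-product attention without learnt projections, written under assumed names and shapes. It does not reproduce DEL's adaptive mechanism, its contrastive strategy, or its hierarchical temporal fusion.

```python
import numpy as np

def cross_modal_attention(visual, audio):
    """Fuse audio context into visual features.

    visual: (T_v, d) visual features, used as queries.
    audio:  (T_a, d) audio features, used as keys and values.
    Returns the fused features (T_v, d) and the attention weights (T_v, T_a).
    """
    d = visual.shape[-1]
    scores = visual @ audio.T / np.sqrt(d)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    attended = weights @ audio
    return visual + attended, weights  # residual connection keeps the visual signal

# Toy usage: 6 visual steps and 10 audio steps with 8-dimensional features.
rng = np.random.default_rng(0)
visual = rng.normal(size=(6, 8))
audio = rng.normal(size=(10, 8))
fused, weights = cross_modal_attention(visual, audio)
```

Each row of the weight matrix sums to one and indicates which audio steps a given visual step draws on, which is one simple way to relate cues that occur at different times.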
Together, these studies form a coherent progression toward comprehensive video understanding, evolving from motion perception to semantic reasoning and finally to multimodal temporal analysis. The results show that focusing on motion leads to stronger and more interpretable representations, grounding visual features in language enhances semantic understanding, and integrating audio and visual modalities enables precise and temporally coherent event localisation. Overall, these contributions advance unified and semantically grounded approaches to video understanding, bridging the gap between visual perception and high-level reasoning.