FILS: Self-Supervised Video Feature Prediction In Semantic Language Space
Conference proceeding   Open access   Peer reviewed


Mona Ahmadian, Frank Guerin and Andrew Gilbert
British Machine Vision Conference, 35 (Glasgow, 25/11/2024–28/11/2024)
2025

Abstract

This paper presents a self-supervised approach for learning semantic video representations. Recent vision studies show that combining masked prediction with natural-language supervision yields transferable visual pretraining. Our goal is to learn a more semantic video representation by leveraging text related to the video content during pretraining, in a fully self-supervised manner. To this end, we present FILS, a novel self-supervised approach for video Feature prediction In semantic Language Space. By correctly predicting the semantics of masked features in language space, the vision model can capture valuable structured information. FILS is trained with a patch-wise video-text contrastive strategy, in which text representations act as prototypes that transform vision features into a language space; these transformed features then serve as targets for semantically meaningful feature prediction by our masked encoder-decoder structure. FILS demonstrates remarkable transferability on downstream action recognition tasks, achieving state-of-the-art results on challenging egocentric datasets such as Epic-Kitchens, Something-SomethingV2, Charades-Ego, and EGTEA using ViT-Base. Our method is also efficient, requiring less computation and smaller batch sizes than previous work.
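
To make the mechanism concrete, below is a minimal, hypothetical PyTorch sketch of the idea the abstract describes: a masked encoder-decoder predicts features for masked video patches, and both predictions and targets are projected onto text prototypes before the loss is computed. Every name, dimension, and loss choice here (FILSSketch, num_text_prototypes, the KL objective) is an illustrative assumption based only on this abstract, not the authors' implementation.

```python
# Hypothetical sketch of FILS-style masked feature prediction in a
# semantic language space. All details are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FILSSketch(nn.Module):
    def __init__(self, dim=768, num_text_prototypes=512, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        # Stand-ins for the ViT encoder and a lightweight decoder (assumed sizes).
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=12, batch_first=True),
            num_layers=2)
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=12, batch_first=True),
            num_layers=1)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        # Text prototypes: in practice these would come from a text encoder;
        # here a learnable placeholder matrix of shape (K, dim).
        self.text_prototypes = nn.Parameter(torch.randn(num_text_prototypes, dim))

    def to_language_space(self, feats):
        # Project patch features onto normalized text prototypes, yielding
        # per-patch similarities over prototypes (the "language space").
        protos = F.normalize(self.text_prototypes, dim=-1)
        return F.normalize(feats, dim=-1) @ protos.t()

    def forward(self, patches, target_feats):
        # patches: (B, N, dim) video patch embeddings
        # target_feats: (B, N, dim) features from a target encoder (assumed)
        B, N, D = patches.shape
        num_keep = int(N * (1 - self.mask_ratio))
        idx = torch.rand(B, N, device=patches.device).argsort(dim=1)
        keep, masked = idx[:, :num_keep], idx[:, num_keep:]
        visible = torch.gather(patches, 1, keep.unsqueeze(-1).expand(-1, -1, D))
        enc = self.encoder(visible)
        # Append mask tokens and decode to predict features at masked positions.
        mask_tok = self.mask_token.expand(B, N - num_keep, D)
        dec = self.decoder(torch.cat([enc, mask_tok], dim=1))
        pred_masked = dec[:, num_keep:]
        # Targets: the masked patches' features, mapped into language space.
        tgt = torch.gather(target_feats, 1, masked.unsqueeze(-1).expand(-1, -1, D))
        pred_lang = self.to_language_space(pred_masked)
        tgt_lang = self.to_language_space(tgt).detach()
        # One plausible loss: match predicted and target prototype distributions.
        return F.kl_div(pred_lang.log_softmax(-1), tgt_lang.softmax(-1),
                        reduction='batchmean')

# Example with random tensors standing in for real patch features.
model = FILSSketch()
loss = model(torch.randn(2, 196, 768), torch.randn(2, 196, 768))
```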

pdf

FILSPaper (1), 6.72 MB
Author's Accepted Manuscript, Open Access
