Fine-Grained Vision-Language Learning

Xiao Han

doi:10.15126/thesis.901156

Doctoral Thesis

Open access

University of Surrey

Doctor of Philosophy (PhD), University of Surrey

DOI: https://doi.org/10.15126/thesis.901156

Abstract

Deep Learning

Vision-Language

In the ever-evolving landscape of Vision-Language (V+L) learning, the synergy between visual and textual information has proven pivotal for a multitude of tasks, ranging from discriminative to generative objectives. Nevertheless, in specific fine-grained contexts within practical applications, such as e-commerce and human-related modeling, the intricate characteristics of individual instances are difficult to represent, distinguish, and generate. Generic V+L methods often struggle in these nuanced situations because they lack specialized designs that address the unique attributes inherent to fine-grained tasks. In light of these challenges, this thesis studies fine-grained vision-language learning and proposes solutions for typical fine-grained V+L cases.

Three contributions are made in this thesis. First, we investigate how to learn better fine-grained V+L representations. We present novel pre-training objectives specifically tailored to the unique attributes of the fashion domain, along with a flexible and versatile pre-training architecture. This approach is designed to offer more discriminative and generalizable features, enhancing the performance of a wide range of downstream tasks in the fashion domain. Second, we study how to parameter-efficiently unify fine-grained heterogeneous V+L tasks in a multi-task model. We propose two lightweight adapters and a stable optimization strategy to support training a V+L model simultaneously across multiple heterogeneous tasks. The resulting model outperforms independently trained single-task models in discriminative and generative downstream tasks (including cross-modal matching, multi-modal recognition, and image-to-text generation) with significant parameter savings. Finally, we explore how to use natural language to create fine-grained visual content: 3D head avatars. Building upon the foundation of 2D text-to-image diffusion models, we enhance the diffusion process by incorporating 3D awareness of head priors and enable fine-grained editing through the proposed identity-aware score distillation method, resulting in superior fidelity and editing capabilities.

Files and links (1)

PhD_Thesis_Xiao_Han_revised (PDF, 65.76 MB)

CC BY-NC-SA V4.0, Open Access

Details

Title: Fine-Grained Vision-Language Learning
Creators: Xiao Han - University of Surrey, School of Computer Science and Electronic Engineering
Contributors: Tao Xiang (Supervisor) - University of Surrey, School of Computer Science and Electronic Engineering
Awarding Institution: University of Surrey; Doctor of Philosophy (PhD)
Theses and Dissertations: Doctor of Philosophy (PhD), University of Surrey
Publisher: University of Surrey
Number of pages: 129
Grant note: iFlyTek
Identifiers: 99898266502346
Academic Unit: School of Computer Science and Electronic Engineering
Resource Type: Doctoral Thesis
