Multimodal Conditioning for Controllable Image and Video Generation

Soon Yau Cheong

doi:10.15126/thesis.901533

Generative AI has progressed rapidly and can now produce high-quality images and videos from text prompts. This progress has increased user demand for precise control over the output, which poses new challenges in directing the generation process. Standard conditioning techniques, which mainly use text and image inputs, are useful but remain limited in handling more complex requirements such as a specific human pose, camera orientation or fine-grained visual appearance. This PhD research enhances conditioning techniques by introducing a parametric approach that emphasises multimodal conditioning for image and video generation models. It focuses on developing methods that give users more comprehensive control, incorporating modalities such as pose and spatial inputs to improve the alignment between model capabilities and user intentions across different aspects of generation. By refining the conditioning mechanisms, this research aims to bridge the gap between user specifications and model outputs, ensuring greater flexibility, precision and coherence in generated content.

This thesis presents several key contributions. Traditional methods for human pose conditioning rely on skeleton images, which contain substantial redundancy and are computationally inefficient for modern architectures. To address this, we proposed the concept of the pose token, in which raw pose parameters are compressed into tokens that serve as conditioning elements via attention mechanisms, a common approach in advanced architectures. We validated this token-based approach with both 2D body keypoints and 3D body parameters, demonstrating its effectiveness across multiple architectures, from transformers to diffusion models. Additionally, our parametric approach introduces new techniques for human and camera pose interpolation within image generation.

A common approach to conditioning diffusion models is to incorporate adapters, lightweight models that deliver control signals to pre-trained image models. However, our research revealed that this method often introduces a critical issue of mode conflict. This problem, worsened by cascading multiple adapters, results from an imbalance in control signals: the model can become dominated by one adapter, limiting the generative power of both the base model and the other adapters. Despite its prevalence, this issue remains largely unaddressed in existing research. To solve it, we devised a unified adapter architecture that integrates both structural and visual conditioning within a single, harmonised control pathway. This unified approach delivers balanced multimodal conditioning, avoiding the pitfalls of adapter cascades and enabling greater model flexibility. As a result, the high controllability of our approach supports versatile human image generation and editing tasks.

Our research in 2D image generation was then extended to video generation. Our study demonstrated that the architectural differences in transformer-based diffusion models make existing camera control methods designed for U-Net-based diffusion models ineffective. Through extensive experimentation, we identified optimal architectures and camera representations. Combined with our novel camera motion guidance, these restored camera control for video diffusion transformers, with motion boosted by over 400%. Our research on human pose conditioning for images also extends to video generation. Unlike existing methods that require detailed camera pose input for every frame, our approach achieves smooth video motion with minimal input. By specifying only the initial and final camera poses, our system interpolates between frames to produce continuous camera movements, enabling consistent, controlled video generation with reduced data requirements. This sparse video conditioning approach significantly simplifies user interaction while ensuring fluid transitions and stable pose dynamics across frames.

Many of the challenges we aimed to address were novel and lacked established evaluation methods. We therefore proposed new evaluation metrics to assess these areas rigorously. One of these, People Count Error (PCE), identifies a type of error specific to AI-generated human images, such as inaccurate body part generation. This metric has already gained traction in the research community and is being adopted in image generation benchmarks, helping to set new standards for evaluating the quality of AI-generated human images.
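To make the idea of sparse camera conditioning concrete, the sketch below fills in per-frame camera poses from only an initial and a final pose. It is a generic geometric illustration and not the method of the thesis, which performs the interpolation within the generative model; the abstract does not specify how. The function names `slerp` and `interpolate_camera_poses`, the pose layout (a 3D position plus a unit quaternion in `(w, x, y, z)` order) and the choice of linear and spherical linear interpolation are all assumptions made for this example.

```python
import math


def slerp(q0, q1, t):
    """Spherical linear interpolation between unit quaternions (w, x, y, z)."""
    dot = sum(a * b for a, b in zip(q0, q1))
    # q and -q encode the same rotation; flip one so we follow the shorter arc.
    if dot < 0.0:
        q1 = tuple(-c for c in q1)
        dot = -dot
    dot = min(dot, 1.0)  # guard acos against rounding slightly above 1
    theta = math.acos(dot)
    if theta < 1e-6:
        # Nearly identical rotations: plain linear blending is stable here.
        out = tuple(a + t * (b - a) for a, b in zip(q0, q1))
    else:
        s = math.sin(theta)
        w0 = math.sin((1.0 - t) * theta) / s
        w1 = math.sin(t * theta) / s
        out = tuple(w0 * a + w1 * b for a, b in zip(q0, q1))
    norm = math.sqrt(sum(c * c for c in out))
    return tuple(c / norm for c in out)


def interpolate_camera_poses(start, end, num_frames):
    """Return num_frames (position, quaternion) poses from start to end.

    Hypothetical helper: start and end are (position, quaternion) pairs.
    Positions are blended linearly, orientations with slerp.
    """
    if num_frames < 2:
        raise ValueError("num_frames must be at least 2")
    (p0, q0), (p1, q1) = start, end
    poses = []
    for i in range(num_frames):
        t = i / (num_frames - 1)
        position = tuple(a + t * (b - a) for a, b in zip(p0, p1))
        poses.append((position, slerp(q0, q1, t)))
    return poses


# Example: move 2 units along x while turning 90 degrees about the y axis.
start_pose = ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0, 0.0))
end_pose = ((2.0, 0.0, 0.0),
            (math.cos(math.pi / 4), 0.0, math.sin(math.pi / 4), 0.0))
trajectory = interpolate_camera_poses(start_pose, end_pose, num_frames=5)
```

In this sketch the user supplies two poses and the helper produces one pose per frame, which is the kind of reduction in input that sparse conditioning aims for; a learned model could replace the fixed interpolation rule with one inferred from data.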
