Abstract
Generative AI has progressed rapidly, and models can now produce high-quality images and
videos from text prompts. This progress has also increased user demand for precise control over the
outcomes, posing new challenges in effectively directing generation processes. Standard conditioning
techniques, mainly using text and image inputs, have proven useful but remain limited in handling
more complex requirements, such as specific human pose, camera orientation or fine-grained visual
appearance. This PhD research enhances conditioning techniques by introducing a parametric approach
that emphasises multimodal conditioning for image and video generation models. It focuses on
developing methods to enable more comprehensive user control, incorporating various modalities such
as pose and spatial inputs to improve alignment between model capabilities and user intentions across
different aspects of generation. By refining the conditioning mechanisms, this research aims to bridge the
gap between user specifications and model outputs, ensuring greater flexibility, precision, and coherence
in generated content.
This thesis presents several key contributions. Traditional methods for human pose conditioning,
which rely on skeleton images, contain substantial redundancy and are computationally inefficient for
modern architectures. To address this, we propose the pose token: raw pose parameters
are compressed into tokens that serve as conditioning inputs via attention mechanisms, which are
common in modern architectures. We validate this token-based approach with both 2D body keypoints
and 3D body parameters, demonstrating its effectiveness across multiple architectures, from transformers
to diffusion models. Additionally, our parametric approach enables human and camera pose
interpolation within image generation.
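
To illustrate the idea of conditioning on compact pose tokens rather than skeleton images, the following is a minimal sketch and not the implementation used in this thesis. It assumes 2D keypoints given as (x, y) pairs, a fixed grouping of keypoints per token, and a random matrix standing in for a learned projection; the names `make_projection`, `pose_to_tokens` and `cross_attend` are hypothetical.

```python
import math
import random


def make_projection(in_dim, out_dim, seed=0):
    """Random stand-in for a learned linear layer (assumption: weights are not trained)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(in_dim)
    return [[rng.uniform(-scale, scale) for _ in range(in_dim)] for _ in range(out_dim)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def pose_to_tokens(keypoints, group_size, projection):
    """Flatten (x, y) keypoints, split them into groups, and project each group to one token."""
    if len(keypoints) % group_size != 0:
        raise ValueError("number of keypoints must be divisible by group_size")
    flat = [coord for point in keypoints for coord in point]
    step = group_size * 2  # two coordinates per keypoint
    groups = [flat[i:i + step] for i in range(0, len(flat), step)]
    return [matvec(projection, group) for group in groups]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(query, tokens):
    """Single-head attention in which pose tokens act as both keys and values."""
    dim = len(query)
    scores = [sum(q * k for q, k in zip(query, tok)) / math.sqrt(dim) for tok in tokens]
    weights = softmax(scores)
    attended = [sum(w * tok[d] for w, tok in zip(weights, tokens)) for d in range(dim)]
    return attended, weights


# Hypothetical example: 18 normalised keypoints become 3 tokens of width 8.
keypoints = [(i / 18.0, ((i * 7) % 18) / 18.0) for i in range(18)]
projection = make_projection(in_dim=12, out_dim=8)
tokens = pose_to_tokens(keypoints, group_size=6, projection=projection)
image_feature = [0.1] * 8  # stand-in for one image-latent query vector
attended, weights = cross_attend(image_feature, tokens)
```

In this toy setting the pose is carried by 36 numbers compressed into three tokens, which an attention layer can consume directly, whereas a rendered skeleton image spends most of its pixels on empty background.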
A common approach for conditioning diffusion models involves incorporating adapters, lightweight
models that deliver control signals to pre-trained image models. However, our research has revealed that
this method often introduces a critical issue of mode conflict. This problem, worsened by cascading
multiple adapters, results from an imbalance in control signals: the model can become dominated by one
adapter, limiting the generative power of both the base model and other adapters. Despite its prevalence,
this issue remains largely unaddressed in existing research. To solve this, we devised a unified adapter
architecture that integrates both structural and visual conditioning within a single, harmonised control
pathway. This unified approach delivers balanced multimodal conditioning, avoiding the pitfalls of
adapter cascade and enabling greater model flexibility. As a result, our approach’s high controllability
empowers versatile human image generation and editing tasks.
We then extend our research from 2D image generation to video generation. We show that camera
control methods designed for U-Net-based diffusion models become ineffective in transformer-based
diffusion models because of architectural differences. Through extensive experimentation, we identify
optimal architectures and camera representations. Combined with our novel camera motion guidance,
these restore camera control for video diffusion transformers and increase motion by over 400%. Our research
on human pose conditioning for images extends to video generation. Unlike existing methods that require
detailed camera pose input for every frame, our approach achieves smooth video motion with minimal
input. By specifying only the initial and final camera poses, our system interpolates between frames to
produce continuous camera movements, enabling consistent, controlled video generation with reduced
data requirements. This sparse video conditioning approach significantly simplifies user interaction
while ensuring fluid transitions and stable pose dynamics across frames, pushing the boundaries of
efficient and user-friendly video generation.
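
The abstract does not specify how the intermediate camera poses are obtained, and in the thesis the interpolation may be learned by the model itself. As a purely geometric illustration of specifying only the first and last camera pose, the sketch below assumes a pose is a position (x, y, z) plus a unit quaternion (w, x, y, z), and uses linear interpolation for position and spherical linear interpolation (slerp) for rotation. The function names are hypothetical.

```python
import math


def lerp(a, b, t):
    """Linear interpolation between two equal-length vectors."""
    return [x + (y - x) * t for x, y in zip(a, b)]


def normalize(q):
    """Scale a quaternion to unit length."""
    norm = math.sqrt(sum(c * c for c in q))
    return [c / norm for c in q]


def slerp(q0, q1, t):
    """Spherical linear interpolation between two quaternions (w, x, y, z)."""
    q0, q1 = normalize(q0), normalize(q1)
    dot = sum(a * b for a, b in zip(q0, q1))
    if dot < 0.0:
        # q and -q encode the same rotation; flip one to take the shorter arc.
        q1 = [-c for c in q1]
        dot = -dot
    if dot > 0.9995:
        # Nearly identical rotations: fall back to normalised lerp to avoid dividing by ~0.
        return normalize(lerp(q0, q1, t))
    theta = math.acos(dot)
    sin_theta = math.sin(theta)
    w0 = math.sin((1.0 - t) * theta) / sin_theta
    w1 = math.sin(t * theta) / sin_theta
    return [w0 * a + w1 * b for a, b in zip(q0, q1)]


def interpolate_camera_path(start, end, num_frames):
    """Return one (position, quaternion) pair per frame from two end poses."""
    if num_frames < 2:
        raise ValueError("num_frames must be at least 2")
    path = []
    for i in range(num_frames):
        t = i / (num_frames - 1)
        path.append((lerp(start[0], end[0], t), slerp(start[1], end[1], t)))
    return path


# Hypothetical example: move 2 units along x while turning 90 degrees about the y axis.
start_pose = ([0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0])
end_pose = ([2.0, 0.0, 0.0], [math.cos(math.pi / 4), 0.0, math.sin(math.pi / 4), 0.0])
camera_path = interpolate_camera_path(start_pose, end_pose, num_frames=5)
```

Slerp is used for the rotation because interpolating quaternion components linearly does not rotate at a constant angular rate, which would show up as uneven camera motion across frames.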
Many of the challenges we address are novel and lack established evaluation methods.
We therefore propose new evaluation metrics to assess them rigorously. One of these, People
Count Error (PCE), captures errors specific to AI-generated human images, such as
inaccurate body part generation. This metric has already gained traction in the research community and
is being adopted in image generation benchmarks, helping to set new standards for evaluating AI-driven
human image quality.
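
The abstract does not give a formula for PCE. As an illustration only, the sketch below assumes one plausible reading suggested by the metric's name: the fraction of generated images in which the number of people found by a detector differs from the number intended. The per-image counts are taken as given (the detector is outside this sketch), and the function name `people_count_error` is hypothetical.

```python
def people_count_error(detected_counts, expected_counts):
    """Fraction of images whose detected person count differs from the expected count.

    Assumption: this definition is inferred from the metric's name, not taken from the thesis.
    """
    if len(detected_counts) != len(expected_counts):
        raise ValueError("both lists must have one entry per image")
    if not detected_counts:
        raise ValueError("at least one image is required")
    mismatches = sum(1 for d, e in zip(detected_counts, expected_counts) if d != e)
    return mismatches / len(detected_counts)


# Hypothetical example: four single-person prompts, two of which yield the wrong count.
pce = people_count_error(detected_counts=[1, 2, 1, 0], expected_counts=[1, 1, 1, 1])
```

Under this reading, a lower value is better, and a value of 0 means every generated image contains the intended number of people.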