Abstract
Although camera control methods for U-Net based video diffusion models have advanced rapidly, these methods have been shown to be ineffective for transformer-based diffusion models (DiT). In this paper, we investigate the underlying causes of this issue and propose solutions. Our study reveals that camera control performance depends heavily on the choice of conditioning method, rather than on the camera pose representation, as is commonly believed. To address the persistent motion degradation in DiT, we introduce Camera Motion Guidance (CMG), a classifier-free guidance approach that boosts camera motion by over 400%. Additionally, we present a sparse camera control pipeline that improves training data efficiency and simplifies the process of specifying camera poses for long videos. Project page at https://soon-yau.github.io/CameraMotionGuidance.