Abstract
In the field of media production, video editing techniques play a pivotal role. Recent
approaches have had great success at novel view synthesis of static scenes, but adding
temporal information introduces an extra layer of complexity. Previous
models have focused on implicitly representing static and dynamic scenes with neural
radiance fields (NeRF). These models achieve impressive results but are costly at both
training and inference time: they overfit a multi-layer perceptron (MLP) to describe each
scene implicitly as a function of position. This
paper proposes ZeST-NeRF, a new approach that can produce temporal NeRFs for new
scenes without retraining. Our method accurately reconstructs novel views using multi-view
synthesis techniques and scene flow-field estimation, trained only on unrelated scenes.
We show that existing state-of-the-art approaches from a range of fields cannot
adequately solve this new task, and we demonstrate the efficacy of our solution. The resulting
network improves quantitatively by 15% and produces significantly better visual results.