Frame In-N-Out: Unbounded Controllable Image-to-Video Generation

¹University of Virginia   ²Adobe Research

Frame In-N-Out: Unlocking the Unbounded Canvas

Key Contributions: (a) The first work to explore the Frame In and Frame Out pattern in controllable video generation. (b) A curated dataset and accompanying metadata are introduced to support this setting. (c) A novel video Diffusion Transformer is proposed, efficiently handling pixel-aligned motion and unaligned identity references for stable generation. (d) A new evaluation protocol, including testing data and metrics, is developed to assess Frame In and Frame Out performance, with practical applications in filmmaking and advertising.

Overview

Abstract: Controllability, temporal coherence, and detail synthesis remain the most critical challenges in video generation. In this paper, we focus on a commonly used yet underexplored cinematic technique known as Frame In and Frame Out. Specifically, starting from image-to-video generation, users can control objects in the image to naturally leave the scene, or provide new identity references that enter the scene, guided by user-specified motion trajectories. To support this task, we introduce a new semi-automatically curated dataset, a comprehensive evaluation protocol targeting this setting, and an efficient identity-preserving, motion-controllable video Diffusion Transformer architecture. Our evaluation shows that our proposed approach significantly outperforms existing baselines.

Data Curation

We design a fully automatic data curation pipeline to support the Frame In and Frame Out video generation paradigm. Starting from raw videos, our pipeline curates a dataset rich in text prompts, ID references with corresponding motion trajectories, and partitions between the first-frame and canvas regions for the Frame In and Frame Out pattern.

Frame In-N-Out Data Curation Pipeline

(a) High-quality videos are filtered by metadata, image quality, scene cuts, and camera motion detection. (b) Panoptic segmentation identifies movable objects in key frames. (c) Robust tracking with CoTracker3 ensures trajectory accuracy, followed by bounding box regression to define Frame In or Frame Out cases. (d) Bounding boxes with arbitrary aspect ratios and sizes are randomly generated to find an ideal partition between the first-frame and canvas regions for training.
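To make the Frame In / Frame Out labeling in step (c) concrete, the sketch below shows one simple way a tracked bounding-box trajectory could be labeled relative to the first-frame region inside the larger canvas. The function names and overlap threshold are illustrative assumptions, not the paper's released code.

    from typing import List, Tuple

    Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in canvas coordinates

    def overlap_ratio(box: Box, region: Box) -> float:
        """Fraction of the object's box that lies inside the first-frame region."""
        x0, y0 = max(box[0], region[0]), max(box[1], region[1])
        x1, y1 = min(box[2], region[2]), min(box[3], region[3])
        inter = max(0.0, x1 - x0) * max(0.0, y1 - y0)
        area = max(1e-6, (box[2] - box[0]) * (box[3] - box[1]))
        return inter / area

    def label_trajectory(boxes: List[Box], first_frame: Box, thresh: float = 0.1) -> str:
        """Label a per-frame box trajectory as 'frame_in', 'frame_out', or 'neither'."""
        start_inside = overlap_ratio(boxes[0], first_frame) > thresh
        end_inside = overlap_ratio(boxes[-1], first_frame) > thresh
        if not start_inside and end_inside:
            return "frame_in"   # object enters the visible region from the canvas
        if start_inside and not end_inside:
            return "frame_out"  # object leaves the visible region into the canvas
        return "neither"

    if __name__ == "__main__":
        first_frame = (0.0, 0.0, 640.0, 360.0)  # visible region inside a larger canvas
        # Object drifting rightward until it exits the first-frame region.
        traj = [(600.0 + 20.0 * t, 100.0, 660.0 + 20.0 * t, 160.0) for t in range(8)]
        print(label_trajectory(traj, first_frame))  # frame_out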

Architecture

We propose a video Diffusion Transformer architecture that unifies spatiotemporal pixel-aligned motion conditioning and unaligned identity (ID) reference conditioning for Frame In and Frame Out generation. We adopt a two-stage training procedure. In the first stage, we train with motion control and text prompts to learn conditioning alignment. In the second stage, we incorporate Frame In and Frame Out data with unbounded canvas support and identity reference images.
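For clarity, the following is a minimal sketch of how such a two-stage schedule could be expressed as a configuration; the stage and field names are assumptions for illustration, not the actual training configuration.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class StageConfig:
        name: str
        conditions: List[str]           # which conditioning signals are active
        unbounded_canvas: bool = False  # whether the latent extends beyond the first-frame region

    stages = [
        # Stage 1: learn conditioning alignment from motion trajectories and text prompts.
        StageConfig(name="stage1_motion_text", conditions=["text", "motion"]),
        # Stage 2: add Frame In / Frame Out data, the expanded canvas, and ID reference images.
        StageConfig(name="stage2_frame_in_n_out",
                    conditions=["text", "motion", "id_reference"],
                    unbounded_canvas=True),
    ]

    for stage in stages:
        print(stage.name, stage.conditions, "unbounded canvas:", stage.unbounded_canvas)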

Frame In-N-Out Architecture

Training Architecture. Our video Diffusion Transformer takes as input the first frame expanded onto the canvas, the motion condition, the identity reference, and the text prompt. These are encoded by a shared Causal 3D VAE and combined along both the channel and frame dimensions, with padding where needed, to form a unified and efficient conditioning scheme for image-to-video generation.
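The sketch below illustrates one plausible way this channel-wise and frame-wise combination could look in latent space; the tensor shapes, zero-padding, and packing order are assumptions for illustration rather than the paper's exact layout.

    import torch

    B, C, T, H, W = 1, 16, 13, 60, 104       # latent-space sizes (illustrative)

    noisy_latent = torch.randn(B, C, T, H, W)   # canvas-sized video latent
    first_frame  = torch.randn(B, C, 1, H, W)   # VAE latent of the expanded first frame
    motion_cond  = torch.randn(B, C, T, H, W)   # encoded trajectory maps, pixel-aligned
    id_reference = torch.randn(B, C, 1, H, W)   # VAE latent of the ID reference image

    # Pixel-aligned conditions: pad the first-frame latent to T frames, then stack
    # along the channel dimension so every spatial token carries its own conditioning.
    first_frame_padded = torch.cat([first_frame, torch.zeros(B, C, T - 1, H, W)], dim=2)
    aligned = torch.cat([noisy_latent, first_frame_padded, motion_cond], dim=1)   # (B, 3C, T, H, W)

    # Unaligned ID reference: pad its channels to match, then append it as an extra
    # frame so the transformer can attend to it without assuming spatial alignment.
    id_padded = torch.cat([id_reference, torch.zeros(B, aligned.shape[1] - C, 1, H, W)], dim=1)
    tokens = torch.cat([aligned, id_padded], dim=2)                               # (B, 3C, T + 1, H, W)

    print(tokens.shape)  # torch.Size([1, 48, 14, 60, 104])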

Frame In-N-Out Architecture

Inference Architecture. The first frame is first expanded to the target canvas size, and the initial noisy latent matches the canvas region. Motion trajectories can be specified anywhere on the full canvas. The ID reference image is scaled and padded to the same size as the canvas region. The generated video, once decoded to pixel space, is cropped back to the size of the first-frame region.
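As a rough illustration of this canvas handling, the sketch below expands a first frame onto a larger canvas, fits an ID reference to the canvas size, and crops the decoded video back to the first-frame region. All sizes, the placement offset, and the placeholder tensors are assumptions for illustration.

    import torch
    import torch.nn.functional as F

    frame_h, frame_w   = 480, 832      # original first-frame size
    canvas_h, canvas_w = 480, 1216     # expanded canvas (room to frame in/out on the right)
    off_y, off_x       = 0, 0          # where the first frame sits inside the canvas

    first_frame = torch.rand(3, frame_h, frame_w)

    # Expand the first frame onto the canvas (the unseen area is left as zeros here).
    canvas = torch.zeros(3, canvas_h, canvas_w)
    canvas[:, off_y:off_y + frame_h, off_x:off_x + frame_w] = first_frame

    # Scale the ID reference to fit, then pad it to the canvas size.
    id_ref = torch.rand(3, 256, 256)
    scale = min(canvas_h / id_ref.shape[1], canvas_w / id_ref.shape[2])
    new_h, new_w = int(id_ref.shape[1] * scale), int(id_ref.shape[2] * scale)
    id_resized = F.interpolate(id_ref[None], size=(new_h, new_w), mode="bilinear",
                               align_corners=False)[0]
    id_padded = torch.zeros(3, canvas_h, canvas_w)
    id_padded[:, :new_h, :new_w] = id_resized

    # After sampling and VAE decoding, crop the video back to the first-frame region.
    decoded_video = torch.rand(49, 3, canvas_h, canvas_w)   # (frames, C, H, W), placeholder
    output = decoded_video[:, :, off_y:off_y + frame_h, off_x:off_x + frame_w]
    print(output.shape)  # torch.Size([49, 3, 480, 832])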

Generated Video Samples

Fishing Boat · Men Walking · Balloon Flying

Citation


  @article{wang2025frameinnout,
    title={Frame In-N-Out: Unbounded Controllable Image-to-Video Generation},
    author={Boyang Wang and Xuweiyi Chen and Matheus Gadelha and Zezhou Cheng},
    year={2025},
    eprint={2505.21491},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2505.21491},
  }