Frame In-N-Out: Unbounded Controllable Image-to-Video Generation

¹University of Virginia   ²Adobe Research

Frame In-N-Out: Unlocking the Unbounded Canvas

Key Contributions: (a) The first work to explore the Frame In and Frame Out pattern in controllable video generation. (b) A curated dataset and accompanying metadata are introduced to support this setting. (c) A novel video Diffusion Transformer is proposed, efficiently handling pixel-aligned motion and unaligned identity references for stable generation. (d) A new evaluation protocol, including testing data and metrics, is developed to assess Frame In and Frame Out performance, with practical applications in filmmaking and advertising.

Overview

Abstract: Controllability, temporal coherence, and detail synthesis remain the most critical challenges in video generation. In this paper, we focus on a commonly used yet underexplored cinematic technique known as Frame In and Frame Out. Specifically, starting from image-to-video generation, users can control objects in the image to naturally leave the scene, or provide new identity references that enter the scene, guided by user-specified motion trajectories. To support this task, we introduce a new semi-automatically curated dataset, a comprehensive evaluation protocol targeting this setting, and an efficient identity-preserving, motion-controllable video Diffusion Transformer architecture. Our evaluation shows that our proposed approach significantly outperforms existing baselines.

Data Curation

We design a fully automatic data curation pipeline to support the Frame In and Frame Out video generation paradigm. Starting from raw videos, our pipeline curates a dataset rich in text prompts, ID references with corresponding motion trajectories, and partitions between the first-frame and canvas regions for the Frame In and Frame Out pattern.

Frame In-N-Out Data Curation Pipeline

(a) High-quality videos are filtered by metadata, image quality, scene cuts, and camera motion detection. (b) Panoptic segmentation identifies movable objects in key frames. (c) Robust tracking with CoTracker3 ensures trajectory accuracy, followed by bounding box regression to define Frame In or Frame Out cases. (d) Bounding boxes with arbitrary aspect ratios and sizes are randomly generated to find an ideal partition between the first-frame and canvas regions for training.
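To make the Frame In / Frame Out labeling in step (c) concrete, the sketch below shows one simple way a tracked bounding-box trajectory could be labeled relative to the first-frame region inside the larger canvas. The function names and overlap threshold are illustrative assumptions, not the paper's released code.

    from typing import List, Tuple

    Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in canvas coordinates

    def overlap_ratio(box: Box, region: Box) -> float:
        """Fraction of the object's box that lies inside the first-frame region."""
        x0, y0 = max(box[0], region[0]), max(box[1], region[1])
        x1, y1 = min(box[2], region[2]), min(box[3], region[3])
        inter = max(0.0, x1 - x0) * max(0.0, y1 - y0)
        area = max(1e-6, (box[2] - box[0]) * (box[3] - box[1]))
        return inter / area

    def label_trajectory(boxes: List[Box], first_frame: Box, thresh: float = 0.1) -> str:
        """Label a per-frame box trajectory as 'frame_in', 'frame_out', or 'neither'."""
        start_inside = overlap_ratio(boxes[0], first_frame) > thresh
        end_inside = overlap_ratio(boxes[-1], first_frame) > thresh
        if not start_inside and end_inside:
            return "frame_in"   # object enters the visible region from the canvas
        if start_inside and not end_inside:
            return "frame_out"  # object leaves the visible region into the canvas
        return "neither"

    if __name__ == "__main__":
        first_frame = (0.0, 0.0, 640.0, 360.0)  # visible region inside a larger canvas
        # Object drifting rightward until it exits the first-frame region.
        traj = [(600.0 + 20.0 * t, 100.0, 660.0 + 20.0 * t, 160.0) for t in range(8)]
        print(label_trajectory(traj, first_frame))  # frame_out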

Architecture

We propose a video Diffusion Transformer architecture that unifies spatiotemporal pixel-aligned motion conditioning and unaligned identity (ID) reference conditioning for Frame In and Frame Out generation. We adopt a two-stage training procedure. In the first stage, we train with motion control and text prompts to learn conditioning alignment. In the second stage, we incorporate Frame In and Frame Out data with unbounded canvas support and identity reference images.
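For clarity, the following is a minimal sketch of how such a two-stage schedule could be expressed as a configuration; the stage and field names are assumptions for illustration, not the actual training configuration.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class StageConfig:
        name: str
        conditions: List[str]           # which conditioning signals are active
        unbounded_canvas: bool = False  # whether the latent extends beyond the first-frame region

    stages = [
        # Stage 1: learn conditioning alignment from motion trajectories and text prompts.
        StageConfig(name="stage1_motion_text", conditions=["text", "motion"]),
        # Stage 2: add Frame In / Frame Out data, the expanded canvas, and ID reference images.
        StageConfig(name="stage2_frame_in_n_out",
                    conditions=["text", "motion", "id_reference"],
                    unbounded_canvas=True),
    ]

    for stage in stages:
        print(stage.name, stage.conditions, "unbounded canvas:", stage.unbounded_canvas)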

Frame In-N-Out Architecture

Training Architecture. Our video Diffusion Transformer takes as input the first frame expanded onto the canvas, the motion condition, the identity reference, and the text prompt. These are encoded by a shared Causal 3D VAE and combined along both the channel and frame dimensions, with padding where needed, to form a unified and efficient conditioning scheme for image-to-video generation.
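The sketch below illustrates one plausible way this channel-wise and frame-wise combination could look in latent space; the tensor shapes, zero-padding, and packing order are assumptions for illustration rather than the paper's exact layout.

    import torch

    B, C, T, H, W = 1, 16, 13, 60, 104       # latent-space sizes (illustrative)

    noisy_latent = torch.randn(B, C, T, H, W)   # canvas-sized video latent
    first_frame  = torch.randn(B, C, 1, H, W)   # VAE latent of the expanded first frame
    motion_cond  = torch.randn(B, C, T, H, W)   # encoded trajectory maps, pixel-aligned
    id_reference = torch.randn(B, C, 1, H, W)   # VAE latent of the ID reference image

    # Pixel-aligned conditions: pad the first-frame latent to T frames, then stack
    # along the channel dimension so every spatial token carries its own conditioning.
    first_frame_padded = torch.cat([first_frame, torch.zeros(B, C, T - 1, H, W)], dim=2)
    aligned = torch.cat([noisy_latent, first_frame_padded, motion_cond], dim=1)   # (B, 3C, T, H, W)

    # Unaligned ID reference: pad its channels to match, then append it as an extra
    # frame so the transformer can attend to it without assuming spatial alignment.
    id_padded = torch.cat([id_reference, torch.zeros(B, aligned.shape[1] - C, 1, H, W)], dim=1)
    tokens = torch.cat([aligned, id_padded], dim=2)                               # (B, 3C, T + 1, H, W)

    print(tokens.shape)  # torch.Size([1, 48, 14, 60, 104])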

Frame In-N-Out Architecture

Inference Architecture. The first frame is first expanded to the target canvas size, and the initial noisy latent matches the canvas region. Motion trajectories can be specified anywhere on the full canvas. The ID reference image is scaled and padded to the same size as the canvas region. The generated video, once decoded to pixel space, is cropped back to the size of the first-frame region.
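As a rough illustration of this canvas handling, the sketch below expands a first frame onto a larger canvas, fits an ID reference to the canvas size, and crops the decoded video back to the first-frame region. All sizes, the placement offset, and the placeholder tensors are assumptions for illustration.

    import torch
    import torch.nn.functional as F

    frame_h, frame_w   = 480, 832      # original first-frame size
    canvas_h, canvas_w = 480, 1216     # expanded canvas (room to frame in/out on the right)
    off_y, off_x       = 0, 0          # where the first frame sits inside the canvas

    first_frame = torch.rand(3, frame_h, frame_w)

    # Expand the first frame onto the canvas (the unseen area is left as zeros here).
    canvas = torch.zeros(3, canvas_h, canvas_w)
    canvas[:, off_y:off_y + frame_h, off_x:off_x + frame_w] = first_frame

    # Scale the ID reference to fit, then pad it to the canvas size.
    id_ref = torch.rand(3, 256, 256)
    scale = min(canvas_h / id_ref.shape[1], canvas_w / id_ref.shape[2])
    new_h, new_w = int(id_ref.shape[1] * scale), int(id_ref.shape[2] * scale)
    id_resized = F.interpolate(id_ref[None], size=(new_h, new_w), mode="bilinear",
                               align_corners=False)[0]
    id_padded = torch.zeros(3, canvas_h, canvas_w)
    id_padded[:, :new_h, :new_w] = id_resized

    # After sampling and VAE decoding, crop the video back to the first-frame region.
    decoded_video = torch.rand(49, 3, canvas_h, canvas_w)   # (frames, C, H, W), placeholder
    output = decoded_video[:, :, off_y:off_y + frame_h, off_x:off_x + frame_w]
    print(output.shape)  # torch.Size([49, 3, 480, 832])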

Generated Video Samples

Fishing Boat · Men Walking · Balloon Flying

Citation


  @article{wang2025frameinnout,
    title={Frame In-N-Out: Unbounded Controllable Image-to-Video Generation},
    author={Boyang Wang and Xuweiyi Chen and Matheus Gadelha and Zezhou Cheng},
    year={2025},
    eprint={2505.21491},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2505.21491},
  }