Instructional Video Generation

COG Research Group, University of Michigan

Given an input image as context and a text prompt describing the action, our proposed two-stage diffusion-based model generates high-quality instructional video frames that are consistent with the given instruction. Our model excels at capturing (i) large hand positional motion without background hallucinations, (ii) object state changes, and (iii) precise fingertip motions, ensuring clarity and focus in cluttered instructional video scenes.

More samples and comparisons between methods are shown in the gallery.

Abstract

Despite the recent strides in video generation, state-of-the-art methods still struggle with elements of visual detail. One particularly challenging case is the class of egocentric instructional videos in which the intricate motion of the hand coupled with a mostly stable and non-distracting environment is necessary to convey the appropriate visual action instruction. To address these challenges, we introduce a new method for instructional video generation. Our diffusion-based method incorporates two distinct innovations. First, we propose an automatic method to generate the expected region of motion, guided by both the visual context and the action text. Second, we introduce a critical hand structure loss to guide the diffusion model to focus on smooth and consistent hand poses. We evaluate our method on augmented instructional datasets based on EpicKitchens and Ego4D, demonstrating significant improvements over state-of-the-art methods in terms of instructional clarity, especially of the hand motion in the target region, across diverse environments and actions.

Problem Setting

Instructional Video Generation (IVG): The inputs are an image providing visual context and an action text prompt describing the task to be demonstrated. The outputs are generated video frames showing the action through detailed hand motion. Challenges include cluttered backgrounds and subtle, task-specific hand movements.
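To make the input-output contract concrete, below is a minimal, hypothetical sketch of the IVG interface in PyTorch-style Python. The class and function names, tensor shapes, and the example prompt are illustrative assumptions, not the paper's released code.

from dataclasses import dataclass
import torch

@dataclass
class IVGInput:
    """One IVG query: a single context frame plus an action text prompt."""
    context_image: torch.Tensor  # (3, H, W) RGB frame giving the visual context
    action_text: str             # e.g., "cut the carrot with the knife" (illustrative prompt)

def generate_instructional_clip(model, query: IVGInput, num_frames: int = 16) -> torch.Tensor:
    """Return a generated clip of shape (num_frames, 3, H, W) demonstrating the action."""
    with torch.no_grad():
        return model(query.context_image, query.action_text, num_frames=num_frames)

Any image-text-to-video generator fits this interface; the sections below concern how the frames are actually produced.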

Method Overview

Our method addresses the image-text-to-video generation problem for instructional content with a two-stage, backbone-shared approach. In Stage One, the model automatically predicts the Region of Motion (RoM), the spatial area in the input image where task-relevant motion occurs. In Stage Two, conditioned on this RoM, the model generates instructional video frames that focus on the action, avoiding distractions from cluttered backgrounds. Additionally, we introduce a hand structure loss, which ensures accurate and precise hand motions, critical for capturing subtle and task-specific fingertip movements, thereby enhancing the quality and clarity of instructional videos.
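The sketch below is one plausible reading of this two-stage pipeline and the hand structure loss, written in PyTorch-style Python. It assumes a backbone exposing hypothetical predict_rom and generate methods, and 2D hand keypoints obtained from an off-the-shelf hand pose estimator; it is not the authors' implementation, and the exact loss formulation and its weighting against the diffusion objective are given in the paper.

import torch
import torch.nn.functional as F

def two_stage_generation(backbone, image: torch.Tensor, text: str,
                         num_frames: int = 16) -> torch.Tensor:
    """Stage One predicts the Region of Motion (RoM); Stage Two generates frames
    conditioned on it. `image` is a (3, H, W) context frame; the result is
    (num_frames, 3, H, W)."""
    # Stage One: the shared backbone predicts a soft (1, H, W) mask marking where
    # task-relevant motion is expected, from the image and the action text.
    # (`predict_rom` is an assumed method name, not a real API.)
    rom_mask = backbone.predict_rom(image, text)

    # Stage Two: the same backbone, now also conditioned on the RoM mask, denoises
    # a video latent into frames that keep the action inside the masked region.
    # (`generate` is likewise an assumed method name.)
    return backbone.generate(image, text, mask=rom_mask, num_frames=num_frames)

def hand_structure_loss(pred_joints: torch.Tensor,
                        ref_joints: torch.Tensor,
                        visibility: torch.Tensor) -> torch.Tensor:
    """Penalize deviation of hand joints in generated frames from reference joints.

    pred_joints, ref_joints: (B, T, J, 2) 2D joint coordinates per frame
    (e.g., J = 21 per hand); visibility: (B, T, J) mask of joints visible in
    the reference video.
    """
    # Per-joint squared error, summed over the (x, y) coordinates.
    per_joint = F.mse_loss(pred_joints, ref_joints, reduction="none").sum(dim=-1)  # (B, T, J)
    # Average only over joints that are actually visible in the reference.
    masked = per_joint * visibility
    return masked.sum() / visibility.sum().clamp(min=1.0)

During training, such a structural term would be added to the standard diffusion loss with a weighting coefficient, encouraging the denoised frames to preserve plausible hand poses throughout the clip.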

Qualitative Results

Visit Gallery

Quantitative Results

Quantitative results on EpicKitchens, Ego4D, and a Motion-Intensive subset of EpicKitchens. Our method outperforms all baselines across all metrics on at least one dataset.

Ablation Study

Ablation study on the EpicKitchens dataset, examining the impact of Region of Motion (RoM) mask generation (“Mask”) and Hand Structure Loss (“Hand”) on model performance. Each component is evaluated individually and in combination, demonstrating their contributions to improved visual quality and consistency across frames and videos.

BibTeX

@misc{li2024instructionalvideogeneration,
  title={Instructional Video Generation},
  author={Yayuan Li and Zhi Cao and Jason J. Corso},
  year={2024},
  eprint={2412.04189},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2412.04189},
}