Qualitative Results

Comparing against all baselines, we highlight three key capabilities of our method, each attributable to one of our two key designs. Specifically, we observe good performance when generating instructional videos for actions with (i) large hand positional motion (without hallucination in the background); (ii) object state changes; and (iii) subtle dexterous fingertip motion. Our proposed learnable automatic Region of Motion mask generation helps the model focus on the correct region for the action ((i) and (ii)), avoiding distraction from the cluttered backgrounds commonly seen in instructional videos. Our novel hand structure loss helps the model handle actions with complex but essential fingertip motion ((iii)), which is rarely seen in previous non-instructional video generation benchmarks.

For each sample, the first row displays the input context image and the results from the different methods. The second row first presents the predicted Region of Motion mask from our Stage One model, followed by the motion flow of each generated video, placed below that video. Note that the motion flow is not part of the model’s outputs; it is shown solely to facilitate comparison among methods. Finally, we provide the Input Target Action Prompt (e.g., “Pick up and crack egg.”) and the Enriched Action Description (e.g., “The person uses their left hand to pick up an egg from the egg box and cracks it into a bowl.”).
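The motion flow panels are a visualization aid, and the text does not say how the flow is estimated or rendered. As a minimal sketch, assuming a dense flow field of per-pixel (dx, dy) displacements has already been estimated by some off-the-shelf method, one common color-coding maps motion direction to hue and motion speed to brightness. The names `flow_to_rgb` and `render_flow` are hypothetical and are not taken from the authors' code.

```python
import colorsys
import math


def flow_to_rgb(dx, dy, max_magnitude):
    """Map one flow vector to an (R, G, B) triple with channels in [0, 255].

    Assumed convention: hue encodes direction, brightness encodes speed.
    """
    magnitude = math.hypot(dx, dy)
    # Direction as an angle in [0, 2*pi], normalized to a hue in [0, 1].
    angle = math.atan2(dy, dx) % (2.0 * math.pi)
    hue = angle / (2.0 * math.pi)
    # Vectors at or above max_magnitude are drawn at full brightness.
    value = min(magnitude / max_magnitude, 1.0) if max_magnitude > 0 else 0.0
    r, g, b = colorsys.hsv_to_rgb(hue, 1.0, value)
    return (round(r * 255), round(g * 255), round(b * 255))


def render_flow(flow):
    """Color-code a 2D grid (list of rows) of (dx, dy) vectors.

    Brightness is scaled by the largest magnitude found in the grid,
    so a static region renders as black.
    """
    max_magnitude = max(
        (math.hypot(dx, dy) for row in flow for dx, dy in row),
        default=0.0,
    )
    return [[flow_to_rgb(dx, dy, max_magnitude) for dx, dy in row] for row in flow]
```

Under this convention, a method that hallucinates background motion would show colored regions outside the hand and object area, while a static background stays black, which is what makes the flow panels convenient for side-by-side comparison.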
Large Hand Positional Motion
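[Figure: two rows per sample. Row 1: Input Image, LFDM, AVDC, PIA, SORA, Animate Anything, Ours. Row 2: Input Image, Mask Image, and the motion flow of each method's video. Each sample is captioned with its input action prompt followed by its enriched action description.]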
"Pick up and crack egg." "The person uses the left hand to pick up an egg from the egg box and cracks it into a bowl."
"Throw paper into bin." "The person uses the left hand to open the bin and throw the paper into the bin using the right hand."
"Julienne carrot." "The person holds a carrot on the chopping board with the left hand and uses a knife in the right hand to julienne the carrot."
"Roll dough." "The person uses both hands to roll the dough into a ball."
"Shake soy milk." "The person holds a container of soy milk in the left hand and shakes it vigorously."
"Cut eggplant." "The person holds a cucumber with the left hand and uses a knife in the right hand to cut the cucumber."
Subtle Dexterous Fingertip Motion
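[Figure: two rows per sample. Row 1: Input Image, LFDM, AVDC, PIA, SORA, Animate Anything, Ours. Row 2: Input Image, Mask Image, and the motion flow of each method's video. Each sample is captioned with its input action prompt followed by its enriched action description.]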
"Peel garlic." "The person holds a garlic clove in the left hand and uses the right hand to peel the garlic."
"Wash fruit." "The person holds the fruit in the left hand and continues to wash it under the running tap using the right hand."
"Knit the fabric." "The person uses the crochet in the right hand to knit the fabric held in the left hand."
Object State Changing
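[Figure: two rows per sample. Row 1: Input Image, LFDM, AVDC, PIA, SORA, Animate Anything, Ours. Row 2: Input Image, Mask Image, and the motion flow of each method's video. Each sample is captioned with its input action prompt followed by its enriched action description.]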
"Pour vinegar into bowl." "The person holds a bottle of vinegar in the left hand and pours it into a bowl, adding it to the mixture."
"Stir the pasta." "The person holds a wooden spoon in the right hand and continues to stir the pasta in the pan on the hob, while using the left hand to support the pan."
"Drop garlic into fridge." "The person uses the right hand to open the fridge and drops a pack of garlic from the left hand into the fridge."