HANDI -- Qualitative Results on Baseline Comparison

Comparing against all baselines highlights three key capabilities of our method, which are attributable to its two key designs: Motion Area (MA) Generation and the Hand Refinement Loss (HRL). Specifically, we observe good performance when generating hand-centric videos for actions that require (i) intricate fingertip motion, (ii) a large Motion Area, or (iii) an object state change. Our learnable, automatic Motion Area mask generation helps the model focus on the region where the action takes place (ii and iii) and avoids introducing distracting changes in the background. Our novel Hand Refinement Loss helps the model handle actions with critical fine details (i), which is under-explored in previous work.

In this gallery, for each sample, the first row displays the input context image followed by the results from the different methods. The second row first presents the Motion Area mask predicted by our Stage 1 model and then shows, below each generated video, the motion flow of that video. Note that the motion flow is not part of the model's outputs; it is shown solely to facilitate easier comparison among methods. Below the visuals, we provide the input Text Prompt for the target action (e.g., "Pick up and crack egg.") and the Enriched Action Description (e.g., "The person uses their left hand to pick up an egg from the egg box and cracks it into a bowl.").
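For readers unfamiliar with motion flow visualizations, the following is a minimal sketch of one common color-coding convention: hue encodes the direction of motion and saturation encodes its magnitude, so static regions appear white. It is not the rendering code used for this gallery. The function names `flow_to_rgb` and `colorize_flow` are hypothetical, and the flow field is assumed to have been estimated beforehand by some off-the-shelf optical flow method.

```python
import colorsys
import math


def flow_to_rgb(dx, dy, max_magnitude):
    """Map one flow vector (dx, dy) to an (r, g, b) tuple with values in 0..255.

    Hue encodes direction; saturation encodes magnitude relative to
    max_magnitude. Zero motion therefore maps to white.
    """
    magnitude = math.hypot(dx, dy)
    angle = math.atan2(dy, dx)  # radians in [-pi, pi]
    hue = ((angle + math.pi) / (2 * math.pi)) % 1.0
    saturation = min(magnitude / max_magnitude, 1.0) if max_magnitude > 0 else 0.0
    r, g, b = colorsys.hsv_to_rgb(hue, saturation, 1.0)
    return (round(r * 255), round(g * 255), round(b * 255))


def colorize_flow(flow):
    """Colorize a dense flow field.

    `flow` is a list of rows; each row is a list of (dx, dy) tuples
    (an assumed layout chosen for this illustration).
    Returns rows of RGB tuples with the same shape.
    """
    max_magnitude = max(
        (math.hypot(dx, dy) for row in flow for dx, dy in row),
        default=0.0,
    )
    return [[flow_to_rgb(dx, dy, max_magnitude) for dx, dy in row] for row in flow]


if __name__ == "__main__":
    # Tiny 2x2 flow field: one static pixel and three moving pixels.
    example = [
        [(0.0, 0.0), (1.0, 0.0)],
        [(-1.0, 0.0), (0.0, 0.5)],
    ]
    for row in colorize_flow(example):
        print(row)
```

Under this convention, a method that moves only the hands and the manipulated object produces a flow image that is colored inside the Motion Area and mostly white elsewhere, whereas unwanted background motion shows up as color outside it.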

Row 1 (input and generated videos): Input Image, Ours, DynamiCrafter, PIA, Animate Anything, AVDC, LFDM, CogVideoX, Open-Sora

Row 2 (mask and motion flow): Generated Motion Area Mask, followed by the motion flow for each method in the same order

"Knit the fabric." → "The person uses the crochet in the right hand to knit the fabric held in the left hand."

Row 1 (input and generated videos): Input Image, Ours, DynamiCrafter, PIA, Animate Anything, AVDC, LFDM, CogVideoX, Open-Sora

Row 2 (mask and motion flow): Generated Motion Area Mask, followed by the motion flow for each method in the same order

"Wash fruit." → "The person holds the fruit in the left hand and continues to wash it under the running tap using the right hand."

Row 1 (input and generated videos): Input Image, Ours, DynamiCrafter, PIA, Animate Anything, AVDC, LFDM, CogVideoX, Open-Sora

Row 2 (mask and motion flow): Generated Motion Area Mask, followed by the motion flow for each method in the same order

"Throw paper into bin." → "The person uses the left hand to open the bin and throw the paper into the bin using the right hand."

Row 1 (input and generated videos): Input Image, Ours, DynamiCrafter, PIA, Animate Anything, AVDC, LFDM, CogVideoX, Open-Sora

Row 2 (mask and motion flow): Generated Motion Area Mask, followed by the motion flow for each method in the same order

"Julienne carrot." → "The person holds a carrot on the chopping board with the left hand and uses a knife in the right hand to julienne the carrot."

Row 1 (input and generated videos): Input Image, Ours, DynamiCrafter, PIA, Animate Anything, AVDC, LFDM, CogVideoX, Open-Sora

Row 2 (mask and motion flow): Generated Motion Area Mask, followed by the motion flow for each method in the same order

"Roll dough." → "The person uses both hands to roll the dough into a ball."

Row 1 (input and generated videos): Input Image, Ours, DynamiCrafter, PIA, Animate Anything, AVDC, LFDM, CogVideoX, Open-Sora

Row 2 (mask and motion flow): Generated Motion Area Mask, followed by the motion flow for each method in the same order

"Shake soy milk." → "The person holds a container of soy milk in the left hand and shakes it vigorously."

Row 1 (input and generated videos): Input Image, Ours, DynamiCrafter, PIA, Animate Anything, AVDC, LFDM, CogVideoX, Open-Sora

Row 2 (mask and motion flow): Generated Motion Area Mask, followed by the motion flow for each method in the same order

"Cut eggplant." → "The person holds a cucumber with the left hand and uses a knife in the right hand to cut the cucumber."

Row 1 (input and generated videos): Input Image, Ours, DynamiCrafter, PIA, Animate Anything, AVDC, LFDM, CogVideoX, Open-Sora

Row 2 (mask and motion flow): Generated Motion Area Mask, followed by the motion flow for each method in the same order

"Pour vinegar into bowl." → "The person holds a bottle of vinegar in the left hand and pours it into a bowl, adding it to the mixture."

Row 1 (input and generated videos): Input Image, Ours, DynamiCrafter, PIA, Animate Anything, AVDC, LFDM, CogVideoX, Open-Sora

Row 2 (mask and motion flow): Generated Motion Area Mask, followed by the motion flow for each method in the same order

"Stir the pasta." → "The person holds a wooden spoon in the right hand and continues to stir the pasta in the pan on the hob, while using the left hand to support the pan."

Row 1 (input and generated videos): Input Image, Ours, DynamiCrafter, PIA, Animate Anything, AVDC, LFDM, CogVideoX, Open-Sora

Row 2 (mask and motion flow): Generated Motion Area Mask, followed by the motion flow for each method in the same order

"Drop garlic into fridge." → "The person uses the right hand to open the fridge and drops a pack of garlic from the left hand into the fridge."