Ablation Study on Hand Structure Loss (HSL)
Columns (left to right): Input Image, Without HSL, With HSL
Input Action Description: "Pour vinegar into bowl."
Input Action Description: "Julienne carrot."
Input Action Description: "Stir the pasta."
Input Action Description: "Wash fruit."
This page presents an ablation study of our proposed hand structure loss (HSL), corresponding to Table 2 (No. 4 and No. 5) in the main paper (https://arxiv.org/abs/2412.04189). Each row shows one sample: the input context image (left), the result from the model without HSL (middle), and the result from the model with HSL (right). These results show that the hand structure loss helps produce clearer, more natural, and more consistent hand structures in the generated instructional videos, even for dexterous hand motions.