
How we tested and evaluated AI-generated dance videos

By Mohamed Al Elew, Khari Johnson and Levi Sumagaysay, CalMatters

"A
Zion Harris, center, rehearses for Jeté, a monthly dance showcase at Heart WeHo in West Hollywood in Los Angeles, on Sept. 19, 2024. Photo by Alisha Jucevic for CalMatters

This story was originally published by CalMatters. Sign up for their newsletters.

Artificial intelligence models can produce lifelike video footage with a simple text prompt. But these tools still struggle with generating realistic videos of complex natural movements, like human dance. 

When CalMatters and The Markup asked dancers and choreographers whether AI could disrupt their industry, most concluded that human dancers could not be replaced.

For the most part, we found that they were right. We tested nine different cultural, modern and popular dance styles using four commercially available generative AI video models, generating a total of 36 videos. We found that the latest commercially available AI video generation models produced convincingly lifelike videos of people dancing — but none produced a figure performing the prompted dance.

About a third of the generated videos exhibited inconsistencies in a subject’s appearance from frame to frame, along with abnormalities in movement and limbs. These issues were markedly less frequent and less severe than in our initial testing in late 2024.

Methodology

Define task

CalMatters and The Markup tested four commercial video generation models produced by major tech companies to create video clips of traditional and popular dance.

We limited our tests to consumer-facing, closed-source generative video tools because they are the most readily available to everyday users and tend to perform better than open-source models. We tested Sora 2 by OpenAI, Veo 3.1 by Google, Kling 2.5 by Kuaishou, and Hailuo 2.3 by MiniMax.

Prepare prompts

We drafted nine video prompts testing a variety of dances in different settings, such as dance floors, stages, bedrooms, studios, cultural events, public squares and classrooms. We tested for popular, modern and traditional cultural dance styles, including the Macarena, the Mashed Potato, folklorico and popular TikTok dances. See the Appendix for more details.

We varied the level of specificity to test whether identifying the dance by name was enough to generate a video of the desired motion, or whether explicitly specifying the exact physical movements improved output. 

Before finalizing the list of prompts, we submitted them to ChatGPT for edits based on the Sora 2 Prompting Guide. See Limitations: Prompt Optimization for more details.
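
The difference in specificity is easiest to see in the shape of the prompt data itself. Below is a minimal, illustrative Python sketch pairing a name-only prompt with a movement-level prompt for the same dance. The wording is a paraphrased placeholder of our own, not one of the actual prompts used in testing (those appear in the Appendix).

```python
# Illustrative sketch of the two specificity levels we varied.
# The prompt text below is a paraphrased placeholder, not one of
# the actual prompts used in testing (see the Appendix for those).
PROMPT_VARIANTS = [
    {
        "dance": "Macarena",
        "setting": "public square",
        "name_only": (
            "A man in a suit dances the Macarena in a public square. "
            "Medium shot, locked-off camera."
        ),
        "movement_spec": (
            "A man in a suit extends his right arm palm-down, then his left, "
            "flips both palms up, touches his shoulders, head and hips in "
            "sequence, then hops a quarter turn and repeats in rhythm. "
            "Medium shot, locked-off camera, public square."
        ),
    },
    # ...one entry per dance style (nine in our tests)
]
```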

Submit prompts for video generation

Each prompt was submitted once, using each model’s default settings for generating landscape-oriented videos. Three prompts submitted to Sora 2 had to be edited to remove words that triggered OpenAI’s filter, which blocks prompts that may violate “guardrails concerning similarity to third-party content.” For example, Sora 2 flagged prompts referencing specific years, popular music artists and banned words. One blocked prompt asked for a video of a politician dancing the Macarena; replacing “politician in a suit” with “man in a suit” bypassed the guardrail. Veo 3.1 flagged similar prompts when we submitted them via Gemini or Flow, but not when we submitted them directly to the Veo 3.1 API.
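
As a rough illustration of this step, here is a minimal Python sketch of a one-shot submission loop with guardrail handling. The endpoint URLs, request fields and rejection status code are hypothetical placeholders; each real service (Sora 2, Veo 3.1, Kling 2.5, Hailuo 2.3) has its own SDK, authentication and request shape.

```python
"""Minimal sketch of the submission step. The endpoints and request
shape here are hypothetical placeholders, not real API definitions."""
import requests

# Hypothetical per-model endpoints; not real API URLs.
MODEL_ENDPOINTS = {
    "sora-2": "https://api.example.com/sora-2/videos",
    "veo-3.1": "https://api.example.com/veo-3.1/videos",
    "kling-2.5": "https://api.example.com/kling-2.5/videos",
    "hailuo-2.3": "https://api.example.com/hailuo-2.3/videos",
}

def submit_once(model: str, prompt: str) -> dict:
    """Submit a prompt exactly once, on default settings, requesting
    only a landscape orientation."""
    resp = requests.post(
        MODEL_ENDPOINTS[model],
        json={"prompt": prompt, "orientation": "landscape"},
        timeout=600,
    )
    if resp.status_code == 422:  # hypothetical guardrail-rejection code
        # Record the block so the prompt can be reworded by hand
        # (e.g., "politician in a suit" -> "man in a suit")
        # and resubmitted.
        return {"model": model, "prompt": prompt, "blocked": True}
    resp.raise_for_status()
    return {"model": model, "prompt": prompt,
            "blocked": False, "video": resp.json()}
```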

Evaluate generated videos

We evaluated the generated videos on six different criteria related to prompt alignment and video consistency:

  1. Did the main subject dance in any way?
  2. Did the main subject perform the specific dance we prompted for?
  3. Did the main subject maintain the same physical appearance throughout the video?
  4. Did the main subject produce realistic motions based on human physiology?
  5. Did the scene and setting match the prompt?
  6. Did the camera match the prompted camera angle and position?

Each of the above criteria was assessed as a pass or a fail by a single reviewer, with the assistance of a second reviewer when needed. The generated videos of cultural dances were reviewed for accuracy by dancers familiar with them.
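
For bookkeeping, the rubric can be represented as one record per generated video. The Python sketch below is our own illustration of such a record; the field names are ours and do not come from a published codebase.

```python
# One evaluation record per generated video, mirroring the six
# pass/fail criteria above. Field names are illustrative.
from dataclasses import dataclass

@dataclass
class VideoEvaluation:
    model: str
    prompt_id: int
    subject_dances: bool          # 1. danced in any way
    correct_dance: bool           # 2. performed the prompted dance
    consistent_appearance: bool   # 3. same appearance throughout
    realistic_motion: bool        # 4. physiologically plausible motion
    scene_matches: bool           # 5. scene/setting matched the prompt
    camera_matches: bool          # 6. camera angle/position matched

    def passed_all(self) -> bool:
        return all((self.subject_dances, self.correct_dance,
                    self.consistent_appearance, self.realistic_motion,
                    self.scene_matches, self.camera_matches))
```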

Results

Of the 36 videos generated, all but one showed a figure dancing. The exception, produced by Kling 2.5, instead showed the bottom half of a figure performing side lunges.

No video produced the actual dance we prompted for. For the Cahuilla Band of Indians bird dance, tribal member Emily Clarke said, “None of these depictions are anywhere close to bird dancing, in my opinion.” The videos for the Horton dance did not show the specific dance movement we prompted for, but choreographer Emma Andre said she found the depiction by Veo 3.1 to be “staggeringly lifelike.” 

For the remaining pop culture dances, we compared the generated videos to videos we found on YouTube to evaluate whether the dance was accurate.

Eleven of the 36 videos exhibited issues with either motion or appearance consistency. These included sudden changes in clothing, hair or limb structure, as well as physically impossible motion, such as heads rotating on a separate axis from their bodies and limbs liquefying and reconstituting.
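
The consistency tally follows directly from the rubric: a video counts as flawed if it fails either the appearance criterion or the motion criterion. A minimal Python sketch, with illustrative stand-in records rather than our actual 36 evaluations:

```python
# Sketch of the consistency tally. Records here are illustrative
# stand-ins, not the real evaluation data.
records = [
    {"model": "Kling 2.5", "consistent_appearance": False, "realistic_motion": True},
    {"model": "Veo 3.1", "consistent_appearance": True, "realistic_motion": True},
    # ...one record per generated video (36 total in our tests)
]

flawed = [r for r in records
          if not (r["consistent_appearance"] and r["realistic_motion"])]
rate = len(flawed) / len(records)
print(f"{len(flawed)} of {len(records)} videos ({rate:.0%}) showed issues")
```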

See the Appendix for full results and videos.

Limitations

Image-to-video generation

We did not use images to prompt the models. Image-to-video generation involves uploading a static image along with a text prompt, producing a dynamic video from both; it is an advertised use case for some models, which can turn user-submitted photos into dance videos.

Multi-subject dance videos

We did not prompt for videos with multiple dancers, even though some of the dances are often performed in groups. We limited our video prompts to showcase a single dancer to avoid ambiguity around whether a failed evaluation was due to issues with generating complex human movement or a realistic multi-subject video.

Prompt optimization

We did not optimize prompts on a per-model basis. Each company publishes its own prompt guide. (See the guidelines for Veo 3.1, Hailuo 2.3, Kling 2.5, and Sora 2.) Instead, we used ChatGPT 5 to standardize prompts across models to align with the Sora 2 Prompting Guide. It’s possible that optimizing prompts for each model according to its specific guide could have yielded more accurate results.

We also tried to improve the quality of the videos by giving detailed, step-by-step instructions for each dance. However, these instructions did not produce videos that were any more accurate than those produced with simpler prompts.

Human-motion generation models

We did not test generative models focused on human-motion generation. These models are used in animation and video games to generate and capture natural human motion. Researchers train some state-of-the-art academic models in this space using large datasets, including footage of popular dances on TikTok. Although these models may perform better than the consumer-facing models we tested, they require technical expertise and substantial computational resources to run.  

Sample size

Our evaluation is limited to the generated videos for nine prompts; it is not a comprehensive assessment of the models used. Some video generation benchmarks, such as those from Tencent’s AI Lab, use several hundred prompts to test capabilities such as complex motion, multiple subjects and creative style.

Acknowledgements

We thank Yuhang Yang (University of Science and Technology of China) and Xiaodong Cun (Great Bay University) for reviewing an early draft of this methodology.

Appendix

View evaluations by prompt or evaluations by model.
