nusdufv/text-2-video-human-preferences-motion
Source: Hugging Face. Updated 2026-03-26, indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/nusdufv/text-2-video-human-preferences-motion
Resource description:
---
pretty_name: "Human Preference Data for AI Video Generation — Motion Quality (29K Labels, 4 Models)"
language:
- en
license: cc-by-4.0
size_categories:
- 10K<n<100K
task_categories:
- video-classification
- text-to-video
- reinforcement-learning
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*.parquet
tags:
- human-preferences
- video-generation
- preference-data
- human-motion
- rlhf
- reward-model
- text-to-video
- video-quality
- pairwise-comparison
- annotation
- video-evaluation
- video-benchmark
- dpo
- human-feedback
- ai-video
- generative-ai
- sora
- veo
- kling
- grok
- luma
- coherence
- aesthetics
- prompt-adherence
- motion-quality
- temporal-consistency
- video-reward-model
- preference-learning
- video-rlhf
---
# Human Preferences for AI-Generated Video: Motion Quality
<p align="left">
<img src="https://huggingface.co/datasets/datapointai/text-2-video-human-preferences-motion/resolve/main/datapointlogo.png" alt="Datapoint AI" width="300">
</p>
**29,283 pairwise human preference labels** comparing **4 frontier video generation models** on human motion across **3 quality dimensions**, collected from **4,349 real annotators** via [Datapoint AI](https://trydatapoint.com).
This is the largest publicly available human preference dataset focused specifically on **human motion in AI-generated video**.
## Why This Dataset
Video generation models are improving fast, but **evaluating human motion remains unsolved**. Automated judges (VLMs like GPT-4V, Gemini) miss subtle errors in gait, facial expressions, and multi-body coordination that humans catch easily.
This dataset gives you **ground-truth human preferences** you can use to:
- **Train video reward models** for RLHF / DPO / preference optimization
- **Benchmark video generation models** on realistic human motion
- **Calibrate VLM judges** — measure where automated evaluators disagree with humans
- **Study annotation patterns** — inter-annotator agreement, position bias, response time distributions
## Models Compared
| Model | Developer |
|---|---|
| **Grok Imagine** | xAI |
| **Veo 3 Fast** | Google DeepMind |
| **Kling 1.5 Pro** | Kuaishou |
| **Luma Ray 2** | Luma Labs |
## Dataset Structure
The dataset contains 354 aggregated comparison rows, built from 29,283 individual annotations. Each row is one pairwise comparison between two model outputs for the same prompt.
| Field | Description |
|---|---|
| `prompt` | Text prompt used to generate both videos |
| `video1` / `video2` | GIF previews of the generated videos |
| `model1` / `model2` | Which model generated each video |
| `weighted_results1_Coherence` | Fraction of annotators preferring video 1 on coherence |
| `weighted_results2_Coherence` | Fraction preferring video 2 on coherence |
| `weighted_results1_Aesthetic` | Fraction preferring video 1 on aesthetics |
| `weighted_results2_Aesthetic` | Fraction preferring video 2 on aesthetics |
| `weighted_results1_Prompt_Adherence` | Fraction preferring video 1 on prompt faithfulness |
| `weighted_results2_Prompt_Adherence` | Fraction preferring video 2 on prompt faithfulness |
| `detailedResults_*` | Per-annotator votes with timestamps |
| `subcategory` | Motion type: walking, dancing, talking, sports, stationary, multi-person |
| `prompt_id` | Unique prompt identifier (1–60) |
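The paired `weighted_results1_*` / `weighted_results2_*` fields can be collapsed into a hard label per dimension when a downstream method needs a single winner. The sketch below is one way to do that; the `preferred_video` helper, its `margin` tie threshold, and the example values are illustrative assumptions, not part of the dataset.

```python
DIMENSIONS = ("Coherence", "Aesthetic", "Prompt_Adherence")


def preferred_video(row, dimension, margin=0.0):
    """Return 'video1', 'video2', or 'tie' for one dimension of one row.

    `margin` is a hypothetical tie threshold chosen by the user.
    """
    s1 = row[f"weighted_results1_{dimension}"]
    s2 = row[f"weighted_results2_{dimension}"]
    if abs(s1 - s2) <= margin:
        return "tie"
    return "video1" if s1 > s2 else "video2"


# Made-up values for illustration; real rows come from the dataset.
example_row = {
    "weighted_results1_Coherence": 0.64,
    "weighted_results2_Coherence": 0.36,
    "weighted_results1_Aesthetic": 0.50,
    "weighted_results2_Aesthetic": 0.50,
    "weighted_results1_Prompt_Adherence": 0.41,
    "weighted_results2_Prompt_Adherence": 0.59,
}

labels = {d: preferred_video(example_row, d) for d in DIMENSIONS}
print(labels)
```

A non-zero `margin` can be used to drop near-ties, which may be noisy as training signal.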
## Evaluation Dimensions
| Dimension | What annotators judged |
|---|---|
| **Coherence** | Temporal consistency — no flickering, warping, deformation, or physically implausible motion |
| **Aesthetic** | Visual quality — composition, lighting, color, style, production value |
| **Prompt Adherence** | Accuracy — does the video depict what the prompt describes? |
## Motion Categories
| Category | Examples | Why it's hard for AI |
|---|---|---|
| **Walking / Running** | Gaits, jogging, sprinting | Weight shift, foot contact, natural rhythm |
| **Dancing** | Ballet, hip-hop, folk | Complex coordinated movement, full-body flow |
| **Talking / Expressions** | Speaking, singing, laughing | Lip sync, facial micro-movements |
| **Sports / Action** | Martial arts, skateboarding | Fast motion, physics, athletic poses |
| **Stationary** | Meditating, reading, posing | Subtle motion, identity preservation over time |
| **Multi-Person** | Handshakes, sparring, group performance | Two+ bodies, occlusion, interaction physics |
## Key Results
### Overall Win Rates
| Rank | Model | Win Rate | 95% CI |
|---|---|---|---|
| 1 | **Grok Imagine** | 54.7% | [54.0%, 55.5%] |
| 2 | **Veo 3 Fast** | 54.6% | [53.8%, 55.3%] |
| 3 | **Kling 1.5 Pro** | 47.9% | [47.1%, 48.7%] |
| 4 | **Luma Ray 2** | 42.8% | [42.0%, 43.6%] |
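This card does not state how the 95% intervals were computed. As a rough sketch, a win rate and a normal-approximation (Wald) interval can be derived from raw counts as shown below; the choice of interval and the example counts are assumptions made for illustration.

```python
import math


def win_rate_ci(wins, total, z=1.96):
    """Win rate with a normal-approximation (Wald) confidence interval.

    z=1.96 corresponds to a two-sided 95% interval.
    """
    p = wins / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


# Hypothetical counts for illustration only (not taken from the dataset).
rate, low, high = win_rate_ci(wins=550, total=1000)
print(f"{rate:.1%} [{low:.1%}, {high:.1%}]")
```

When reading the table, note that the reported intervals for Grok Imagine and Veo 3 Fast overlap, so these numbers alone do not clearly separate the top two ranks.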
### By Dimension
| Model | Coherence | Aesthetic | Prompt Adherence |
|---|---|---|---|
| Grok Imagine | 53.6% | **55.7%** | 54.7% |
| Veo 3 Fast | 54.5% | 54.7% | 54.5% |
| Kling 1.5 Pro | 48.4% | 48.0% | 47.4% |
| Luma Ray 2 | 43.5% | 41.5% | 43.5% |
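A per-dimension breakdown like this can be approximated from the aggregated rows. The sketch below takes an unweighted mean of each model's preference fraction over the rows it appears in; that averaging scheme is an assumption, and the published figures may instead be computed per annotation. The rows and model names `A`, `B`, `C` are placeholders.

```python
from collections import defaultdict


def mean_preference_by_model(rows, dimension):
    """Average, per model, the fraction of votes its video received."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        # Slot "1" is video 1 / model 1, slot "2" is video 2 / model 2.
        for slot in ("1", "2"):
            model = row[f"model{slot}"]
            totals[model] += row[f"weighted_results{slot}_{dimension}"]
            counts[model] += 1
    return {model: totals[model] / counts[model] for model in totals}


# Made-up rows with placeholder model names, for illustration only.
rows = [
    {"model1": "A", "model2": "B",
     "weighted_results1_Coherence": 0.7, "weighted_results2_Coherence": 0.3},
    {"model1": "B", "model2": "C",
     "weighted_results1_Coherence": 0.6, "weighted_results2_Coherence": 0.4},
    {"model1": "C", "model2": "A",
     "weighted_results1_Coherence": 0.2, "weighted_results2_Coherence": 0.8},
]

rates = mean_preference_by_model(rows, "Coherence")
print(rates)
```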
## Quick Start
```python
from datasets import load_dataset
ds = load_dataset("datapointai/text-2-video-human-preferences-motion")
print(ds["train"][0])
```
### Train a reward model
```python
from datasets import load_dataset

ds = load_dataset("datapointai/text-2-video-human-preferences-motion", split="train")
df = ds.to_pandas()  # requires pandas to be installed

# Each row is one comparison; use the weighted scores as soft labels.
pairs = []
for _, row in df.iterrows():
    prompt = row["prompt"]
    score_a_coherence = row["weighted_results1_Coherence"]  # fraction preferring video 1
    score_b_coherence = row["weighted_results2_Coherence"]  # fraction preferring video 2
    pairs.append((prompt, score_a_coherence, score_b_coherence))

# Use `pairs` as preference data for DPO, reward modeling, etc.
```
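One common way to use soft labels like these is a Bradley-Terry style cross-entropy, where the human preference fraction is the target and the model's predicted probability that video 1 wins is `sigmoid(reward_1 - reward_2)`. The sketch below is a generic formulation, not necessarily how Datapoint AI trains reward models; `reward_1` and `reward_2` stand for the scalar outputs of a hypothetical reward model.

```python
import math


def _softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def soft_bradley_terry_loss(reward_1, reward_2, p_video1):
    """Cross-entropy between the human preference fraction `p_video1`
    and the predicted win probability sigmoid(reward_1 - reward_2)."""
    diff = reward_1 - reward_2
    log_p1 = -_softplus(-diff)  # log sigmoid(diff)
    log_p2 = -_softplus(diff)   # log sigmoid(-diff)
    return -(p_video1 * log_p1 + (1.0 - p_video1) * log_p2)


# Hypothetical reward scores; 0.9 means 90% of annotators preferred video 1.
agrees = soft_bradley_terry_loss(2.0, 0.0, p_video1=0.9)
disagrees = soft_bradley_terry_loss(0.0, 2.0, p_video1=0.9)
print(agrees, disagrees)
```

The loss is lower when the reward ordering agrees with the majority of annotators, and it equals `log(2)` when both rewards are equal.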
## Data Quality
| Metric | Value |
|---|---|
| Total annotations | 29,283 |
| Unique annotators | 4,349 |
| Unique prompts | 60 |
| Pairwise comparisons | 354 |
| Annotations per comparison, per dimension | ~28 (median) |
| Median response time | 14.9 seconds |
| Position bias | 52.8% left / 47.2% right (near 50/50) |
**Position bias control**: Videos were randomly assigned to the left or right position for each comparison. The observed selection rate (52.8% left, 47.2% right) is close to the 50/50 baseline.
**Engagement verification**: The median response time of 14.9s is consistent with annotators watching both videos (each 4–5 seconds) before deciding.
**Annotator diversity**: 4,349 unique annotators with a median of 4 labels each — broad perspectives, low individual bias.
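The headline figures in the table above can be cross-checked with simple arithmetic. The sketch assumes that each annotation is a single vote on one dimension of one comparison, which this card does not state explicitly.

```python
# Figures copied from the Data Quality table above.
total_annotations = 29_283
comparisons = 354
dimensions = 3
unique_annotators = 4_349

per_comparison = total_annotations / comparisons
per_comparison_per_dimension = per_comparison / dimensions
mean_labels_per_annotator = total_annotations / unique_annotators

print(f"annotations per comparison: {per_comparison:.1f}")
print(f"annotations per comparison and dimension: {per_comparison_per_dimension:.1f}")
print(f"mean labels per annotator: {mean_labels_per_annotator:.1f}")
```

Under that assumption, the mean of about 27.6 annotations per comparison and dimension lines up with the reported median of ~28. The mean of about 6.7 labels per annotator sits above the reported median of 4, which would be expected if a minority of annotators contributed many more labels than the rest.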
## Methodology
- **60 prompts** generated with structured diversity across motion categories
- **4 models** evaluated via Fal.ai API (single inference, no cherry-picking)
- **All videos** are 4–5 seconds, 540p–720p, 16:9
- **Mobile-first annotation** through Datapoint AI's consumer app SDK
- **Forced-choice** pairwise comparison with dimension-specific questions
- **Dawid-Skene aggregation** available for consensus estimation
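For readers unfamiliar with Dawid-Skene aggregation, the sketch below shows a minimal two-class version fitted with EM. It is an illustration of the general technique, not Datapoint AI's implementation: the `(item_id, annotator_id, label)` vote format, the smoothing constant, and the example votes are all assumptions, and the layout of the `detailedResults_*` fields is not assumed here.

```python
import math
from collections import defaultdict


def dawid_skene_binary(votes, iterations=20, smoothing=1.0):
    """Two-class Dawid-Skene via EM.

    votes: list of (item_id, annotator_id, label), label 0 or 1
           (for example, 1 = "video 1 preferred").
    Returns {item_id: posterior probability that the true label is 1}.
    """
    by_item = defaultdict(list)
    for item, annotator, label in votes:
        by_item[item].append((annotator, label))

    # Initialise each posterior with the item's raw vote share.
    posterior = {i: sum(l for _, l in v) / len(v) for i, v in by_item.items()}

    for _ in range(iterations):
        # M-step: class prior and smoothed per-annotator reliabilities.
        prior = sum(posterior.values()) / len(posterior)
        prior = min(max(prior, 1e-6), 1 - 1e-6)
        hit1 = defaultdict(lambda: smoothing)       # voted 1, truth 1
        seen1 = defaultdict(lambda: 2 * smoothing)  # any vote, truth 1
        hit0 = defaultdict(lambda: smoothing)       # voted 0, truth 0
        seen0 = defaultdict(lambda: 2 * smoothing)  # any vote, truth 0
        for item, item_votes in by_item.items():
            t = posterior[item]
            for annotator, label in item_votes:
                seen1[annotator] += t
                seen0[annotator] += 1 - t
                if label == 1:
                    hit1[annotator] += t
                else:
                    hit0[annotator] += 1 - t

        # E-step: update each item's posterior from the reliabilities.
        for item, item_votes in by_item.items():
            log1, log0 = math.log(prior), math.log(1 - prior)
            for annotator, label in item_votes:
                sens = hit1[annotator] / seen1[annotator]
                spec = hit0[annotator] / seen0[annotator]
                log1 += math.log(sens if label == 1 else 1 - sens)
                log0 += math.log(1 - spec if label == 1 else spec)
            posterior[item] = 1 / (1 + math.exp(min(log0 - log1, 700.0)))
    return posterior


# Made-up votes: annotators x and y agree, z always votes the opposite way.
votes = [
    ("a", "x", 1), ("a", "y", 1), ("a", "z", 0),
    ("b", "x", 0), ("b", "y", 0), ("b", "z", 1),
    ("c", "x", 1), ("c", "y", 1), ("c", "z", 0),
    ("d", "x", 0), ("d", "y", 0), ("d", "z", 1),
]
consensus = dawid_skene_binary(votes)
print(consensus)
```

Compared with a plain majority vote, this weights each annotator by their estimated reliability, so a consistently contrarian annotator ends up reinforcing the consensus rather than diluting it.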
## Compared to Other Datasets
| Dataset | Labels | Focus | Models | Dimensions |
|---|---|---|---|---|
| **This dataset** | **29,283** | **Human motion** | **4 frontier (2025)** | **3** |
| Rapidata text-2-video | 2,570 | General video | 4 | 3 |
| VideoGen-Eval | ~5,000 | General video | 6 | 1 |
## Get Custom Human Preference Data
Need preference labels for **your** model, domain, or evaluation criteria?
Datapoint AI runs the same annotation pipeline used to create this dataset — but customized to your specs:
- **Your models** — any video, image, or text generation model
- **Your prompts** — domain-specific evaluation sets
- **Your dimensions** — custom quality criteria beyond coherence/aesthetics/adherence
- **Scale** — from 1K to 1M+ labels, median 24-hour turnaround
- **No professional annotator bias** — real users in a consumer app, not Mechanical Turk
🎓 **First dataset free for university researchers and early-stage startups.**
👉 **[Get started at trydatapoint.com](https://trydatapoint.com)** or email **sales@trydatapoint.com**
## Citation
```bibtex
@dataset{datapointai_vidprefmotion_2026,
title={Human Preference Data for AI Video Generation: Motion Quality},
author={Datapoint AI},
year={2026},
url={https://huggingface.co/datasets/datapointai/text-2-video-human-preferences-motion},
note={29,283 pairwise human preference labels for AI-generated human motion video}
}
```
## License
CC-BY-4.0 — free for research and commercial use with attribution.
## About Datapoint AI
[Datapoint AI](https://trydatapoint.com) collects human preference data at scale through a mobile-first annotation pipeline embedded in consumer apps. We replace mobile ads with data labeling tasks — real users, real preferences, no professional annotator bias.
For custom evaluation studies, higher-scale labeling, or API access: **[trydatapoint.com](https://trydatapoint.com)**
Provided by: nusdufv



