nvidia/LongGroundedThoughts-video-datagen
Source: Hugging Face. Updated 2026-03-23; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/nvidia/LongGroundedThoughts-video-datagen
Resource description:
---
license: apache-2.0
tags:
- video-understanding
- mcq-generation
- chain-of-thought
- temporal-reasoning
- multimodal
language:
- en
size_categories:
- 100K<n<1M
---
# LongGroundedThoughts — Video Data Generation Pipeline
Generate temporally grounded multiple-choice questions (MCQs) with chain-of-thought (CoT) reasoning from video datasets. Based on [Long Grounded Thoughts: Synthesizing Visual Problems and Reasoning Chains at Scale](https://arxiv.org/abs/2511.05705).
## Pipeline Overview
```
Stage 0 (Observe) → Extract video events: speech (Whisper), scene cuts, motion
↓
Stage 1 (Ask) → Generate MCQs: direct extraction + event-grounded generation
↓
Stage 2 (Think) → Simple CoTs: Qwen2.5-VL-Instruct (10 per MCQ)
↓
Stage 3 (Think More) → Extended CoTs: DeepSeek-R1-Distill-Qwen-32B
↓
Stage 4 (Ground) → Event-Aware CoTs: reasoning with temporal evidence
↓
→ SFT + DPO training datasets
```
## Key Features
- **5 video datasets**: LLaVA-Video-178K, NExT-QA, CLEVRER, PE-Video, Ego4D
- **196K MCQs** extracted from LLaVA-Video-178K (all 8 duration/source configs)
- **Event-grounded MCQ generation** using Stage 0 metadata (speech, scene cuts, motion)
- **MCQ rewriting** — enhances existing questions with temporal grounding
- **Diverse question types** enforced per video: temporal ordering, speech-visual alignment, scene transition, cause-effect, state change, audio-visual
- **Event-aware reasoning traces** — Stage 4 injects temporal evidence into CoT chains
- **SFT + DPO** output formats compatible with LLaMA-Factory
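As a rough illustration of the two output formats, the sketch below builds records in LLaMA-Factory's alpaca-style layout (`instruction`/`input`/`output` for SFT, `instruction`/`input`/`chosen`/`rejected` for preference data). The pairing rule (a CoT that reaches the correct answer becomes `chosen`, one that reaches a wrong answer becomes `rejected`), the helper names, and the `is_correct` field are assumptions made for illustration, not the pipeline's confirmed logic.

```python
import json


def to_sft_record(question: str, cot: str) -> dict:
    """Alpaca-style SFT record: the reasoning chain is the target output."""
    return {"instruction": question, "input": "", "output": cot}


def to_dpo_records(question: str, cots: list[dict]) -> list[dict]:
    """Pair correct CoTs with incorrect ones (assumed pairing rule)."""
    correct = [c["text"] for c in cots if c["is_correct"]]
    wrong = [c["text"] for c in cots if not c["is_correct"]]
    # zip() stops at the shorter list, so unmatched CoTs are simply dropped.
    return [
        {"instruction": question, "input": "", "chosen": good, "rejected": bad}
        for good, bad in zip(correct, wrong)
    ]


# Toy input; the texts and correctness flags are made up for the example.
question = "What happens AFTER the person picks up the broom? (A) ... (B) ..."
cots = [
    {"text": "The person sweeps the floor next, so the answer is A.", "is_correct": True},
    {"text": "The person leaves the room, so the answer is B.", "is_correct": False},
    {"text": "Sweeping follows the pickup, so the answer is A.", "is_correct": True},
]

sft = [to_sft_record(question, c["text"]) for c in cots if c["is_correct"]]
dpo = to_dpo_records(question, cots)
print(json.dumps(dpo[0], indent=2))
```

With two correct CoTs and one incorrect CoT, this yields two SFT records and a single preference pair.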
## Quick Start
```bash
# 1. Set up conda environment
conda create -n qwen_omni python=3.10
conda activate qwen_omni
pip install torch vllm pandas datasets tqdm jinja2 fire pandarallel
pip install "scenedetect[opencv]" openai-whisper llamafactory
# 2. Download annotations (all 8 configs, ~2GB)
python main.py download_llava_video \
--output_dir=/workspace/data/llava_video --config_name=all
# 3. Download videos (~890GB from HuggingFace archives)
bash download_videos.sh
# 4. Stage 0: Extract video events (GPU recommended for Whisper)
python stage_0_video_analysis/extract_events.py \
--output_dir outputs/video_events \
--whisper_model large-v3
# 5. Stage 1: Extract MCQs + generate event-grounded MCQs
python main.py generate_llava_video_mcq \
--data_dir /workspace/data/llava_video --config_name all
python -m stage_1_llava_video.generate_grounded_mcq \
--events_dir outputs/video_events --mode vl --action both
# 6. Stage 2: Simple CoT generation
python main.py generate_simple_cot long_grounded_thoughts_stage_1/llava_video_all
python main.py collect_simple_cot long_grounded_thoughts_stage_1/llava_video_all llava_video_all
# 7. Stage 3: Extended CoT generation
for PHRASE in "Wait," "Hmm," "Alternatively,"; do
python main.py generate_extended_cot "|${PHRASE}|" \
long_grounded_thoughts_stage_2_thought_expansion/llava_video_all
done
python main.py collect_extended_cot llava_video_all
# 8. Stage 4: Event-aware extended CoT
export VIDEO_EVENTS_DIR=outputs/video_events
for PHRASE in "Wait," "Hmm," "Alternatively,"; do
python main.py generate_event_aware_cot "|${PHRASE}|" \
long_grounded_thoughts_stage_2_thought_expansion/llava_video_all
done
python main.py collect_event_aware_cot llava_video_all
```
## Stage 0: Video Event Extraction
Extracts structured temporal metadata from each video:
| Component | Model | Output |
|-----------|-------|--------|
| Speech transcription | Whisper large-v3 | Timestamped dialogue with language detection |
| Scene boundaries | PySceneDetect | Cut timestamps |
| Motion analysis | Frame differencing | Per-segment activity level |
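One common way to turn frame differencing into a per-segment activity level is sketched below in pure Python on tiny synthetic grayscale frames. The 5-second segment length mirrors the example output, but the thresholds, label names, and function names are hypothetical; `extract_events.py` may compute this differently.

```python
from statistics import mean


def frame_diff(a, b):
    """Mean absolute pixel difference between two equal-sized grayscale frames."""
    return mean(abs(x - y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


def segment_activity(frames, fps, segment_s=5.0, high=30.0, low=5.0):
    """Average consecutive-frame differences inside fixed-length segments."""
    per_segment = {}
    for i in range(1, len(frames)):
        seg = int((i / fps) // segment_s)  # index of the segment frame i falls in
        per_segment.setdefault(seg, []).append(frame_diff(frames[i - 1], frames[i]))
    result = []
    for seg, diffs in sorted(per_segment.items()):
        score = mean(diffs)
        # Thresholds are illustrative assumptions, not values from the pipeline.
        level = "high" if score >= high else "low" if score <= low else "medium"
        result.append({
            "start_t": seg * segment_s,
            "end_t": (seg + 1) * segment_s,
            "pixel_diff": round(score, 1),
            "level": level,
        })
    return result


# Synthetic 2x2 frames at 1 fps: five static frames, then five flickering ones.
dark = [[0, 0], [0, 0]]
bright = [[100, 100], [100, 100]]
frames = [dark] * 5 + [bright, dark, bright, dark, bright]
activity = segment_activity(frames, fps=1)
for seg in activity:
    print(seg)
```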
Hallucination filtering for Whisper: checks `no_speech_prob`, `avg_logprob`, and `compression_ratio` to discard phantom transcriptions on silent audio.
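A minimal sketch of such a filter is shown below. It operates on segment dictionaries carrying the three fields named above, which Whisper attaches to each transcribed segment. The thresholds are the defaults of Whisper's `transcribe` function (0.6, -1.0, 2.4); the way the checks are combined and the function names are assumptions, not the exact rule used in `extract_events.py`.

```python
def is_probable_hallucination(seg, no_speech_thr=0.6, logprob_thr=-1.0, compression_thr=2.4):
    """Flag a Whisper segment that looks like a phantom transcription."""
    # Likely silence: the model thinks there is no speech AND is unsure of its text.
    silent = seg["no_speech_prob"] > no_speech_thr and seg["avg_logprob"] < logprob_thr
    # Highly compressible text usually means the decoder got stuck repeating itself.
    repetitive = seg["compression_ratio"] > compression_thr
    return silent or repetitive


def filter_segments(segments):
    """Keep only segments that pass the hallucination checks."""
    return [s for s in segments if not is_probable_hallucination(s)]


# Toy segments; the scores are invented for the example.
segments = [
    {"text": "Put it on the table", "no_speech_prob": 0.02, "avg_logprob": -0.25, "compression_ratio": 1.3},
    {"text": "Thank you.", "no_speech_prob": 0.92, "avg_logprob": -1.4, "compression_ratio": 0.9},
    {"text": "la la la la la la", "no_speech_prob": 0.10, "avg_logprob": -0.6, "compression_ratio": 3.1},
]
kept = filter_segments(segments)
print([s["text"] for s in kept])
```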
```bash
# GPU-accelerated (recommended, ~1s/video on A100)
CUDA_VISIBLE_DEVICES=1 python stage_0_video_analysis/extract_events.py \
--whisper_model large-v3
# CPU-only (~4s/video)
python stage_0_video_analysis/extract_events.py \
--whisper_model base
```
Output: `outputs/video_events/{video_id}.json`
```json
{
"video_path": "/path/to/video.mp4",
"duration": 38.1,
"scene_boundaries": [{"timestamp": 8.2}, {"timestamp": 15.1}],
"speech_segments": [
{"text": "Put it on the table", "start_t": 5.2, "end_t": 6.8, "language": "en"}
],
"segment_captions": [
{"text": "significant action (high motion, pixel_diff=42)", "start_t": 5.0, "end_t": 10.0}
]
}
```
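Downstream stages need these three event lists in temporal order. The sketch below merges them into a single timeline, using only the field names from the example above; the `build_timeline` helper and the `kind` labels are illustrative names, not part of the repository.

```python
import json


def build_timeline(events: dict) -> list[tuple[float, str, str]]:
    """Merge scene cuts, speech, and visual segments into one list sorted by start time."""
    timeline = []
    for cut in events.get("scene_boundaries", []):
        timeline.append((cut["timestamp"], "scene_cut", ""))
    for seg in events.get("speech_segments", []):
        timeline.append((seg["start_t"], "speech", seg["text"]))
    for seg in events.get("segment_captions", []):
        timeline.append((seg["start_t"], "visual", seg["text"]))
    return sorted(timeline)


# Same content as the example event file; in practice, read it with json.load().
events = json.loads("""
{
  "video_path": "/path/to/video.mp4",
  "duration": 38.1,
  "scene_boundaries": [{"timestamp": 8.2}, {"timestamp": 15.1}],
  "speech_segments": [
    {"text": "Put it on the table", "start_t": 5.2, "end_t": 6.8, "language": "en"}
  ],
  "segment_captions": [
    {"text": "significant action (high motion, pixel_diff=42)", "start_t": 5.0, "end_t": 10.0}
  ]
}
""")

timeline = build_timeline(events)
for t, kind, text in timeline:
    print(f"{t:>5.1f}s  {kind:<9}  {text}")
```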
## Stage 1: Event-Grounded MCQ Generation
Three actions:
| Action | Description |
|--------|-------------|
| `generate` | Create NEW MCQs from event metadata + video frames |
| `rewrite` | Enhance EXISTING MCQs with temporal grounding |
| `both` | Generate new + rewrite existing |
Question diversity is enforced per video — one question per available event type:
| Type | Requires | Example |
|------|----------|---------|
| `temporal_ordering` | Visual segments | "What happens AFTER the person picks up the broom?" |
| `speech_visual_alignment` | Speech | "What is happening when someone says 'put it down' at 5.2s?" |
| `scene_transition` | Scene cuts | "What changes after the scene cut at 12.3s?" |
| `cause_effect` | Events | "WHY does the person turn around?" |
| `state_change` | Visual segments | "How does the scene change between 0s and 30s?" |
| `audio_visual` | Audio events | "What sound is heard during the cooking scene?" |
When a video's metadata is sparse, generation falls back to the question types whose required events are available.
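The selection and fallback behaviour can be sketched as follows. The mapping from question type to Stage 0 field follows the table above, with three assumptions labeled in the code: `audio_events` is a hypothetical key (the example event file shows no audio-event field), `cause_effect` is treated as available whenever any event exists, and leftover slots are filled by cycling through the available types.

```python
# Question type -> Stage 0 field it needs. None means "any event will do" (assumed).
REQUIREMENTS = {
    "temporal_ordering": "segment_captions",
    "speech_visual_alignment": "speech_segments",
    "scene_transition": "scene_boundaries",
    "cause_effect": None,
    "state_change": "segment_captions",
    "audio_visual": "audio_events",  # hypothetical key, not in the example JSON
}
EVENT_KEYS = ("segment_captions", "speech_segments", "scene_boundaries", "audio_events")


def available_question_types(events: dict) -> list[str]:
    """Return the question types whose required metadata is present and non-empty."""
    has_any_event = any(events.get(key) for key in EVENT_KEYS)
    types = []
    for qtype, key in REQUIREMENTS.items():
        if (key is None and has_any_event) or (key is not None and events.get(key)):
            types.append(qtype)
    return types


def plan_questions(events: dict, num_questions: int) -> list[str]:
    """One question per available type first, then cycle to fill the quota (assumed)."""
    types = available_question_types(events)
    if not types:
        return []
    return [types[i % len(types)] for i in range(num_questions)]


# A sparse video: speech only, no scene cuts or visual segments.
sparse = {"speech_segments": [{"text": "put it down", "start_t": 5.2, "end_t": 6.0}]}
plan = plan_questions(sparse, num_questions=5)
print(plan)
```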
```bash
python -m stage_1_llava_video.generate_grounded_mcq \
--events_dir outputs/video_events \
--mode vl \
--action both \
--num_questions_per_video 5
```
## Stage 4: Event-Aware Extended CoT
Injects video event metadata into the reasoning chain before the cognitive continuation phrase:
```
[Simple CoT: "The person appears to be cooking based on the visible pots..."]
[Video event metadata (45s video):]
Speech: 5.2s-6.8s: "Add the salt now"
Visual: 15.0s-20.0s: significant action (high motion)
Wait, considering the speech at 5.2s where they say "add the salt now"
and the significant activity change at 15.0s, this confirms the person
is actively cooking and following a recipe...
```
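A minimal sketch of assembling such a prefix is shown below: the simple CoT, then a formatted event block, then the continuation phrase that the Stage 4 model is expected to continue from. The function names, the `max_items` cap, and the exact line layout are assumptions modeled on the example above, not the repository's Jinja2 templates.

```python
def format_events(events: dict, max_items: int = 5) -> str:
    """Render Stage 0 metadata as the bracketed block shown in the example."""
    lines = [f"[Video event metadata ({events['duration']:.0f}s video):]"]
    for s in events.get("speech_segments", [])[:max_items]:
        lines.append(f'Speech: {s["start_t"]}s-{s["end_t"]}s: "{s["text"]}"')
    for v in events.get("segment_captions", [])[:max_items]:
        lines.append(f'Visual: {v["start_t"]}s-{v["end_t"]}s: {v["text"]}')
    return "\n".join(lines)


def build_event_aware_prefix(simple_cot: str, events: dict, phrase: str = "Wait,") -> str:
    """Simple CoT + event block + continuation phrase, left open for the model."""
    return f"{simple_cot}\n{format_events(events)}\n{phrase}"


events = {
    "duration": 45.0,
    "speech_segments": [{"text": "Add the salt now", "start_t": 5.2, "end_t": 6.8}],
    "segment_captions": [
        {"text": "significant action (high motion)", "start_t": 15.0, "end_t": 20.0}
    ],
}
simple_cot = "The person appears to be cooking based on the visible pots..."
prefix = build_event_aware_prefix(simple_cot, events, phrase="Wait,")
print(prefix)
```

Running the same function once per phrase ("Wait,", "Hmm,", "Alternatively,") gives one prefix per continuation, matching the loop in the Quick Start.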
The collect step compares accuracy WITH vs WITHOUT event context to measure the impact of temporal grounding.
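The comparison amounts to computing answer accuracy over two sets of reasoning traces for the same MCQs. The sketch below assumes each record holds a `predicted` and an `answer` option letter; those field names and the toy records are illustrative only and are not results from the pipeline.

```python
def accuracy(records: list[dict]) -> float:
    """Fraction of records whose predicted option matches the ground-truth answer."""
    if not records:
        return 0.0
    return sum(r["predicted"] == r["answer"] for r in records) / len(records)


def grounding_delta(without_events: list[dict], with_events: list[dict]) -> dict:
    """Accuracy without and with event context, plus the difference."""
    base = accuracy(without_events)
    grounded = accuracy(with_events)
    return {"without": base, "with": grounded, "delta": grounded - base}


# Toy records for four MCQs (invented values).
without_events = [
    {"predicted": "A", "answer": "A"},
    {"predicted": "B", "answer": "C"},
    {"predicted": "D", "answer": "D"},
    {"predicted": "A", "answer": "B"},
]
with_events = [
    {"predicted": "A", "answer": "A"},
    {"predicted": "C", "answer": "C"},
    {"predicted": "D", "answer": "D"},
    {"predicted": "A", "answer": "B"},
]
report = grounding_delta(without_events, with_events)
print(report)
```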
## Project Structure
```
stage_0_video_analysis/ # Video event extraction
extract_events.py # Whisper + PySceneDetect + frame analysis
DESIGN.md # Research design document
stage_1_clevrer/ # CLEVRER MCQ generation
stage_1_ego4d/ # Ego4D MCQ generation
stage_1_llava_video/ # LLaVA-Video MCQ extraction
main.py # Direct MCQ conversion
generate_grounded_mcq.py # Event-grounded MCQ generation + rewriting
download.py # HuggingFace downloader
stage_1_nextqa/ # NExT-QA MCQ generation
stage_1_pe_video/ # PE-Video MCQ generation
stage_2_simple_cot/ # Simple CoT (Qwen2.5-VL-Instruct)
stage_3_expand_cot/ # Extended CoT (DeepSeek-R1-32B)
stage_4_event_aware_cot/ # Event-aware extended CoT
templates/ # Jinja2 prompt templates
main.py # CLI entry point (fire)
utils.py # Shared utilities
download_videos.sh # Download all 182 video archives (~890GB)
run_full_scale_pipeline.sh # End-to-end pipeline script
outputs/ # Generated data
llava_video_mcq_all.csv # 196K extracted MCQs
grounded_mcq.csv # Event-grounded MCQs
video_events/ # Stage 0 event JSONs
stage_2_simple_cot/ # Stage 2 intermediates
stage_4_event_aware_cot/ # Stage 4 intermediates
sft_*.json / dpo_*.json # Training datasets
```
## GPU Requirements
| Stage | Model | GPU Memory | Notes |
|-------|-------|-----------|-------|
| 0 | Whisper large-v3 | ~10 GB | ~1s/video on A100 |
| 1 (VL) | Qwen2.5-VL-7B | ~15 GB | TP=1 on A100 |
| 2 | Qwen2.5-VL-7B | ~15 GB | TP=1, processes video frames |
| 3 | DeepSeek-R1-32B | ~65 GB | TP=2, text-only |
| 4 | DeepSeek-R1-32B | ~65 GB | TP=2, text-only + event context |
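The figures for the two language models are roughly what 16-bit weights alone would occupy. The back-of-the-envelope sketch below assumes nominal parameter counts (7B and 32B) and 2 bytes per parameter; it ignores KV cache and activations, so treat it as a lower bound rather than a measurement.

```python
def weight_memory_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate weight footprint: 1e9 params * bytes per param / 1e9 bytes per GB."""
    return params_billion * bytes_per_param


# (model, nominal parameters in billions, tensor-parallel size from the table)
models = [("Qwen2.5-VL-7B", 7, 1), ("DeepSeek-R1-32B", 32, 2)]
for name, params, tp in models:
    total = weight_memory_gb(params)
    print(f"{name}: ~{total:.0f} GB of weights, ~{total / tp:.0f} GB per GPU at TP={tp}")
```

Under these assumptions the 32B model's weights split to about 32 GB per GPU at TP=2, leaving the remainder of each card for KV cache.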
## Environment Variables
```bash
export QWEN2_5_VL_INSTRUCT_PATH=/path/to/Qwen2.5-VL-7B-Instruct
export R1_DISTILLED_QWEN_32_B=/path/to/DeepSeek-R1-Distill-Qwen-32B
export LLAMAFACTORY_DIR=/path/to/LLaMA-Factory
export VIDEO_EVENTS_DIR=outputs/video_events
export VLLM_TENSOR_PARALLEL_SIZE=1
export VLLM_GPU_MEMORY_UTILIZATION=0.85
export VLLM_MAX_MODEL_LEN=16384
```
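The sketch below shows one way a stage script might read the three `VLLM_*` variables. Falling back to the example values above when a variable is unset is an assumption, as are the `VllmSettings` and `load_vllm_settings` names.

```python
import os
from dataclasses import dataclass


@dataclass
class VllmSettings:
    tensor_parallel_size: int
    gpu_memory_utilization: float
    max_model_len: int


def load_vllm_settings(env=os.environ) -> VllmSettings:
    """Parse the VLLM_* variables, using the README's example values as defaults (assumed)."""
    return VllmSettings(
        tensor_parallel_size=int(env.get("VLLM_TENSOR_PARALLEL_SIZE", "1")),
        gpu_memory_utilization=float(env.get("VLLM_GPU_MEMORY_UTILIZATION", "0.85")),
        max_model_len=int(env.get("VLLM_MAX_MODEL_LEN", "16384")),
    )


# Example: override only the tensor-parallel size, as Stages 3 and 4 would need.
settings = load_vllm_settings({"VLLM_TENSOR_PARALLEL_SIZE": "2"})
print(settings)
```

The three fields share their names with keyword arguments of vLLM's `LLM` constructor, so they can be passed through directly. Note that the GPU Requirements table lists TP=2 for Stages 3 and 4, so `VLLM_TENSOR_PARALLEL_SIZE` has to be raised from the value of 1 shown here before running those stages.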
## Current Data Scale
| Subset or stage | MCQs | Videos |
|---------|------|--------|
| LLaVA-Video-178K (all configs) | 196,192 | 49,265 |
| With matching video files | 176,769 | 44,406 |
| Stage 2 CoTs generated | 54,840 | 1,391 |
| Stage 0 events extracted | in progress | 1,085 pilot |
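For reference, the share of extracted data that has a matching video file follows directly from the first two rows of the table. The short calculation below uses only those four numbers.

```python
mcqs_total, mcqs_matched = 196_192, 176_769
videos_total, videos_matched = 49_265, 44_406

mcq_coverage = mcqs_matched / mcqs_total        # share of MCQs with a video file
video_coverage = videos_matched / videos_total  # share of videos present on disk

print(f"MCQ coverage:   {mcq_coverage:.1%}")
print(f"Video coverage: {video_coverage:.1%}")
```

Both ratios come out at about 90%, so roughly one in ten extracted MCQs currently lacks a matching video file.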
## Citation
```bibtex
@article{long_grounded_thoughts,
title={Long Grounded Thoughts: Synthesizing Visual Problems and Reasoning Chains at Scale},
year={2025},
url={https://arxiv.org/abs/2511.05705}
}
```
## License
Apache 2.0
Provider:
nvidia



