nvidia/LongGroundedThoughts-video-datagen

Name: nvidia/LongGroundedThoughts-video-datagen
Creator: nvidia
Published: 2026-03-23 14:19:59
License: no description available

Source: Hugging Face. Updated 2026-03-23; indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/nvidia/LongGroundedThoughts-video-datagen

Resource description:

---
license: apache-2.0
tags:
- video-understanding
- mcq-generation
- chain-of-thought
- temporal-reasoning
- multimodal
language:
- en
size_categories:
- 100K<n<1M
---

# LongGroundedThoughts: Video Data Generation Pipeline

Generate temporally-grounded multiple-choice questions (MCQs) with chain-of-thought reasoning from video datasets.

Based on [Long Grounded Thoughts: Synthesizing Visual Problems and Reasoning Chains at Scale](https://arxiv.org/abs/2511.05705).

## Pipeline Overview

```
Stage 0 (Observe)    → Extract video events: speech (Whisper), scene cuts, motion
    ↓
Stage 1 (Ask)        → Generate MCQs: direct extraction + event-grounded generation
    ↓
Stage 2 (Think)      → Simple CoTs: Qwen2.5-VL-Instruct (10 per MCQ)
    ↓
Stage 3 (Think More) → Extended CoTs: DeepSeek-R1-Distill-Qwen-32B
    ↓
Stage 4 (Ground)     → Event-Aware CoTs: reasoning with temporal evidence
    ↓
                     → SFT + DPO training datasets
```

## Key Features

- **5 video datasets**: LLaVA-Video-178K, NExT-QA, CLEVRER, PE-Video, Ego4D
- **196K MCQs** extracted from LLaVA-Video-178K (all 8 duration/source configs)
- **Event-grounded MCQ generation** using Stage 0 metadata (speech, scene cuts, motion)
- **MCQ rewriting**: enhances existing questions with temporal grounding
- **Diverse question types** enforced per video: temporal ordering, speech-visual alignment, scene transition, cause-effect, state change, audio-visual
- **Event-aware reasoning traces**: Stage 4 injects temporal evidence into CoT chains
- **SFT + DPO** output formats compatible with LLaMA-Factory

## Quick Start

```bash
# 1. Setup conda environment
conda create -n qwen_omni python=3.10
conda activate qwen_omni
pip install torch vllm pandas datasets tqdm jinja2 fire pandarallel
# Quote the extras specifier so the shell does not treat [opencv] as a glob
pip install 'scenedetect[opencv]' openai-whisper llamafactory

# 2. Download annotations (all 8 configs, ~2GB)
python main.py download_llava_video \
    --output_dir=/workspace/data/llava_video --config_name=all

# 3. Download videos (~890GB from HuggingFace archives)
bash download_videos.sh

# 4. Stage 0: Extract video events (GPU recommended for Whisper)
python stage_0_video_analysis/extract_events.py \
    --output_dir outputs/video_events \
    --whisper_model large-v3

# 5. Stage 1: Extract MCQs + generate event-grounded MCQs
python main.py generate_llava_video_mcq \
    --data_dir /workspace/data/llava_video --config_name all
python -m stage_1_llava_video.generate_grounded_mcq \
    --events_dir outputs/video_events --mode vl --action both

# 6. Stage 2: Simple CoT generation
python main.py generate_simple_cot long_grounded_thoughts_stage_1/llava_video_all
python main.py collect_simple_cot long_grounded_thoughts_stage_1/llava_video_all llava_video_all

# 7. Stage 3: Extended CoT generation
for PHRASE in "Wait," "Hmm," "Alternatively,"; do
    python main.py generate_extended_cot "|${PHRASE}|" \
        long_grounded_thoughts_stage_2_thought_expansion/llava_video_all
done
python main.py collect_extended_cot llava_video_all

# 8. Stage 4: Event-aware extended CoT
export VIDEO_EVENTS_DIR=outputs/video_events
for PHRASE in "Wait," "Hmm," "Alternatively,"; do
    python main.py generate_event_aware_cot "|${PHRASE}|" \
        long_grounded_thoughts_stage_2_thought_expansion/llava_video_all
done
python main.py collect_event_aware_cot llava_video_all
```

## Stage 0: Video Event Extraction

Extracts structured temporal metadata from each video:

| Component | Model | Output |
|-----------|-------|--------|
| Speech transcription | Whisper large-v3 | Timestamped dialogue with language detection |
| Scene boundaries | PySceneDetect | Cut timestamps |
| Motion analysis | Frame differencing | Per-segment activity level |

Hallucination filtering for Whisper: checks `no_speech_prob`, `avg_logprob`, and `compression_ratio` to discard phantom transcriptions on silent audio.

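The hallucination filter described above can be sketched as follows. This is a minimal illustration, not the pipeline's own code: the three field names are the ones Whisper attaches to each segment, but the thresholds and the way the checks are combined are assumptions (the values mirror the defaults of openai-whisper's `transcribe()` options), and the helper names `is_probable_hallucination` and `filter_segments` are hypothetical.

```python
# Assumed thresholds: these mirror openai-whisper's default transcribe()
# options, not values confirmed by this pipeline.
NO_SPEECH_THRESHOLD = 0.6
LOGPROB_THRESHOLD = -1.0
COMPRESSION_RATIO_THRESHOLD = 2.4


def is_probable_hallucination(segment: dict) -> bool:
    """Heuristically flag a Whisper segment as a phantom transcription."""
    # Highly repetitive text compresses unusually well, a common sign of
    # looping output on silent or noisy audio.
    if segment.get("compression_ratio", 0.0) > COMPRESSION_RATIO_THRESHOLD:
        return True
    # The model itself suspects silence AND is unsure about the decoded text.
    likely_silence = segment.get("no_speech_prob", 0.0) > NO_SPEECH_THRESHOLD
    low_confidence = segment.get("avg_logprob", 0.0) < LOGPROB_THRESHOLD
    return likely_silence and low_confidence


def filter_segments(segments: list) -> list:
    """Keep only the segments that pass the hallucination checks."""
    return [s for s in segments if not is_probable_hallucination(s)]


if __name__ == "__main__":
    demo = [
        {"text": "Put it on the table", "no_speech_prob": 0.02,
         "avg_logprob": -0.25, "compression_ratio": 1.3},
        {"text": "Thank you.", "no_speech_prob": 0.9,
         "avg_logprob": -1.4, "compression_ratio": 0.8},
    ]
    for seg in filter_segments(demo):
        print(seg["text"])
```

Requiring both the silence and the low-confidence signal before discarding a segment is a conservative choice: quiet but genuine speech with a confident decode is kept.
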
```bash
# GPU-accelerated (recommended, ~1s/video on A100)
CUDA_VISIBLE_DEVICES=1 python stage_0_video_analysis/extract_events.py \
    --whisper_model large-v3

# CPU-only (~4s/video)
python stage_0_video_analysis/extract_events.py \
    --whisper_model base
```

Output: `outputs/video_events/{video_id}.json`

```json
{
  "video_path": "/path/to/video.mp4",
  "duration": 38.1,
  "scene_boundaries": [{"timestamp": 8.2}, {"timestamp": 15.1}],
  "speech_segments": [
    {"text": "Put it on the table", "start_t": 5.2, "end_t": 6.8, "language": "en"}
  ],
  "segment_captions": [
    {"text": "significant action (high motion, pixel_diff=42)", "start_t": 5.0, "end_t": 10.0}
  ]
}
```

## Stage 1: Event-Grounded MCQ Generation

Three actions:

| Action | Description |
|--------|-------------|
| `generate` | Create NEW MCQs from event metadata + video frames |
| `rewrite` | Enhance EXISTING MCQs with temporal grounding |
| `both` | Generate new + rewrite existing |

Question diversity is enforced per video, with one question per available event type:

| Type | Requires | Example |
|------|----------|---------|
| `temporal_ordering` | Visual segments | "What happens AFTER the person picks up the broom?" |
| `speech_visual_alignment` | Speech | "What is happening when someone says 'put it down' at 5.2s?" |
| `scene_transition` | Scene cuts | "What changes after the scene cut at 12.3s?" |
| `cause_effect` | Events | "WHY does the person turn around?" |
| `state_change` | Visual segments | "How does the scene change between 0s and 30s?" |
| `audio_visual` | Audio events | "What sound is heard during the cooking scene?" |

Falls back to available types when metadata is sparse.

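The per-video type selection and its fallback can be illustrated with a short sketch. Everything here is an assumption made for illustration: the mapping from question type to event-JSON keys is inferred from the "Requires" column, `audio_events` is a hypothetical key that does not appear in the sample Stage 0 output, and cycling through the available types to fill the question budget is one possible reading of the fallback rule, not the pipeline's confirmed behavior.

```python
# Hypothetical mapping from question type to the Stage 0 JSON keys that can
# support it. "audio_events" is an assumed key, not part of the sample output.
REQUIREMENTS = {
    "temporal_ordering": ["segment_captions"],
    "speech_visual_alignment": ["speech_segments"],
    "scene_transition": ["scene_boundaries"],
    "cause_effect": ["segment_captions", "speech_segments", "scene_boundaries"],
    "state_change": ["segment_captions"],
    "audio_visual": ["audio_events"],
}


def available_question_types(events: dict) -> list:
    """Return the question types for which at least one required key is non-empty."""
    return [
        qtype
        for qtype, keys in REQUIREMENTS.items()
        if any(events.get(key) for key in keys)
    ]


def plan_questions(events: dict, num_questions: int) -> list:
    """Assign one type per question, cycling through the available types."""
    types = available_question_types(events)
    if not types:
        return []  # no usable metadata for this video
    return [types[i % len(types)] for i in range(num_questions)]


if __name__ == "__main__":
    # A video with speech but no scene cuts or visual segments.
    sparse = {
        "speech_segments": [{"text": "Put it on the table", "start_t": 5.2, "end_t": 6.8}],
        "scene_boundaries": [],
        "segment_captions": [],
    }
    print(plan_questions(sparse, 5))
```

With sparse metadata the plan simply repeats whichever types remain, so a speech-only video still receives its full quota of questions.
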
```bash
python -m stage_1_llava_video.generate_grounded_mcq \
    --events_dir outputs/video_events \
    --mode vl \
    --action both \
    --num_questions_per_video 5
```

## Stage 4: Event-Aware Extended CoT

Injects video event metadata into the reasoning chain before the cognitive continuation phrase:

```
[Simple CoT: "The person appears to be cooking based on the visible pots..."]

[Video event metadata (45s video):]
  Speech: 5.2s-6.8s: "Add the salt now"
  Visual: 15.0s-20.0s: significant action (high motion)

Wait, considering the speech at 5.2s where they say "add the salt now" and the
significant activity change at 15.0s, this confirms the person is actively
cooking and following a recipe...
```

The collect step compares accuracy WITH vs WITHOUT event context to measure the impact of temporal grounding.

## Project Structure

```
stage_0_video_analysis/          # Video event extraction
  extract_events.py              # Whisper + PySceneDetect + frame analysis
  DESIGN.md                      # Research design document
stage_1_clevrer/                 # CLEVRER MCQ generation
stage_1_ego4d/                   # Ego4D MCQ generation
stage_1_llava_video/             # LLaVA-Video MCQ extraction
  main.py                        # Direct MCQ conversion
  generate_grounded_mcq.py       # Event-grounded MCQ generation + rewriting
  download.py                    # HuggingFace downloader
stage_1_nextqa/                  # NExT-QA MCQ generation
stage_1_pe_video/                # PE-Video MCQ generation
stage_2_simple_cot/              # Simple CoT (Qwen2.5-VL-Instruct)
stage_3_expand_cot/              # Extended CoT (DeepSeek-R1-32B)
stage_4_event_aware_cot/         # Event-aware extended CoT
templates/                       # Jinja2 prompt templates
main.py                          # CLI entry point (fire)
utils.py                         # Shared utilities
download_videos.sh               # Download all 182 video archives (~890GB)
run_full_scale_pipeline.sh       # End-to-end pipeline script
outputs/                         # Generated data
  llava_video_mcq_all.csv        # 196K extracted MCQs
  grounded_mcq.csv               # Event-grounded MCQs
  video_events/                  # Stage 0 event JSONs
  stage_2_simple_cot/            # Stage 2 intermediates
  stage_4_event_aware_cot/       # Stage 4 intermediates
  sft_*.json / dpo_*.json        # Training datasets
```

## GPU Requirements

| Stage | Model | GPU Memory | Notes |
|-------|-------|-----------|-------|
| 0 | Whisper large-v3 | ~10 GB | ~1s/video on A100 |
| 1 (VL) | Qwen2.5-VL-7B | ~15 GB | TP=1 on A100 |
| 2 | Qwen2.5-VL-7B | ~15 GB | TP=1, processes video frames |
| 3 | DeepSeek-R1-32B | ~65 GB | TP=2, text-only |
| 4 | DeepSeek-R1-32B | ~65 GB | TP=2, text-only + event context |

## Environment Variables

```bash
export QWEN2_5_VL_INSTRUCT_PATH=/path/to/Qwen2.5-VL-7B-Instruct
export R1_DISTILLED_QWEN_32_B=/path/to/DeepSeek-R1-Distill-Qwen-32B
export LLAMAFACTORY_DIR=/path/to/LLaMA-Factory
export VIDEO_EVENTS_DIR=outputs/video_events
export VLLM_TENSOR_PARALLEL_SIZE=1
export VLLM_GPU_MEMORY_UTILIZATION=0.85
export VLLM_MAX_MODEL_LEN=16384
```

## Current Data Scale

| Dataset | MCQs | Videos |
|---------|------|--------|
| LLaVA-Video-178K (all configs) | 196,192 | 49,265 |
| With matching video files | 176,769 | 44,406 |
| Stage 2 CoTs generated | 54,840 | 1,391 |
| Stage 0 events extracted | in progress | 1,085 pilot |

## Citation

```bibtex
@article{long_grounded_thoughts,
  title={Long Grounded Thoughts: Synthesizing Visual Problems and Reasoning Chains at Scale},
  year={2025},
  url={https://arxiv.org/abs/2511.05705}
}
```

## License

Apache 2.0

Provider:

nvidia
