nvidia/nemotron-research-lgt
Source: Hugging Face; updated 2026-03-17; indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/nvidia/nemotron-research-lgt
Resource description:
---
license: cc-by-nc-4.0
language:
- en
tags:
- reasoning
- vision
- multimodal
- vlm
- chain-of-thought
- nvidia
size_categories:
- 1M<n<10M
configs:
- config_name: sft_stage1
  data_files:
  - split: train
    path: data/sft_stage1/train-*.parquet
- config_name: sft_stage2
  data_files:
  - split: train
    path: data/sft_stage2/train-*.parquet
- config_name: sft_stage2_merged
  data_files:
  - split: train
    path: data/sft_stage2_merged/train-*.parquet
- config_name: dpo_stage1
  data_files:
  - split: train
    path: data/dpo_stage1/train-*.parquet
- config_name: dpo_stage2
  data_files:
  - split: train
    path: data/dpo_stage2/train-*.parquet
---
<div align="center">
# LGT-1M+
**Long Grounded Thoughts: Synthesizing Visual Problems and Reasoning Chains at Scale**
[arXiv:2511.05705](https://arxiv.org/abs/2511.05705)
[License: CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/)
</div>
**LGT** is a synthetic visual reasoning dataset containing **1M+ new visual reasoning problems** of varying complexity, with reasoning traces that exhibit complex cognitive behaviors. The dataset supports the entire VLM post-training spectrum: SFT, offline RL (DPO), and online RL (GRPO/RLVR).
This dataset is for research and development only.
## Long Grounded Thoughts (LGT)
We identify two critical bottlenecks that appear when current vision-centric reasoning data synthesis frameworks are pushed to scale: 1) **problem synthesis saturation**, where relying solely on captions for problem synthesis limits the diversity and grounding of the generated vision-centric questions, so downstream performance saturates quickly; and 2) **cognitive simplicity**, where the synthesized reasoning traces still lack higher-order reasoning structures, in part because the underlying problems are simple and the synthesis lacks diversity.
LGT tackles these challenges by focusing on **scale**, **complexity**, and **CoT richness**:
- **Scale.** LGT overcomes the bottlenecks of caption-only synthesis methods by leveraging grounded object metadata, scaling novel problem synthesis from 30K to more than 1M high-quality problems without performance saturation.
- **Complexity.** LGT significantly increases the difficulty of the synthesized vision-centric problems. The composition hardening algorithm reduces the rate of trivially solvable questions by ~10x (from 36.7% in baselines to 3.3%).
- **CoT Richness.** LGT enhances the cognitive depth of the synthesized CoT traces. The method increases the frequency of complex cognitive behaviors by 206%, yielding traces that are about 3x richer than the baseline.
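The ratios quoted in the bullets above can be sanity-checked with a few lines of arithmetic. This is a minimal sketch that uses only the percentages stated in this card; the variable names are illustrative and not part of the dataset.

```python
# Figures quoted in the bullets above.
baseline_trivial_rate = 36.7   # % of trivially solvable questions in baselines
lgt_trivial_rate = 3.3         # % after composition hardening
behavior_increase_pct = 206.0  # relative increase in complex cognitive behaviors

# Reduction factor for trivially solvable questions.
reduction_factor = baseline_trivial_rate / lgt_trivial_rate

# A 206% increase means the new frequency is (1 + 2.06) times the baseline.
richness_multiplier = 1.0 + behavior_increase_pct / 100.0

print(f"trivial-question reduction: {reduction_factor:.1f}x")
print(f"cognitive-behavior multiplier: {richness_multiplier:.2f}x")
```

The first ratio comes out slightly above 11, which matches the "~10x" claim, and the second is just over 3, which matches the "3x richer" claim.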
## Dataset Overview
| Config | Split | Examples | Format | Description |
|--------|-------|----------|--------|-------------|
| `sft_stage1` | train | 753,675 | SFT | Grounded MCQs with long CoT traces (~750K) |
| `sft_stage2` | train | 163,873 | SFT | Composition-hardened MCQs with long CoT traces |
| `sft_stage2_merged` | train | 243,873 | SFT | Stage 2 composition-hardened MCQs with long CoT traces, merged with additional data |
| `dpo_stage1` | train | 129,318 | DPO | Preference pairs from Stage 1 MCQs |
| `dpo_stage2` | train | 15,685 | DPO | Preference pairs from Stage 2 composition-hardened MCQs |
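As a quick consistency check, the example counts in the table can be summed and compared with the `size_categories` tag in the card metadata. The sketch below only restates numbers from the table. The `extra_merged_rows` figure assumes that `sft_stage2_merged` contains every row of `sft_stage2`, which this card suggests but does not state outright.

```python
# Example counts copied from the Dataset Overview table.
config_sizes = {
    "sft_stage1": 753_675,
    "sft_stage2": 163_873,
    "sft_stage2_merged": 243_873,
    "dpo_stage1": 129_318,
    "dpo_stage2": 15_685,
}

# Total rows across all configs (rows, not necessarily unique problems).
total_rows = sum(config_sizes.values())

# Rows added by the merged config, ASSUMING it is a superset of sft_stage2.
extra_merged_rows = config_sizes["sft_stage2_merged"] - config_sizes["sft_stage2"]

# The card metadata declares the size category 1M<n<10M.
in_declared_range = 1_000_000 < total_rows < 10_000_000

print(total_rows, extra_merged_rows, in_declared_range)
```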
**Source images:** [DOCCI](https://google.github.io/docci/) (Descriptions of Connected and Contrasting Images). Images are referenced by DOCCI filename (e.g., `train_00557.jpg`) and must be obtained from the original DOCCI dataset.
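Because the `image` field stores only a DOCCI filename, each row has to be joined with a local copy of the DOCCI images before training. The helper below is a minimal sketch: `resolve_image_path` and the `docci_root` directory are hypothetical names, and it assumes the images were unpacked into one flat folder.

```python
from pathlib import Path


def resolve_image_path(image_name: str, docci_root: str) -> Path:
    """Map a DOCCI filename from the `image` field to a local file path.

    `docci_root` is a hypothetical directory holding the downloaded DOCCI
    images as a flat folder; adjust it to your own layout.
    """
    # Reject values that are not bare filenames (e.g. containing directories).
    if Path(image_name).name != image_name:
        raise ValueError(f"expected a bare filename, got {image_name!r}")
    return Path(docci_root) / image_name


# Usage with the filename quoted above and an assumed local folder.
print(resolve_image_path("train_00557.jpg", "docci/images"))
```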
## Data Format
### SFT Configs (`sft_stage1`, `sft_stage2`, `sft_stage2_merged`)
| Field | Type | Description |
|-------|------|-------------|
| `image` | `str` | DOCCI image filename (e.g., `train_00557.jpg`) |
| `question` | `str` | Visual MCQ with `<image>` tag and answer choices |
| `response` | `str` | Long CoT reasoning trace in `<think>...</think><answer>...</answer>` format |
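The `<think>...</think><answer>...</answer>` layout described in the table can be split with a regular expression. The following is a sketch under that stated format; `split_response` is a hypothetical helper and the sample string is made up for illustration, not taken from the dataset.

```python
import re

# Matches the reasoning trace and the final answer, allowing whitespace
# between the closing </think> tag and the opening <answer> tag.
RESPONSE_PATTERN = re.compile(
    r"<think>(?P<think>.*?)</think>\s*<answer>(?P<answer>.*?)</answer>",
    re.DOTALL,
)


def split_response(response: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a response string in the SFT format."""
    match = RESPONSE_PATTERN.search(response)
    if match is None:
        raise ValueError("response does not follow the <think>/<answer> format")
    return match.group("think").strip(), match.group("answer").strip()


# Made-up sample in the documented format.
sample = "<think> Two cars, four beams. </think> <answer> (B) </answer>"
reasoning, answer = split_response(sample)
print(reasoning)
print(answer)
```

Rows whose response is truncated or lacks a closing tag raise `ValueError`, so callers can decide whether to skip or repair them.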
### DPO Configs (`dpo_stage1`, `dpo_stage2`)
| Field | Type | Description |
|-------|------|-------------|
| `image` | `str` | DOCCI image filename |
| `question` | `str` | Visual MCQ with `<image>` tag and answer choices |
| `chosen` | `str` | Preferred reasoning trace (correct answer) |
| `rejected` | `str` | Dispreferred reasoning trace (incorrect answer) |
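Before preference training, it can help to validate each DPO row and rename fields to whatever the training code expects. The sketch below is one possible approach: `to_preference_record` is a hypothetical helper, the output key `prompt` is an assumed convention rather than something this dataset defines, and the sample row is made up.

```python
REQUIRED_DPO_FIELDS = ("image", "question", "chosen", "rejected")


def to_preference_record(row: dict) -> dict:
    """Validate a DPO row and return it with `question` renamed to `prompt`."""
    missing = [key for key in REQUIRED_DPO_FIELDS if key not in row]
    if missing:
        raise KeyError(f"row is missing fields: {missing}")
    # A preference pair carries no signal if both traces are identical.
    if row["chosen"] == row["rejected"]:
        raise ValueError("chosen and rejected traces are identical")
    return {
        "image": row["image"],
        "prompt": row["question"],  # assumed output key name
        "chosen": row["chosen"],
        "rejected": row["rejected"],
    }


# Made-up row that follows the field layout in the table above.
example_row = {
    "image": "train_00557.jpg",
    "question": "<image>How many cars are visible?\n(A) 1\n(B) 2",
    "chosen": "<think> I count two. </think> <answer> (B) </answer>",
    "rejected": "<think> I count one. </think> <answer> (A) </answer>",
}
record = to_preference_record(example_row)
print(sorted(record))
```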
### Example (SFT)
```json
{
"image": "train_03916.jpg",
"question": "<image>Which of the following correctly combines the number of cars in the left lane and the number of shorter, wider beams connecting the vertical red metal beam on the near side of the bridge?\nSelect from the following choices.\n(A) 1 car and 3 beams\n(B) 2 cars and 4 beams\n(C) 2 cars and 5 beams\n(D) 3 cars and 4 beams",
"response": "<think> To solve this problem, let's break it down step by step:\n\n1. **Count the number of cars in the left lane:**\n - In the image, there are two cars visible in the left lane.\n\n2. **Count the number of shorter, wider beams connecting the vertical red metal beam on the near side of the bridge:**\n - Looking at the vertical red metal beam on the near side of the bridge, we can see that there are five shorter, wider beams connecting it to the horizontal structure above.\n\n3. **Combine the counts:**\n - We have 2 cars and 5 beams.\n\n... Wait, that doesn't seem right. Let me double-check the beams. The image shows four shorter, wider beams connecting the vertical poles on the near side. I must have miscounted earlier. Yes, it's four beams.\n\nSo the correct combination should be 2 cars and 4 beams. </think> <answer> ...."
}
```
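The answer choices in a `question` string like the one above follow a `(A) ...` line pattern, so they can be pulled out with a regular expression. This is a sketch that assumes every choice sits on its own line and starts with a parenthesized capital letter, as in the example; `parse_choices` is a hypothetical helper.

```python
import re

# One choice per line, e.g. "(B) 2 cars and 4 beams".
CHOICE_PATTERN = re.compile(r"^\(([A-Z])\)\s*(.+)$", re.MULTILINE)


def parse_choices(question: str) -> dict[str, str]:
    """Return a mapping from choice letter to choice text."""
    return {letter: text.strip() for letter, text in CHOICE_PATTERN.findall(question)}


# Shortened version of the example question above.
question = (
    "<image>Which of the following correctly combines the counts?\n"
    "Select from the following choices.\n"
    "(A) 1 car and 3 beams\n"
    "(B) 2 cars and 4 beams\n"
    "(C) 2 cars and 5 beams\n"
    "(D) 3 cars and 4 beams"
)
choices = parse_choices(question)
print(choices["B"])
```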
## Usage
```python
from datasets import load_dataset

# Load a specific SFT config (use "sft_stage2_merged" for the merged Stage 2 set)
sft_stage1 = load_dataset("nvidia/nemotron-research-LGT", "sft_stage1", split="train")
sft_stage2 = load_dataset("nvidia/nemotron-research-LGT", "sft_stage2", split="train")

# Load the DPO preference-pair configs
dpo_stage1 = load_dataset("nvidia/nemotron-research-LGT", "dpo_stage1", split="train")
dpo_stage2 = load_dataset("nvidia/nemotron-research-LGT", "dpo_stage2", split="train")

# Inspect the first SFT example (fields: image, question, response)
print(sft_stage1[0])
```
## Citation
If you find this dataset or the LGT paper helpful, please cite:
```bibtex
@article{acuna2025lgt,
title={Long Grounded Thoughts: Synthesizing Visual Problems and Reasoning Chains at Scale},
author={Acuna, David and Yang, Chao-Han Huck and Deng, Yuntian and Jung, Jaehun and Lu, Ximing and Ammanabrolu, Prithviraj and Kim, Hyunwoo and Liao, Yuan-Hong and Choi, Yejin},
journal={arXiv preprint arXiv:2511.05705},
year={2025}
}
```
Provider:
nvidia



