davinci-cart/sft-v2
收藏Hugging Face2026-03-11 更新2026-03-29 收录
下载链接:
https://hf-mirror.com/datasets/davinci-cart/sft-v2
下载链接
链接失效反馈官方服务:
资源简介:
---
configs:
- config_name: nemotron_math_v2
data_files:
- split: train
path: nemotron_math_v2/train-*.parquet
- config_name: nemotron_science_mcq
data_files:
- split: train
path: nemotron_science_mcq/train-*.parquet
- config_name: nemotron_science_rqa
data_files:
- split: train
path: nemotron_science_rqa/train-*.parquet
- config_name: nemotron_competitive_programming
data_files:
- split: train
path: nemotron_competitive_programming/train-*.parquet
tags:
- synthetic
- text-generation
- mid-training
---
# nemotron_math_v2
Subset **`nemotron_math_v2`** of a mid-training data mix.
| Field | Value |
|---|---|
| Source dataset | `nvidia/Nemotron-Math-v2` |
| Source splits | `high_part02` |
| Processor | `NemotronMathV2Processor` |
| Rows in this push | 70,000 |
| Sample size (full run) | 70,000 |
| Generated | 2026-03-11 00:39 UTC |
## Statistics
- **Rows:** 70,000
- **Avg content length (chars):** 40,908
- **Avg turns per conversation:** 2.5
- **Categories:** math: 70,000
- **Top languages:** english: 70,000
## Schema
| Column | Type | Example |
|---|---|---|
| `messages` | list | [{'role': 'user', 'content': 'Solve the following math problem. Make sure to … |
| `source` | string | nvidia/Nemotron-Math-v2 |
| `source_split` | string | high_part02 |
| `annotator_model` | string | gpt-oss-120b |
| `data_category` | string | math |
| `answer_format` | string | None |
| `expected_answer` | string | None |
| `language` | string | english |
| `model_name` | string | None |
| `programming_language` | string | None |
| `difficulty` | string | None |
| `source_platform` | string | None |
| `code_license` | string | None |
| `num_turns` | int | 2 |
| `chat_template_kwargs` | dict | {'add_generation_prompt': False, 'enable_thinking': True, 'python_tools': [],… |
## Usage
```python
from datasets import load_dataset
ds = load_dataset("davinci-cart/sft-v2", "nemotron_math_v2", split="train")
print(ds[0]["messages"])
```
---
# nemotron_science_mcq
Subset **`nemotron_science_mcq`** of a mid-training data mix.
| Field | Value |
|---|---|
| Source dataset | `nvidia/Nemotron-Science-v1` |
| Source splits | `MCQ` |
| Processor | `NemotronScienceMCQProcessor` |
| Rows in this push | 70,000 |
| Sample size (full run) | 70,000 |
| Generated | 2026-03-11 00:39 UTC |
## Statistics
- **Rows:** 70,000
- **Avg content length (chars):** 7,903
- **Avg turns per conversation:** 2.0
- **Categories:** science: 70,000
- **Top languages:** english: 70,000
## Schema
| Column | Type | Example |
|---|---|---|
| `messages` | list | [{'role': 'user', 'content': "Solve the following multiple-choice problem. \n… |
| `source` | string | nvidia/Nemotron-Science-v1 |
| `source_split` | string | MCQ |
| `annotator_model` | string | gpt-oss-120b |
| `data_category` | string | science |
| `answer_format` | string | None |
| `expected_answer` | string | None |
| `language` | string | english |
| `model_name` | string | None |
| `programming_language` | string | None |
| `difficulty` | string | None |
| `source_platform` | string | None |
| `code_license` | string | None |
| `num_turns` | int | 2 |
| `chat_template_kwargs` | dict | {'add_generation_prompt': False, 'enable_thinking': True, 'python_tools': [],… |
## Usage
```python
from datasets import load_dataset
ds = load_dataset("davinci-cart/sft-v2", "nemotron_science_mcq", split="train")
print(ds[0]["messages"])
```
---
# nemotron_science_rqa
Subset **`nemotron_science_rqa`** of a mid-training data mix.
| Field | Value |
|---|---|
| Source dataset | `nvidia/Nemotron-Science-v1` |
| Source splits | `RQA` |
| Processor | `NemotronScienceRQAProcessor` |
| Rows in this push | 30,000 |
| Sample size (full run) | 30,000 |
| Generated | 2026-03-11 00:39 UTC |
## Statistics
- **Rows:** 30,000
- **Avg content length (chars):** 14,770
- **Avg turns per conversation:** 2.0
- **Categories:** science: 30,000
- **Top languages:** english: 30,000
## Schema
| Column | Type | Example |
|---|---|---|
| `messages` | list | [{'role': 'user', 'content': 'Solve the following problem. Make sure to put t… |
| `source` | string | nvidia/Nemotron-Science-v1 |
| `source_split` | string | RQA |
| `annotator_model` | string | gpt-oss-120b |
| `data_category` | string | science |
| `answer_format` | string | None |
| `expected_answer` | string | None |
| `language` | string | english |
| `model_name` | string | None |
| `programming_language` | string | None |
| `difficulty` | string | None |
| `source_platform` | string | None |
| `code_license` | string | None |
| `num_turns` | int | 2 |
| `chat_template_kwargs` | dict | {'add_generation_prompt': False, 'enable_thinking': True, 'python_tools': [],… |
## Usage
```python
from datasets import load_dataset
ds = load_dataset("davinci-cart/sft-v2", "nemotron_science_rqa", split="train")
print(ds[0]["messages"])
```
---
# nemotron_competitive_programming
Subset **`nemotron_competitive_programming`** of a mid-training data mix.
| Field | Value |
|---|---|
| Source dataset | `nvidia/Nemotron-Competitive-Programming-v1` |
| Source splits | `competitive_coding_cpp_part00`, `competitive_coding_cpp_part01`, `competitive_coding_python_part00`, `competitive_coding_python_part01`, `infinibyte_part00`, `infinibyte_part01` |
| Processor | `NemotronCompetitiveProgrammingProcessor` |
| Rows in this push | 60,000 |
| Sample size (full run) | 60,000 |
| Generated | 2026-03-11 00:39 UTC |
## Statistics
- **Rows:** 60,000
- **Avg content length (chars):** 54,349
- **Avg turns per conversation:** 2.0
- **Categories:** code: 60,000
## Schema
| Column | Type | Example |
|---|---|---|
| `messages` | list | [{'role': 'user', 'content': 'You are a helpful and harmless assistant. You s… |
| `source` | string | nvidia/Nemotron-Competitive-Programming-v1 |
| `source_split` | string | competitive_coding_cpp_part00 |
| `annotator_model` | string | None |
| `data_category` | string | code |
| `answer_format` | string | None |
| `expected_answer` | string | None |
| `language` | string | None |
| `model_name` | string | None |
| `programming_language` | string | None |
| `difficulty` | string | None |
| `source_platform` | string | None |
| `code_license` | string | None |
| `num_turns` | int | 2 |
| `chat_template_kwargs` | dict | {'add_generation_prompt': False, 'enable_thinking': True, 'python_tools': [],… |
## Usage
```python
from datasets import load_dataset
ds = load_dataset("davinci-cart/sft-v2", "nemotron_competitive_programming", split="train")
print(ds[0]["messages"])
```
提供机构:
davinci-cart



