
davinci-cart/sft-v2

Source: Hugging Face. Updated 2026-03-11. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/davinci-cart/sft-v2
Description:
---
configs:
- config_name: nemotron_math_v2
  data_files:
  - split: train
    path: nemotron_math_v2/train-*.parquet
- config_name: nemotron_science_mcq
  data_files:
  - split: train
    path: nemotron_science_mcq/train-*.parquet
- config_name: nemotron_science_rqa
  data_files:
  - split: train
    path: nemotron_science_rqa/train-*.parquet
- config_name: nemotron_competitive_programming
  data_files:
  - split: train
    path: nemotron_competitive_programming/train-*.parquet
tags:
- synthetic
- text-generation
- mid-training
---

# nemotron_math_v2

Subset **`nemotron_math_v2`** of a mid-training data mix.

| Field | Value |
|---|---|
| Source dataset | `nvidia/Nemotron-Math-v2` |
| Source splits | `high_part02` |
| Processor | `NemotronMathV2Processor` |
| Rows in this push | 70,000 |
| Sample size (full run) | 70,000 |
| Generated | 2026-03-11 00:39 UTC |

## Statistics

- **Rows:** 70,000
- **Avg content length (chars):** 40,908
- **Avg turns per conversation:** 2.5
- **Categories:** math: 70,000
- **Top languages:** english: 70,000

## Schema

| Column | Type | Example |
|---|---|---|
| `messages` | list | [{'role': 'user', 'content': 'Solve the following math problem. Make sure to … |
| `source` | string | nvidia/Nemotron-Math-v2 |
| `source_split` | string | high_part02 |
| `annotator_model` | string | gpt-oss-120b |
| `data_category` | string | math |
| `answer_format` | string | None |
| `expected_answer` | string | None |
| `language` | string | english |
| `model_name` | string | None |
| `programming_language` | string | None |
| `difficulty` | string | None |
| `source_platform` | string | None |
| `code_license` | string | None |
| `num_turns` | int | 2 |
| `chat_template_kwargs` | dict | {'add_generation_prompt': False, 'enable_thinking': True, 'python_tools': [],… |

## Usage

```python
from datasets import load_dataset

ds = load_dataset("davinci-cart/sft-v2", "nemotron_math_v2", split="train")
print(ds[0]["messages"])
```

---

# nemotron_science_mcq

Subset **`nemotron_science_mcq`** of a mid-training data mix.

| Field | Value |
|---|---|
| Source dataset | `nvidia/Nemotron-Science-v1` |
| Source splits | `MCQ` |
| Processor | `NemotronScienceMCQProcessor` |
| Rows in this push | 70,000 |
| Sample size (full run) | 70,000 |
| Generated | 2026-03-11 00:39 UTC |

## Statistics

- **Rows:** 70,000
- **Avg content length (chars):** 7,903
- **Avg turns per conversation:** 2.0
- **Categories:** science: 70,000
- **Top languages:** english: 70,000

## Schema

| Column | Type | Example |
|---|---|---|
| `messages` | list | [{'role': 'user', 'content': "Solve the following multiple-choice problem. \n… |
| `source` | string | nvidia/Nemotron-Science-v1 |
| `source_split` | string | MCQ |
| `annotator_model` | string | gpt-oss-120b |
| `data_category` | string | science |
| `answer_format` | string | None |
| `expected_answer` | string | None |
| `language` | string | english |
| `model_name` | string | None |
| `programming_language` | string | None |
| `difficulty` | string | None |
| `source_platform` | string | None |
| `code_license` | string | None |
| `num_turns` | int | 2 |
| `chat_template_kwargs` | dict | {'add_generation_prompt': False, 'enable_thinking': True, 'python_tools': [],… |

## Usage

```python
from datasets import load_dataset

ds = load_dataset("davinci-cart/sft-v2", "nemotron_science_mcq", split="train")
print(ds[0]["messages"])
```

---

# nemotron_science_rqa

Subset **`nemotron_science_rqa`** of a mid-training data mix.

| Field | Value |
|---|---|
| Source dataset | `nvidia/Nemotron-Science-v1` |
| Source splits | `RQA` |
| Processor | `NemotronScienceRQAProcessor` |
| Rows in this push | 30,000 |
| Sample size (full run) | 30,000 |
| Generated | 2026-03-11 00:39 UTC |

## Statistics

- **Rows:** 30,000
- **Avg content length (chars):** 14,770
- **Avg turns per conversation:** 2.0
- **Categories:** science: 30,000
- **Top languages:** english: 30,000

## Schema

| Column | Type | Example |
|---|---|---|
| `messages` | list | [{'role': 'user', 'content': 'Solve the following problem. Make sure to put t… |
| `source` | string | nvidia/Nemotron-Science-v1 |
| `source_split` | string | RQA |
| `annotator_model` | string | gpt-oss-120b |
| `data_category` | string | science |
| `answer_format` | string | None |
| `expected_answer` | string | None |
| `language` | string | english |
| `model_name` | string | None |
| `programming_language` | string | None |
| `difficulty` | string | None |
| `source_platform` | string | None |
| `code_license` | string | None |
| `num_turns` | int | 2 |
| `chat_template_kwargs` | dict | {'add_generation_prompt': False, 'enable_thinking': True, 'python_tools': [],… |

## Usage

```python
from datasets import load_dataset

ds = load_dataset("davinci-cart/sft-v2", "nemotron_science_rqa", split="train")
print(ds[0]["messages"])
```

---

# nemotron_competitive_programming

Subset **`nemotron_competitive_programming`** of a mid-training data mix.

| Field | Value |
|---|---|
| Source dataset | `nvidia/Nemotron-Competitive-Programming-v1` |
| Source splits | `competitive_coding_cpp_part00`, `competitive_coding_cpp_part01`, `competitive_coding_python_part00`, `competitive_coding_python_part01`, `infinibyte_part00`, `infinibyte_part01` |
| Processor | `NemotronCompetitiveProgrammingProcessor` |
| Rows in this push | 60,000 |
| Sample size (full run) | 60,000 |
| Generated | 2026-03-11 00:39 UTC |

## Statistics

- **Rows:** 60,000
- **Avg content length (chars):** 54,349
- **Avg turns per conversation:** 2.0
- **Categories:** code: 60,000

## Schema

| Column | Type | Example |
|---|---|---|
| `messages` | list | [{'role': 'user', 'content': 'You are a helpful and harmless assistant. You s… |
| `source` | string | nvidia/Nemotron-Competitive-Programming-v1 |
| `source_split` | string | competitive_coding_cpp_part00 |
| `annotator_model` | string | None |
| `data_category` | string | code |
| `answer_format` | string | None |
| `expected_answer` | string | None |
| `language` | string | None |
| `model_name` | string | None |
| `programming_language` | string | None |
| `difficulty` | string | None |
| `source_platform` | string | None |
| `code_license` | string | None |
| `num_turns` | int | 2 |
| `chat_template_kwargs` | dict | {'add_generation_prompt': False, 'enable_thinking': True, 'python_tools': [],… |

## Usage

```python
from datasets import load_dataset

ds = load_dataset("davinci-cart/sft-v2", "nemotron_competitive_programming", split="train")
print(ds[0]["messages"])
```
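
Since the card documents four configs that are each loaded separately, a small sketch of how the overall mix could be inspected may help. The row counts below are copied from the "Rows in this push" entries above. The names `ROWS_PER_CONFIG`, `mixture_shares`, and `load_all_configs` are hypothetical helpers introduced for illustration and are not part of the dataset or of the `datasets` library. The loader only repeats the card's own `load_dataset` call once per config and returns the results in a dict rather than merging them, as a cautious assumption that the column types may not match exactly across configs.

```python
from typing import Dict

# Row counts copied from the "Rows in this push" entries of the card.
ROWS_PER_CONFIG: Dict[str, int] = {
    "nemotron_math_v2": 70_000,
    "nemotron_science_mcq": 70_000,
    "nemotron_science_rqa": 30_000,
    "nemotron_competitive_programming": 60_000,
}


def mixture_shares(rows: Dict[str, int]) -> Dict[str, float]:
    """Return each config's fraction of the total row count."""
    total = sum(rows.values())
    return {name: count / total for name, count in rows.items()}


def load_all_configs(repo_id: str = "davinci-cart/sft-v2") -> dict:
    """Load the train split of every config (needs `datasets` and network access)."""
    # Imported lazily so the helpers above can be used offline.
    from datasets import load_dataset

    return {
        name: load_dataset(repo_id, name, split="train")
        for name in ROWS_PER_CONFIG
    }


if __name__ == "__main__":
    total_rows = sum(ROWS_PER_CONFIG.values())
    print(f"total rows: {total_rows:,}")
    for name, share in mixture_shares(ROWS_PER_CONFIG).items():
        print(f"{name}: {share:.1%}")
```

By the counts stated in the card, the four subsets sum to 230,000 rows. Calling `load_all_configs()` downloads every subset, so it is probably worth trying a single config first.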
Provider:
davinci-cart