agentlans/nvidia-Nemotron-Science-Math

Name: agentlans/nvidia-Nemotron-Science-Math
Creator: agentlans
Published: 2026-04-15 00:00:42
License: No description available

Hugging Face · Updated 2026-04-15 · Indexed 2026-04-26

Download link:

https://hf-mirror.com/datasets/agentlans/nvidia-Nemotron-Science-Math

Resource description:

---
license: cc-by-4.0
task_categories:
- text-generation
- question-answering
language:
- en
tags:
- chain-of-thought
- reasoning
- math
- science
- nemotron
- nlp
- synthetic-data
size_categories:
- 100K<n<1M
---

# NVIDIA Nemotron Science and Math Reasoning

This is an unofficial, curated collection derived from NVIDIA's open-source Nemotron datasets. It is designed specifically to train language models in complex scientific and mathematical reasoning by providing structured chain-of-thought (CoT) examples.

To ensure efficiency, the shortest available CoT sequence was chosen for each question, filtering out redundant variations while preserving the core logical progression toward the final answer.

## Data Composition

| Source | Split/Type | Rows |
| :--- | :--- | ---: |
| [nvidia/Nemotron-Science-v1](https://huggingface.co/datasets/nvidia/Nemotron-Science-v1) | MCQ | 174 154 |
| [nvidia/Nemotron-Math-v2](https://huggingface.co/datasets/nvidia/Nemotron-Math-v2) | RQA | 70 054 |
| [nvidia/Nemotron-Post-Training-Dataset-v1](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v1) | stem | 66 392 |
| [nvidia/Nemotron-Post-Training-Dataset-v1](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v1) | math | 58 637 |
| [nvidia/Nemotron-Science-v1](https://huggingface.co/datasets/nvidia/Nemotron-Science-v1) | RQA | 52 179 |

*Note: For the Post-Training Dataset splits, a maximum of 500 000 rows per split was applied during processing.*

## Data Structure

Each entry is formatted as a JSON object.

* **`id`**: A unique identifier (UUID) or the original ID from the source dataset.
* **`question`**: The scientific or mathematical problem prompt.
* **`answer`**: The final concise answer.
* **`thought`**: The shortest chain-of-thought extracted from the original `<think>` tags.
* **`source`**: The original dataset and split origin.
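The shortest-CoT selection mentioned in the introduction can be illustrated with a minimal sketch over rows that carry the fields listed above. This is not the curator's actual pipeline: the function name `keep_shortest_thought` is hypothetical, the toy rows are made up for the demo, and two assumptions are baked in, namely that candidates are grouped by exact `question` string and that "shortest" means fewest characters (the card does not say whether characters or tokens were counted).

```python
from typing import Dict, Iterable, List


def keep_shortest_thought(rows: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep, for each distinct question, the row whose `thought` is shortest.

    Assumption: length is measured in characters and questions are matched
    by exact string equality. The dataset card specifies neither.
    """
    best: Dict[str, Dict[str, str]] = {}
    for row in rows:
        key = row["question"]
        current = best.get(key)
        if current is None or len(row["thought"]) < len(current["thought"]):
            best[key] = row
    # Dicts preserve insertion order, so questions come out in first-seen order.
    return list(best.values())


# Toy rows invented for illustration; they are not taken from the dataset.
candidates = [
    {"id": "a", "question": "What is 2 + 2?", "answer": "4",
     "thought": "Add 2 and 2 step by step to get 4.", "source": "demo"},
    {"id": "b", "question": "What is 2 + 2?", "answer": "4",
     "thought": "2 + 2 = 4.", "source": "demo"},
    {"id": "c", "question": "What is 3 * 3?", "answer": "9",
     "thought": "3 * 3 = 9.", "source": "demo"},
]

selected = keep_shortest_thought(candidates)
print([row["id"] for row in selected])
```

Ties are resolved in favor of the first candidate seen, because the comparison is strict; a different tie-break would be equally consistent with the card's description.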
### Example Entry

```json
{
  "id": "4f184420-aea5-41e9-9a33-c398d5366895",
  "question": "Solve the following problem... [Problem Details] ...",
  "answer": "\\[\n\\boxed{-1.48\\text{eV}}\n\\]",
  "thought": "We need to parse the problem... [Detailed reasoning steps] ...",
  "source": "nvidia/Nemotron-Science-v1 RQA"
}
```

## Limitations

* **Subset Scope**: This represents only a portion of the original NVIDIA datasets.
* **Functional Constraints**: The dataset does not include multi-turn conversations, tool-calling capabilities, or visual-reasoning tasks.
* **Model Performance**: Users should be aware that LLMs may still exhibit hallucinations or errors in complex multi-step reasoning, regardless of training data quality.

## Licensing

This dataset is released under the **Creative Commons Attribution 4.0 (CC-BY-4.0)** license. Please ensure you provide proper attribution to the original creators at NVIDIA and cite this repository when utilizing this data for training or fine-tuning models.
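As a usage sketch for entries shaped like the example entry, the snippet below parses a JSON record, pulls the content out of the LaTeX `\boxed{...}` wrapper in `answer`, and splits `source` into dataset and split. The helper `extract_boxed` is hypothetical, only the non-elided fields of the example are copied, and two assumptions apply: that answers use `\boxed{...}` as in the example (the card does not promise this for every row) and that `source` is always "dataset, space, split".

```python
import json
from typing import Optional


def extract_boxed(answer: str) -> Optional[str]:
    """Return the contents of the first LaTeX boxed expression, or None.

    Braces are matched by depth so nested groups such as \\text{eV} survive.
    """
    marker = "\\boxed{"
    start = answer.find(marker)
    if start == -1:
        return None
    depth = 1
    collected = []
    for ch in answer[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(collected)
        collected.append(ch)
    return None  # braces never balanced


# Subset of the example entry's fields, kept as raw JSON text.
raw = r'''{
  "id": "4f184420-aea5-41e9-9a33-c398d5366895",
  "answer": "\\[\n\\boxed{-1.48\\text{eV}}\n\\]",
  "source": "nvidia/Nemotron-Science-v1 RQA"
}'''

entry = json.loads(raw)
boxed = extract_boxed(entry["answer"])

# Assumption: the split name is the last space-separated token of `source`.
dataset_name, split_name = entry["source"].rsplit(" ", 1)

print(boxed)
print(dataset_name, split_name)
```

Rows whose `answer` has no boxed expression yield `None`, so callers can fall back to the raw `answer` string in that case.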

Provider:

agentlans
