
agentlans/nvidia-Nemotron-Science-Math

Source: Hugging Face | Updated: 2026-04-15 | Indexed: 2026-04-26
Download link:
https://hf-mirror.com/datasets/agentlans/nvidia-Nemotron-Science-Math
Resource description:
---
license: cc-by-4.0
task_categories:
- text-generation
- question-answering
language:
- en
tags:
- chain-of-thought
- reasoning
- math
- science
- nemotron
- nlp
- synthetic-data
size_categories:
- 100K<n<1M
---

# NVIDIA Nemotron Science and Math Reasoning

This is an unofficial, curated collection derived from NVIDIA's open-source Nemotron datasets. It is designed specifically to train language models in complex scientific and mathematical reasoning by providing structured chain-of-thought (CoT) examples.

To ensure efficiency, the shortest available CoT sequence was chosen for each question, filtering out redundant variations while preserving the core logical progression toward the final answer.

## Data Composition

| Source | Split/Type | Rows |
| :--- | :--- | ---: |
| [nvidia/Nemotron-Science-v1](https://huggingface.co/datasets/nvidia/Nemotron-Science-v1) | MCQ | 174&thinsp;154 |
| [nvidia/Nemotron-Math-v2](https://huggingface.co/datasets/nvidia/Nemotron-Math-v2) | RQA | 70&thinsp;054 |
| [nvidia/Nemotron-Post-Training-Dataset-v1](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v1) | stem | 66&thinsp;392 |
| [nvidia/Nemotron-Post-Training-Dataset-v1](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v1) | math | 58&thinsp;637 |
| [nvidia/Nemotron-Science-v1](https://huggingface.co/datasets/nvidia/Nemotron-Science-v1) | RQA | 52&thinsp;179 |

*Note: For the Post-Training Dataset splits, a maximum of 500&thinsp;000 rows per split was applied during processing.*

## Data Structure

Each entry is formatted as a JSON object.

* **`id`**: A unique identifier (UUID) or the original ID from the source dataset.
* **`question`**: The scientific or mathematical problem prompt.
* **`answer`**: The final concise answer.
* **`thought`**: The shortest chain-of-thought extracted from the original `<think>` tags.
* **`source`**: The original dataset and split origin.
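
The curation script itself is not part of this card, so the following is only a minimal sketch of how the `thought` field described above could be produced: extract the text inside `<think>` tags and keep the shortest one per question. The input field name `response`, the helper names `extract_thought` and `shortest_thought_per_question`, and the choice to measure length in characters rather than tokens are all assumptions made for illustration.

```python
import re
import uuid
from typing import Dict, Iterable, List, Optional

# Matches the text between the first <think> ... </think> pair, across newlines.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def extract_thought(response: str) -> Optional[str]:
    """Return the stripped text inside the first <think> block, or None if absent."""
    match = THINK_RE.search(response)
    return match.group(1).strip() if match else None


def shortest_thought_per_question(rows: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep one entry per question: the one whose extracted thought is shortest.

    Each input row is a hypothetical dict with "question", "response",
    "answer", "source", and optionally "id". Length is counted in characters,
    which is an assumption; the original selection may have used tokens.
    """
    best: Dict[str, Dict[str, str]] = {}
    for row in rows:
        thought = extract_thought(row["response"])
        if thought is None:
            continue  # skip rows without a <think> block
        current = best.get(row["question"])
        if current is None or len(thought) < len(current["thought"]):
            best[row["question"]] = {
                # Reuse the source ID when present, otherwise generate a UUID.
                "id": row.get("id") or str(uuid.uuid4()),
                "question": row["question"],
                "answer": row["answer"],
                "thought": thought,
                "source": row["source"],
            }
    return list(best.values())
```

Given several rows that share the same `question`, the function returns a single entry for it, carrying the five fields listed above and the shortest extracted `thought`.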

### Example Entry

```json
{
  "id": "4f184420-aea5-41e9-9a33-c398d5366895",
  "question": "Solve the following problem... [Problem Details] ...",
  "answer": "\\[\n\\boxed{-1.48\\text{eV}}\n\\]",
  "thought": "We need to parse the problem... [Detailed reasoning steps] ...",
  "source": "nvidia/Nemotron-Science-v1 RQA"
}
```

## Limitations

* **Subset Scope**: This represents only a portion of the original NVIDIA datasets.
* **Functional Constraints**: The dataset does not include multi-turn conversations, tool-calling capabilities, or visual-reasoning tasks.
* **Model Performance**: Users should be aware that LLMs may still exhibit hallucinations or errors in complex multi-step reasoning, regardless of training data quality.

## Licensing

This dataset is released under the **Creative Commons Attribution 4.0 (CC-BY-4.0)** license. Please ensure you provide proper attribution to the original creators at NVIDIA and cite this repository when utilizing this data for training or fine-tuning models.

Provider:
agentlans