AM-DeepSeek-R1-Distilled-1.4M

ModelScope community; updated 2026-05-23, indexed 2025-03-29
Download link:
https://modelscope.cn/datasets/AI-ModelScope/AM-DeepSeek-R1-Distilled-1.4M
Resource description:
**For more open-source datasets, models, and methodologies, please visit our [GitHub repository](https://github.com/a-m-team/a-m-models).**

[AM-DeepSeek-R1-Distilled-1.4M](https://huggingface.co/datasets/a-m-team/AM-DeepSeek-R1-Distilled-1.4M) is a large-scale general reasoning dataset composed of high-quality, challenging reasoning problems. The problems are collected from numerous open-source datasets, semantically deduplicated, and cleaned to eliminate test set contamination. All responses are distilled from reasoning models (mostly DeepSeek-R1) and have undergone rigorous verification: mathematical problems are validated through answer checking, code problems via test cases, and other tasks through reward model evaluation. Specifically, the responses in am_0.5M.jsonl come from other open-source datasets, while those in am_0.9M.jsonl are distilled from DeepSeek-R1-671B by the [AM team](https://huggingface.co/a-m-team).

We have validated the dataset through model training, confirming its effectiveness and demonstrating performance comparable to the distilled models from the DeepSeek team. Details can be found in our technical report, [1.4 Million Open-Source Distilled Reasoning Dataset to Empower Large Language Model Training](https://github.com/a-m-team/a-m-models/blob/main/docs/AM-DeepSeek-R1-Distilled-Dataset.pdf).

We are releasing these 1.4 million problems and responses to the research community, aiming to foster advancements in powerful reasoning-oriented Large Language Models (LLMs). We sincerely thank the open-source community. Without their support, we would never have come this far.

## Model Training Performance based on this dataset

![alt text](AM-DeepSeek-R1-Distilled.jpeg)

## Scale & Composition

- AM-DeepSeek-R1-Distilled-1.4M: An open-source Chinese and English dataset with reasoning traces (1.4 million entries).
- 0.5 million entries come entirely from open-source datasets, and 0.9 million entries are distilled from DeepSeek-R1-671B by the AM team; the latter are labeled "am-0309" in the response sources.
- Data processing:
  - Instruction part: we used a large language model to score all instructions by difficulty and category. We then filtered the data according to the difficulty and category labels and applied strict semantic deduplication to ensure the quality and diversity of the instructions.
  - Response part: for data with reference answers, we verified part of it using both rules and a large language model. For code data with test cases, we used a sandbox for verification. Additionally, we used a reward model to score all generated results and performed rejection sampling based on the rules and the reward model.

## Unified Format

#### Dedicated fields

- **`messages`**: Array of user-assistant interaction objects, each having `role`, `content`, and `info` fields.
  - **`role`**: Indicates whether the sender of the message is the user or the assistant.
  - **`content`**: In assistant messages, follows the `<think> reasoning process </think> <answer> solution </answer>` format.
  - **`info`**: Contains metadata.
    - **`source`**: Instruction or response source.
    - **`reference_answer`**: Ground-truth answer, present only for some math, science, and other problems.
    - **`test_case`**: Test cases, present only for some code-related problems.
    - **`think_content`**: Reasoning steps in the assistant message.
    - **`answer_content`**: Final solution in the assistant message.

#### sample

```json
{
  "messages": [
    {
      "role": "user",
      "content": "Find the integral of \\(e^{-x} + e^{-2x} dx\\).",
      "info": {
        "source": "natural_reasoning",
        "reference_answer": "-e^{-x} - \\frac{1}{2}e^{-2x} + C",
        "test_case": null
      }
    },
    {
      "role": "assistant",
      "content": "<think>Okay, let's see. The user wants to find the integral of e^{-x} + e^{-2x} dx. Hmm, integrals of exponential functions. I remember that the integral of e^{kx} dx is (1/k)e^{kx} + C, right? But here the exponents are negative. Let me think.\n\nFirst, maybe I can split the integral into two parts because it's the sum of two functions. So, ∫(e^{-x} + e^{-2x}) dx = ∫e^{-x} dx + ∫e^{-2x} dx. That makes sense because integration is linear. Now, handle each integral separately.\n\nFor the first integral, ∫e^{-x} dx. If I use the formula, normally ∫e^{kx} dx = (1/k)e^{kx} + C. Here, k is -1. So substituting, that would be (1/(-1))e^{-x} + C, which simplifies to -e^{-x} + C. Wait, but since we're doing an indefinite integral here, each term will have its own constant, but since they're both combined, we'll have one constant at the end.\n\nNow the second integral: ∫e^{-2x} dx. Again, using the same formula, where k is -2. So that becomes (1/(-2))e^{-2x} + C, which is (-1/2)e^{-2x} + C.\n\nPutting both results together: the first integral is -e^{-x} and the second is (-1/2)e^{-2x}, so adding them and combining the constants gives the final answer. Let me check the signs again. The first term's coefficient is -1, and the second is -1/2. So combining those, it should be -e^{-x} - (1/2)e^{-2x} + C. Yeah, that looks right.\n\nLet me verify by differentiating the result. If I take the derivative of -e^{-x}, that's (-1)(-e^{-x}) = e^{-x}. For the second term, derivative of -1/2 e^{-2x} is (-1/2)(-2)e^{-2x} = e^{-2x}. So adding those derivatives gives e^{-x} + e^{-2x}, which matches the original integrand. Great, so the integration is correct. The constants of integration from both terms just combine into a single constant C at the end.\n</think>\n<answer>\nThe integral of \\(e^{-x} + e^{-2x}\\) with respect to \\(x\\) is:\n\n\\[\n\\int \\left(e^{-x} + e^{-2x}\\right) dx = -e^{-x} - \\frac{1}{2}e^{-2x} + C\n\\]\n\nwhere \\(C\\) is the constant of integration.\n</answer>",
      "info": {
        "source": "am-0309",
        "think_content": "Okay, let's see. The user wants to find the integral of e^{-x} + e^{-2x} dx. Hmm, integrals of exponential functions. I remember that the integral of e^{kx} dx is (1/k)e^{kx} + C, right? But here the exponents are negative. Let me think.\n\nFirst, maybe I can split the integral into two parts because it's the sum of two functions. So, ∫(e^{-x} + e^{-2x}) dx = ∫e^{-x} dx + ∫e^{-2x} dx. That makes sense because integration is linear. Now, handle each integral separately.\n\nFor the first integral, ∫e^{-x} dx. If I use the formula, normally ∫e^{kx} dx = (1/k)e^{kx} + C. Here, k is -1. So substituting, that would be (1/(-1))e^{-x} + C, which simplifies to -e^{-x} + C. Wait, but since we're doing an indefinite integral here, each term will have its own constant, but since they're both combined, we'll have one constant at the end.\n\nNow the second integral: ∫e^{-2x} dx. Again, using the same formula, where k is -2. So that becomes (1/(-2))e^{-2x} + C, which is (-1/2)e^{-2x} + C.\n\nPutting both results together: the first integral is -e^{-x} and the second is (-1/2)e^{-2x}, so adding them and combining the constants gives the final answer. Let me check the signs again. The first term's coefficient is -1, and the second is -1/2. So combining those, it should be -e^{-x} - (1/2)e^{-2x} + C. Yeah, that looks right.\n\nLet me verify by differentiating the result. If I take the derivative of -e^{-x}, that's (-1)(-e^{-x}) = e^{-x}. For the second term, derivative of -1/2 e^{-2x} is (-1/2)(-2)e^{-2x} = e^{-2x}. So adding those derivatives gives e^{-x} + e^{-2x}, which matches the original integrand. Great, so the integration is correct. The constants of integration from both terms just combine into a single constant C at the end.\n",
        "answer_content": "\nThe integral of \\(e^{-x} + e^{-2x}\\) with respect to \\(x\\) is:\n\n\\[\n\\int \\left(e^{-x} + e^{-2x}\\right) dx = -e^{-x} - \\frac{1}{2}e^{-2x} + C\n\\]\n\nwhere \\(C\\) is the constant of integration.\n"
      }
    }
  ]
}
```

## Usage

The dataset is split into two compressed files based on response sources:

- **`am_0.9M.jsonl.zst`**: Responses from the `am-0309` source.
- **`am_0.5M.jsonl.zst`**: Responses from other sources.
- Additionally, a subset of 1,000 random samples (`am_0.9M_1k.jsonl`) from `am-0309` is provided for quick experimentation.

Files are compressed using [zstd](https://github.com/facebook/zstd) for faster download and reduced storage requirements.

**Decompression Instructions**:

```bash
# Install zstd (Debian/Ubuntu), then decompress one of the archives.
apt install zstd
zstd -d am_0.9M.jsonl.zst -o am_0.9M.jsonl
```

## Sources

- Open-source data: Instructions and reasoning traces from existing datasets.
- AM distilled data: High-quality instructions from the open-source datasets, augmented with reasoning traces and solutions generated by DeepSeek-R1.

#### Instruction sources

| Source | Count |
| --- | --- |
| natural_reasoning | 319085 |
| InfinityInstruct | 306675 |
| KodCode | 210838 |
| Dolphin-R1 | 63921 |
| openR1Math_extended | 63290 |
| NuminaMath_1.5 | 62446 |
| openR1Math_default | 62239 |
| codeio | 55176 |
| GeneralThought-Feb25 | 50600 |
| openThoughts | 34620 |
| OpenCoder | 22249 |
| data_ablation_full59K | 14155 |
| MetaMathQA | 14083 |
| ... | ... |

#### Response sources

| Source | Count |
| --- | --- |
| am-0309 | 900000 |
| KodCode | 210838 |
| openR1Math_extended | 63290 |
| Dolphin-R1 | 62750 |
| openR1Math_default | 60839 |
| GeneralThought-Feb25 | 50600 |
| openThoughts | 31431 |
| data_ablation_full59K | 14155 |
| Bespoke17k | 5747 |
| ... | ... |

## Limitation and Usage Limits

We require that developers use the open-sourced code, data, models, and any other artifacts generated by this project for research purposes only. Commercial use and other potentially harmful use cases are not allowed.

Since this dataset was generated by LLMs and was not strictly verified, it still has shortcomings regarding factuality and other aspects. Careful inspection is needed when using this dataset.

This dataset does not represent anyone's position, interests, or views, and is not related to any kind of claim of any group. The developers of this project do not assume any responsibility for potential harm inflicted by using this dataset and project.

Because of the nested relationships among the sources of some data, the source labels may be inaccurate.

## Citation

If you use this data, please cite it with the following BibTeX entry:

```
@misc{zhao202514millionopensourcedistilled,
  title={1.4 Million Open-Source Distilled Reasoning Dataset to Empower Large Language Model Training},
  author={Han Zhao and Haotian Wang and Yiping Peng and Sitong Zhao and Xiaoyu Tian and Shuaiting Chen and Yunjie Ji and Xiangang Li},
  year={2025},
  eprint={2503.19633},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2503.19633},
}
```
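The Unified Format section describes assistant `content` as `<think> ... </think> <answer> ... </answer>`, with the same two parts also stored in `info.think_content` and `info.answer_content`. The following is a minimal sketch of splitting `content` and comparing it against those fields. The helper `split_content` and the shortened record are hypothetical illustrations, not part of the dataset tooling, and the exact-match relationship is only what the sample record suggests, not a guarantee for every entry.

```python
import re

# Hypothetical helper (not shipped with the dataset): captures the text between
# the <think> and <answer> tags; DOTALL lets "." match the embedded newlines.
THINK_ANSWER = re.compile(r"<think>(.*?)</think>\s*<answer>(.*?)</answer>", re.DOTALL)


def split_content(content):
    """Return (think, answer) from an assistant `content` string, or None if the tags are missing."""
    match = THINK_ANSWER.search(content)
    if match is None:
        return None
    return match.group(1), match.group(2)


# Shortened, made-up record that follows the documented field layout.
record = {
    "role": "assistant",
    "content": "<think>Split the sum, integrate each term.</think>\n"
               "<answer>\n-e^{-x} - \\frac{1}{2}e^{-2x} + C\n</answer>",
    "info": {
        "source": "am-0309",
        "think_content": "Split the sum, integrate each term.",
        "answer_content": "\n-e^{-x} - \\frac{1}{2}e^{-2x} + C\n",
    },
}

think, answer = split_content(record["content"])
print(think == record["info"]["think_content"])
print(answer == record["info"]["answer_content"])
```

Returning `None` for unmatched strings lets a caller skip or log records whose `content` deviates from the documented format instead of raising.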

**Loading the dataset with `load_dataset`**:

```python
from datasets import load_dataset, Features, Value

# Explicit schema for the nested `messages` structure.
features = Features({
    "messages": [
        {
            "role": Value("string"),
            "content": Value("string"),
            "info": {
                "source": Value("string"),
                "reference_answer": Value("string"),
                "test_case": Value("string"),
                "think_content": Value("string"),
                "answer_content": Value("string")
            }
        }
    ]
})

# Take downloading "am_0.9M_sample_1k.jsonl" as an example.
data = load_dataset('a-m-team/AM-DeepSeek-R1-Distilled-1.4M', 'am_0.9M_sample_1k', features=features)
```
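Once a file has been decompressed with `zstd`, each line is one JSON record in the unified format. As a rough sketch, the response sources can be tallied with the standard library alone. The function name `count_response_sources` and the three tiny records written to a temporary file are assumptions made for illustration; on real data you would pass the path of a decompressed file such as `am_0.9M.jsonl`.

```python
import json
import os
import tempfile
from collections import Counter


def count_response_sources(path):
    """Tally `info.source` over the assistant messages of a decompressed JSONL file."""
    counts = Counter()
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            row = json.loads(line)
            for message in row["messages"]:
                if message["role"] == "assistant":
                    counts[message["info"]["source"]] += 1
    return counts


def make_row(instruction_source, response_source):
    # Made-up miniature record with only the fields this sketch reads.
    return {
        "messages": [
            {"role": "user", "content": "q", "info": {"source": instruction_source}},
            {"role": "assistant", "content": "a", "info": {"source": response_source}},
        ]
    }


rows = [
    make_row("natural_reasoning", "am-0309"),
    make_row("InfinityInstruct", "am-0309"),
    make_row("KodCode", "KodCode"),
]

# Write the demo rows to a temporary JSONL file, count, then clean up.
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False, encoding="utf-8") as tmp:
    for row in rows:
        tmp.write(json.dumps(row) + "\n")

counts = count_response_sources(tmp.name)
os.remove(tmp.name)
print(counts.most_common())
```

Reading line by line keeps memory use flat, which matters for the full files with hundreds of thousands of records.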
Provider: maas
Created: 2025-03-27