AM-DeepSeek-Distilled-40M
Source: ModelScope community. Updated 2025-12-05; indexed 2025-05-03.
Download link:
https://modelscope.cn/datasets/a-m-team/AM-DeepSeek-Distilled-40M
Description:
**For more open-source datasets, models, and methodologies, please visit our [GitHub repository](https://github.com/a-m-team/a-m-models) and paper: [DeepDistill: Enhancing LLM Reasoning Capabilities via Large-Scale Difficulty-Graded Data Training](https://huggingface.co/papers/2504.17565).**
**Due to certain constraints, we are only able to open-source a subset of the complete dataset.**
## Model Training Performance based on our complete dataset
On AIME 2024, our 72B model achieved a score of **79.2** using **only supervised fine-tuning (SFT)**. The 32B model reached 75.8 and improved further to **77.9** through annealing training, approaching state-of-the-art open-source performance.

Long reasoning processes have demonstrated significant effectiveness in enhancing model performance across domains such as mathematics,
code generation, and reasoning. Recent studies have highlighted that training outcomes are markedly influenced by task difficulty.
Although many existing methods utilize a large language model (LLM) to rate the task difficulty,
our empirical analysis reveals that difficulty ratings are substantially biased due to variations among models,
prompting strategies, and subjective preferences. Consequently,
the resulting **absolute difficulty scores are unreliable and challenging to employ effectively in practical applications**.
To mitigate this bias, we propose aggregating multiple responses from diverse models to compute pass rates,
thus determining task difficulty based on relative performance differences. Specifically, we compute pass rates using three models of different sizes (on the assumption that models of different sizes have different capabilities) to measure the difficulty of the data.
[AM-DeepSeek-Distilled-40M](https://huggingface.co/datasets/a-m-team/AM-MultiModel-Difficulty-40M) is a **large-scale, unbiased difficulty-graded reasoning dataset constructed by the [AM Team](https://huggingface.co/a-m-team)**. This dataset contains approximately **3.34 million** unique queries, totaling **40 million** model-generated responses, sourced from numerous high-quality open-source datasets covering five major categories: **code, math, science, instruction-following, and other general reasoning tasks**.
Each query is paired with responses distilled from three different-sized models ([DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B), [DeepSeek-R1-Distill-Qwen-7B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B), and [DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1)). For each query, each model generated four sampled responses, resulting in the comprehensive dataset mentioned above. Difficulty ratings are provided based on comparative success rates across these differently-sized models, significantly reducing bias inherent in difficulty grading derived from a single model.
For tasks in the categories of code, math, science, and instruction-following, explicit success rates for each model size are provided. For tasks categorized under "other," we supply [Decision-Tree-Reward-Llama-3.1-8B](https://huggingface.co/RLHFlow/Decision-Tree-Reward-Llama-3.1-8B) scores to reflect task performance.
This design uniquely enables the selection of data subsets tailored to specific difficulty requirements and provides a robust foundation for various training paradigms such as Supervised Fine Tuning, Preference Learning (e.g., DPO), Reinforcement Learning approaches (e.g., PPO, GRPO), and more.
We are releasing this expansive dataset of 40 million responses to the research community, aiming to foster further advancements in developing powerful and sophisticated Large Language Models (LLMs). We sincerely thank the open-source community for their invaluable contributions, without which this ambitious dataset would not have been possible.
## Unified Format
#### file structure
```bash
.
|-- code_r1_1pass.jsonl
|-- code_r1_2pass.jsonl
|-- code_r1_3pass.jsonl
|-- code_r1_4pass.jsonl
|-- math_r1_1pass.jsonl
|-- math_r1_2pass.jsonl
|-- math_r1_3pass.jsonl
|-- math_r1_4pass.jsonl
|-- ... ...
```
There are 3 models in total, and each model has 4 results per entry, making a total of 12 .jsonl files.
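Each `.jsonl` file holds one JSON object per line, so it can be streamed without loading the whole file into memory. The following is a minimal sketch; `iter_jsonl` is a hypothetical helper name, not part of any released tooling.

```python
import json
from typing import Iterator


def iter_jsonl(path: str) -> Iterator[dict]:
    """Yield one record (a dict) per non-empty line of a .jsonl file."""
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

For example, `for record in iter_jsonl("code_r1_1pass.jsonl"): ...` would visit the records of one file in order.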
#### sample
```json
{
"question": "Write a function that takes in a list of tuples, where each tuple contains a student's name and their score in a particular subject. The function should return a new list of tuples containing the top N students based on their scores, sorted in descending order.",
"answer": "<think>Okay, let's see. I need to write a function that takes a list of tuples, each with a student's name and their score, and return the top N students sorted by their scores in descending order. Hmm.\n\nFirst, I should figure out how to sort the list of tuples based on the scores. Oh right, each tuple is (name, score), so the key for sorting would be the second element of the tuple. And we need to sort in descending order, so reverse=True.\n\nWait, right. So the first step is to sort the list. Then, after sorting, we take the top N elements. But wait, what if N is larger than the length of the list? Oh, probably the function should handle that by returning as many as possible. Like, if there are 5 students and N is 10, return all 5.\n\nSo the steps are:\n\n1. Sort the input list in descending order of scores.\n2. Slice the sorted list to get the first N elements.\n3. Return that sliced list as the result.\n\nLet me think about the sorting. The default sort in Python for tuples sorts by the first element, then second. But here, we need to sort by the second element (the score). So I need a custom key function. The key for each tuple should be the score, which is the element at index 1. And we want to sort in reverse order so that higher scores come first.\n\nSo when sorting, the key is lambda x: x[1], and reverse=True.\n\nWait, but when sorting in descending order, the highest scores will be first. So after sorting, the list is already in the right order. Then taking the first N elements will give the top N.\n\nSo putting it all together:\n\nFunction steps:\n\n- Take parameters: list of tuples (students) and integer N.\n- Sort the students list using the score (descending).\n- Return the first N elements.\n\nEdge cases: What if N is 0? Then return empty list. What if N is negative? The function probably expects N to be a positive integer, but maybe the code should handle that by returning an empty list. Or maybe the function is designed with the assumption that N is a positive integer. The problem statement doesn't specify, so perhaps we can assume that N is a non-negative integer, but in code, take the max between N and 0? Or maybe the code doesn't handle that and relies on the user to pass correct N.\n\nBut according to the problem statement, the function should return the top N students. So perhaps the code should handle N being larger than the list length. For example, if the list has 3 students and N is 5, return all 3.\n\nSo the code would be:\n\ndef top_students(students, N):\n # sort the students by score in descending order\n sorted_students = sorted(students, key=lambda x: x[1], reverse=True)\n # return the first N elements\n return sorted_students[:N]\n\nWait, but the problem says to return a new list of tuples. So this code would do that. Let's test with some examples.\n\nExample input: [(\"Alice\", 85), (\"Bob\", 92), (\"Charlie\", 78)], N=2. Sorted by score would be Bob (92), Alice (85), Charlie (78). So the top 2 are Bob and Alice. Return those two tuples in order.\n\nAnother test case: if N is 0, return empty list. If N is 3, same as the list, return all. If the list is empty, return empty.\n\nSo the code seems correct.\n\nI think that's the solution. Let's code that.\n</think>\n<answer>\n```python\ndef top_students(students, N):\n sorted_students = sorted(students, key=lambda x: x[1], reverse=True)\n return sorted_students[:N]\n```\n</answer>",
"question_source": "opencoder",
"answer_source": "am_distill_r1",
"category": "code",
"ground_truth": null,
"test_case": "\"{\\\"call_type\\\": \\\"assert\\\", \\\"fn_name\\\": \\\"top_students\\\", \\\"assert_case\\\": [\\\"assert top_students([(\\\\\\\"Alice\\\\\\\", 90), (\\\\\\\"Bob\\\\\\\", 85), (\\\\\\\"Charlie\\\\\\\", 95), (\\\\\\\"David\\\\\\\", 92)], 3) == [(\\\\\\\"Charlie\\\\\\\", 95), (\\\\\\\"David\\\\\\\", 92), (\\\\\\\"Alice\\\\\\\", 90)]\\\", \\\"assert top_students([(\\\\\\\"Alice\\\\\\\", 100), (\\\\\\\"Bob\\\\\\\", 90), (\\\\\\\"Charlie\\\\\\\", 80), (\\\\\\\"David\\\\\\\", 70)], 1) == [(\\\\\\\"Alice\\\\\\\", 100)]\\\", \\\"assert top_students([(\\\\\\\"Alice\\\\\\\", 75), (\\\\\\\"Bob\\\\\\\", 60), (\\\\\\\"Charlie\\\\\\\", 80), (\\\\\\\"David\\\\\\\", 85)], 2) == [(\\\\\\\"David\\\\\\\", 85), (\\\\\\\"Charlie\\\\\\\", 80)]\\\"]}\"",
"instruction_constrain": null,
"pass_rate_r1": 1.0,
"pass_rate_7b": 0.5,
"pass_rate_1.5b": 0.5,
"verify_score": 1.0,
"ppl": 2.484375,
"model_name": "r1"
}
```
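In the sample above, the `answer` field wraps the reasoning in `<think>…</think>` and the final response in `<answer>…</answer>`, and the `test_case` value appears to be JSON-encoded twice (a JSON string containing a JSON object). The sketch below shows one way to unpack both. The helper names are hypothetical, and the double encoding is inferred from this single sample rather than documented, so the decoder simply keeps unwrapping while the value is still a string.

```python
import json
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def split_answer(answer: str) -> tuple[str, str]:
    """Return (reasoning, final_answer); a part is "" if its tag is missing."""
    think = THINK_RE.search(answer)
    final = ANSWER_RE.search(answer)
    return (
        think.group(1).strip() if think else "",
        final.group(1).strip() if final else "",
    )


def decode_test_case(raw):
    """Unwrap a test_case value that may be JSON-encoded more than once."""
    value = raw
    while isinstance(value, str):
        try:
            value = json.loads(value)
        except json.JSONDecodeError:
            break  # plain text that is not JSON: return it unchanged
    return value
```

Applied to the sample record, `decode_test_case(record["test_case"])` should yield a dict with the keys `call_type`, `fn_name`, and `assert_case`.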
#### Dedicated fields
- **`question`**: User input
- **`answer`**: Model-generated response
- **`question_source`**: Source of the question
- **`answer_source`**: am_distill_1.5b, am_distill_7b, or am_distill_r1
- **`category`**: math, code, science, instruction follow, or other
- **`ground_truth`**: Correct answer to the question
- **`test_case`**: Test cases if the question type is code; otherwise None
- **`instruction_constrain`**: Instruction constraint if the query type is instruction follow; otherwise None
- **`pass_rate_r1`**: Pass rate of DeepSeek-R1
- **`pass_rate_7b`**: Pass rate of DeepSeek-R1-Distill-Qwen-7B
- **`pass_rate_1.5b`**: Pass rate of DeepSeek-R1-Distill-Qwen-1.5B
- **`verify_score`**: The correctness score of the current model’s responses
- **`ppl`**: Perplexity (PPL) calculated by the 7b model
- **`model_name`**: The model used to generate the answer
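The three `pass_rate_*` fields make it possible to pick records by relative difficulty. As a minimal sketch, the function below keeps questions that DeepSeek-R1 usually solves but the 1.5B model often fails. The function name and both thresholds are illustrative assumptions, not values recommended by the dataset authors.

```python
def select_by_difficulty(records, r1_min=0.75, small_max=0.5):
    """Keep records that R1 usually solves but the 1.5B model often fails.

    r1_min and small_max are hypothetical thresholds chosen for illustration.
    """
    selected = []
    for rec in records:
        r1 = rec.get("pass_rate_r1")
        small = rec.get("pass_rate_1.5b")
        if r1 is None or small is None:
            continue  # no pass rates available for this record
        if r1 >= r1_min and small <= small_max:
            selected.append(rec)
    return selected
```

With the default thresholds, the sample record shown earlier (`pass_rate_r1` of 1.0 and `pass_rate_1.5b` of 0.5) would be kept.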
## How we compute each model's pass rate
**1. Obtaining the verify_score for current data**
Different types of data have different methods for calculating verify_score:
- For mathematics, we use [Math-Verify](https://github.com/huggingface/Math-Verify).
- For code, sandbox testing is conducted against test cases.
- For instruction follow, testing is conducted based on instruction_constrain.
- For science, the model evaluates the match between generated results and correct answers.
- For other types, scoring is done using a reward model.
We selected **very strict score validation criteria for evaluating the model responses**.
For math, code, and instruction follow, we consider a model response correct if the score exceeds 0.99 (assigned a value of 1; similarly for the following cases).
For science, we evaluate the match between the generated results and the correct answers using the [Qwen2.5-7B-Instruct model](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct), with scores in the range [0, 5]; responses are considered correct when the score exceeds 4.99.
For other data types, we first use Decision-Tree-Reward-Llama-3.1-8B to obtain coherence, correctness, and helpfulness values. We then calculate a reward score using the formula:
$$reward = (coherence + correctness + helpfulness) / 12$$
Based on observations, we regard model responses as high-quality when the reward score exceeds 0.7.
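The thresholds above can be summarized in a short sketch. The function names are hypothetical, and the category strings follow the `category` field. Dividing by 12 suggests that each of the three reward attributes lies in [0, 4], but that range is an inference, not something stated here.

```python
def reward_score(coherence: float, correctness: float, helpfulness: float) -> float:
    """Combine the three reward-model attributes into a single reward."""
    return (coherence + correctness + helpfulness) / 12


def binarize_verify_score(category: str, raw_score: float) -> float:
    """Map a raw score to 1.0 (correct / high quality) or 0.0 using the stated thresholds."""
    if category in {"math", "code", "instruction follow"}:
        threshold = 0.99
    elif category == "science":
        threshold = 4.99  # judge score in [0, 5]
    else:
        threshold = 0.7  # "other": reward score from reward_score()
    return 1.0 if raw_score > threshold else 0.0
```

For instance, attribute values of 3, 3, and 3 give a reward of 0.75, which exceeds 0.7 and therefore counts as high quality.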
**2. Calculating the pass rate of the model**
$$ pass\_rate = \frac{\sum_{i=1}^{n} verify\_score_i}{n}, \quad n=4 $$
Here $n=4$ is the number of responses sampled from each model for each query.
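As a sanity check, the pass rates can be recomputed from the per-response `verify_score` values by averaging over each model's responses to a question. The sketch below assumes that all responses from one model to one question share the same `question` and `model_name` values; the function name is hypothetical.

```python
from collections import defaultdict


def pass_rates(records):
    """Map (question, model_name) -> mean verify_score over that model's responses."""
    scores = defaultdict(list)
    for rec in records:
        scores[(rec["question"], rec["model_name"])].append(rec["verify_score"])
    return {key: sum(vals) / len(vals) for key, vals in scores.items()}
```

Two correct responses out of four give a pass rate of 0.5, which matches the `pass_rate_7b` value in the sample record.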
## Data statistics
#### Instruction sources
| Source | Count |
| --- | --- |
| OpenHermes-2.5 | 762544 |
| NuminaMath | 469071 |
| InfinityInstruct | 368767 |
| tulu-3-sft-mixture | 341249 |
| natural_reasoning | 253676 |
| KodCode | 241827 |
| BigMathVerified | 218977 |
| MetaMathQA | 130915 |
| opencoder | 91735 |
| ... | ... |
Note: A total of **30 data sources** are included.
#### Data Type Distribution
| Category | Count |
| --- | --- |
| math | 828639 |
| science | 91637 |
| code | 489363 |
| instruction follow | 76520 |
| other | 1850985 |
#### Pass Rate Distribution
Below, we present the pass rate distributions of the three models across four categories of data: math, code, science, and instruction following.

**We can’t wait to see more developers use this dataset to enhance model capabilities and further contribute feedback to the open-source community.** 🤗
## Limitations and Usage Limits
Developers should strictly limit the use of this project’s open-sourced code, data, models, and related artifacts to research purposes only. Commercial exploitation and any use that may potentially cause harm are explicitly prohibited.
The content within this dataset does not represent the views, interests, or beliefs of any individual or organization, nor does it imply endorsement or claims related to any groups. The creators of this project accept no responsibility for any potential damage or harm resulting from the use of this dataset or associated resources.
Each unique question in the dataset appears 12 times, each time with a different answer (3 models × 4 sampled responses). Users should account for this duplication when using the data.
Additionally, we have observed potential inaccuracies in some of the `ground_truth` labels. Users are encouraged to carefully review the data, as there might be better approaches for calculating metrics like `verify_score` and `pass_rate`.
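Because each question appears 12 times, a common first step is to keep a single response per question. The sketch below keeps, for every question, the verified response (`verify_score` of 1.0) with the lowest `ppl`. This selection rule is only an illustrative assumption; other policies, such as preferring a particular model, may suit a given training setup better.

```python
def best_response_per_question(records):
    """Keep one verified response per question, preferring the lowest perplexity."""
    best = {}
    for rec in records:
        if rec.get("verify_score") != 1.0:
            continue  # skip responses that did not pass verification
        current = best.get(rec["question"])
        if current is None or rec["ppl"] < current["ppl"]:
            best[rec["question"]] = rec
    return list(best.values())
```

Questions with no verified response are dropped entirely under this rule.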
## Citation
If you use this data, please cite it with the following BibTeX entry:
```
@misc{tian2025deepdistillenhancingllmreasoning,
title={DeepDistill: Enhancing LLM Reasoning Capabilities via Large-Scale Difficulty-Graded Data Training},
author={Xiaoyu Tian and Sitong Zhao and Haotian Wang and Shuaiting Chen and Yiping Peng and Yunjie Ji and Han Zhao and Xiangang Li},
year={2025},
eprint={2504.17565},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2504.17565},
}
```
Provider: maas
Created: 2025-04-28



