R1-Distill Math Reasoning Test Set

Name: R1-Distill Math Reasoning Test Set (R1蒸馏模型数学推理能力测试集)
Creator: maas
Published: 2026-05-16 13:18:46
License: not specified

ModelScope community; updated 2026-05-16; indexed 2025-08-30

Download link:

https://modelscope.cn/datasets/evalscope/R1-Distill-Math-Test-v2


Resource overview:

> [!NOTE]
> This dataset is compatible with EvalScope v1.0.0. If you are using an older EvalScope release (v0.xx), use the legacy [dataset](https://modelscope.cn/datasets/modelscope/R1-Distill-Math-Test) instead.

# R1-Distill Math Reasoning Test Set

The test set contains 728 math reasoning problems drawn from three sources:

- [MATH-500](https://www.modelscope.cn/datasets/HuggingFaceH4/aime_2024): a set of challenging high-school competition math problems covering seven subjects (such as prealgebra, algebra, and number theory), 500 problems in total.
- [GPQA-Diamond](https://modelscope.cn/datasets/AI-ModelScope/gpqa_diamond/summary): graduate-level multiple-choice questions in the physics, chemistry, and biology subdomains, 198 questions in total.
- [AIME-2024](https://modelscope.cn/datasets/AI-ModelScope/AIME_2024): problems from the American Invitational Mathematics Examination, 30 problems in total.

For more detailed usage, see [Best Practices for Evaluating the Math Ability of DeepSeek-R1-Style Models](https://evalscope.readthedocs.io/zh-cn/latest/best_practice/deepseek_r1_distill.html).

## Usage

### Install dependencies

Install the [EvalScope](https://github.com/modelscope/evalscope) model evaluation framework. Because the framework is iterating quickly and its interfaces may be unstable, installing from source is recommended:

```bash
git clone https://github.com/modelscope/evalscope.git
cd evalscope/
pip install -e '.[app]'
```

### Deploy the model

Serving the model through an inference framework speeds up evaluation. The following commands deploy the DeepSeek-R1-Distill-Qwen-1.5B model.

**Using vLLM:**

```bash
VLLM_USE_MODELSCOPE=True CUDA_VISIBLE_DEVICES=0 python -m vllm.entrypoints.openai.api_server \
  --model deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
  --served-model-name DeepSeek-R1-Distill-Qwen-1.5B \
  --trust_remote_code \
  --port 8801
```

**Using lmdeploy:**

```bash
LMDEPLOY_USE_MODELSCOPE=True CUDA_VISIBLE_DEVICES=0 lmdeploy serve api_server \
  deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
  --model-name DeepSeek-R1-Distill-Qwen-1.5B \
  --server-port 8801
```

### Evaluate the model

Run the following Python code to evaluate DeepSeek-R1-Distill-Qwen-1.5B on the math reasoning dataset:

```python
from evalscope import TaskConfig, run_task
from evalscope.constants import EvalType

task_cfg = TaskConfig(
    model='DeepSeek-R1-Distill-Qwen-1.5B',
    api_url='http://127.0.0.1:8801/v1/chat/completions',
    api_key='EMPTY',
    eval_type=EvalType.SERVICE,
    datasets=[
        'data_collection',
    ],
    dataset_args={
        'data_collection': {
            # ID of the legacy dataset; the note above suggests that EvalScope v1.0.0
            # should use the v2 dataset (evalscope/R1-Distill-Math-Test-v2) instead
            'dataset_id': 'modelscope/R1-Distill-Math-Test'
        }
    },
    eval_batch_size=64,  # number of workers used to send requests
    generation_config={
        'max_tokens': 20000,  # large enough to avoid exceeding the maximum length
        'temperature': 0.6,
        'top_p': 0.95,
        'n': 5  # number of generations per prompt (note: lmdeploy only supports n=1)
    },
)

run_task(task_cfg=task_cfg)
```

Output:

**The metric reported here is Pass@1. Each sample was generated 5 times, and the final score is the average over those 5 generations.**

```text
2025-02-10 20:42:56,050 - evalscope - INFO - dataset_level Report:
+-----------+--------------+---------------+-------+
| task_type | dataset_name | average_score | count |
+-----------+--------------+---------------+-------+
| math      | math_500     | 0.7832        | 500   |
| math      | gpqa         | 0.3434        | 198   |
| math      | aime24       | 0.2           | 30    |
+-----------+--------------+---------------+-------+
```

### Visualize the results

EvalScope can visualize evaluation results so that you can inspect the model's actual outputs. Run the following command to start the visualization interface:

```bash
evalscope app
```

The command prints a link like the following:

```text
* Running on local URL:  http://0.0.0.0:7860
```

Open the link to reach the visualization interface shown below. Select an evaluation report first, then click Load:

<img src="https://sail-moe.oss-cn-hangzhou.aliyuncs.com/yunlin/images/distill/score.png" alt="alt text" width="100%">

You can also select a sub-dataset to view the model's outputs and check whether they are correct (or whether the answer matching itself is at fault):

<img src="https://sail-moe.oss-cn-hangzhou.aliyuncs.com/yunlin/images/distill/detail.png" alt="alt text" width="100%">

## How the dataset was built

The collection was built with [EvalScope](https://github.com/modelscope/evalscope), following the [collection tutorial](https://evalscope.readthedocs.io/zh-cn/latest/advanced_guides/collection/index.html):

```python
from evalscope.collections import WeightedSampler, CollectionSchema, DatasetInfo
from evalscope.utils.io_utils import dump_jsonl_data

schema = CollectionSchema(name='DeepSeekDistill', datasets=[
    CollectionSchema(name='Math', datasets=[
        DatasetInfo(name='math_500', weight=1, task_type='math', tags=['en'], args={'few_shot_num': 0}),
        DatasetInfo(name='gpqa', weight=1, task_type='math', tags=['en'], args={'subset_list': ['gpqa_diamond'], 'few_shot_num': 0}),
        DatasetInfo(name='aime24', weight=1, task_type='math', tags=['en'], args={'few_shot_num': 0}),
    ])
])

# get the mixed data
mixed_data = WeightedSampler(schema).sample(100000)  # set a large number to ensure all datasets are sampled
dump_jsonl_data(mixed_data, 'test.jsonl')
```

## Download

:modelscope-code[]{type="sdk"}

:modelscope-code[]{type="git"}
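The Pass@1 averaging described in the evaluation section can be illustrated with a small standalone sketch. It assumes each problem has already been graded into one correct/incorrect flag per generation; the function names and the toy grading data are hypothetical and are not part of EvalScope, whose internal computation may differ. The last step pools the three subset scores from the report table by sample count, which is a derived figure that the source does not report.

```python
from statistics import mean


def pass_at_1(correct_flags):
    """Pass@1 for one problem: fraction of its generations graded correct."""
    return mean(1.0 if ok else 0.0 for ok in correct_flags)


def dataset_score(per_problem_flags):
    """Dataset score: mean of the per-problem Pass@1 values."""
    return mean(pass_at_1(flags) for flags in per_problem_flags)


def pooled_score(report):
    """Count-weighted average over subsets; report is a list of (score, count)."""
    total = sum(count for _, count in report)
    return sum(score * count for score, count in report) / total


# Hypothetical grading results: 3 problems, 5 generations each (n=5).
toy = [
    [True, True, False, True, True],       # 4 of 5 correct
    [False, False, False, False, False],   # 0 of 5 correct
    [True, True, True, True, True],        # 5 of 5 correct
]
print(round(dataset_score(toy), 4))  # 0.6

# Subset scores and counts copied from the dataset_level report.
report = [(0.7832, 500), (0.3434, 198), (0.2, 30)]
print(round(pooled_score(report), 4))  # 0.6396
```

When every problem has the same number of generations, averaging per-problem Pass@1 gives the same value as averaging the accuracy of each of the 5 runs, which is how the source describes the reported score.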


Provider:

maas

Created:

2025-08-27

Collection summary

Dataset introduction