gsm8k-platinum

ModelScope community, updated 2026-03-31, indexed 2025-03-08
Download link:
https://modelscope.cn/datasets/AI-ModelScope/gsm8k-platinum
Resource description:
# Dataset Card for GSM8K-Platinum

[**🏆 Homepage**](http://platinum-bench.csail.mit.edu/) &nbsp;|&nbsp; [**📣 Blog**](https://gradientscience.org/gsm8k-platinum/) &nbsp;|&nbsp; [**🖥️ Code**](https://github.com/MadryLab/platinum-benchmarks/) &nbsp;|&nbsp; [**📖 Paper**](https://arxiv.org/abs/2502.03461) &nbsp;|&nbsp; [**🔍 Error Viewer**](http://platinum-bench.csail.mit.edu/inspect?model=o1-2024-12-17-high&dataset=gsm8k_full)

## Dataset Description

- **Homepage:** http://platinum-bench.csail.mit.edu/
- **Repository:** https://github.com/MadryLab/platinum-benchmarks/
- **Paper:** https://arxiv.org/abs/2502.03461
- **Leaderboard:** http://platinum-bench.csail.mit.edu/
- **Point of Contact:** [Edward Vendrow](mailto:evendrow@mit.edu), [Joshua Vendrow](mailto:jvendrow@mit.edu)

### Dataset Summary

_**GSM8K-Platinum**_ is a revised version of the full test set of GSM8K (Grade School Math 8K), a dataset of grade school math word problems, providing a more accurate assessment of mathematical reasoning capabilities.

To revise this dataset, we ran a variety of frontier models on each individual example and manually examined any example for which at least one model made an error. We revised the labels of mislabeled examples and removed any question that we determined to be poorly written (most often due to ambiguity in the problem statement). See our [paper](https://arxiv.org/abs/2502.03461) for further details on the revision process and our criteria for "bad" questions.

Please refer to the original GSM8K dataset at: [https://huggingface.co/datasets/openai/gsm8k](https://huggingface.co/datasets/openai/gsm8k).

<p align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/630b1e44cd26ad7f60d490e2/cAt7JFohPNFRYom5OMXTD.png" alt="Comparing GSM8K to GSM8K-Platinum" width=700 />
</p>

### Load the Dataset

We keep the original data columns from `openai/gsm8k`, so `madrylab/gsm8k-platinum` can be used directly as a drop-in replacement for the original GSM8K dataset.

To load the dataset using HuggingFace `datasets`, you first need to `pip install datasets`, then run the following code:

```python
from datasets import load_dataset

# "main" is the configuration name; only the revised test split is loaded here.
ds = load_dataset("madrylab/gsm8k-platinum", "main", split="test")
```

## Dataset structure

### Dataset Subsets & Cleaning Statistics

| GSM8K (Test) | # Flagged by Models | # Rejected | # Re-labeled | # Verified | GSM8K-Platinum |
| ----- | ----- | ----- | ----- | ----- | ----- |
| 1319 | 219 | 110 | 10 | 99 | 1209 |

### Data Instances

An example from **GSM8K-Platinum** looks as follows:

```
{
    'question': 'A robe takes 2 bolts of blue fiber and half that much white fiber. How many bolts in total does it take?',
    'answer': 'It takes 2/2=<<2/2=1>>1 bolt of white fiber\nSo the total amount of fabric is 2+1=<<2+1=3>>3 bolts of fabric\n#### 3',
    'cleaning_status': 'consensus'
}
```

### Data Fields

- **question** (`str`): The question to a grade school math problem.
- **answer** (`str`): The full solution to the question. It contains multiple steps of reasoning with calculator annotations and the final numeric solution.
- **cleaning_status** (`str`): One of:
  1. *consensus*: all LLMs agreed with the label, so the example was not manually reviewed.
  2. *verified*: the original target was manually verified to be correct.
  3. *revised*: the answer is updated from the original answer.

### Prompt Example

During our revision process, we used the following zero-shot prompt to query models with questions from GSM8K:

```
Solve the following math word problem.

A robe takes 2 bolts of blue fiber and half that much white fiber. How many bolts in total does it take?

Think step-by-step. Then, provide the final answer as a single integer in the format "Answer: XXX" with no extra formatting.
```

The instruction to "think step-by-step" was excluded for reasoning models.

## Dataset Creation

### Curation Rationale

GSM8K is one of a number of LLM benchmarks that contain significant label noise such as mislabeled or ambiguous questions.
Due to this label noise, progress on these benchmarks often stalls before models actually achieve reliable performance on them. As a result, the community often considers these benchmarks to be "saturated" and discards them too early, discouraging machine learning practitioners from ever striving to achieve proper reliability.

In our [previous work](https://arxiv.org/abs/2502.03461), we revised a number of such benchmarks, including a 300-example subset of the GSM8K test set (these revised benchmarks are publicly available at: [https://huggingface.co/datasets/madrylab/platinum-bench](https://huggingface.co/datasets/madrylab/platinum-bench)). To further aid all who currently use GSM8K for evaluation (e.g., during the model development process), we have decided to revise the full GSM8K test set. By doing so, **GSM8K-Platinum** now serves as a natural and easy drop-in for the original GSM8K test set.

### Source Data and Attribution

We sourced GSM8K from OpenAI's official Hugging Face repository: [https://huggingface.co/datasets/openai/gsm8k](https://huggingface.co/datasets/openai/gsm8k). This dataset is released under the [MIT](https://github.com/openai/grade-school-math/blob/master/LICENSE) license. Please refer to the GSM8K dataset card for further details on their collection and annotation process.

## Additional Information

### Licensing Information

The further annotations we provide are licensed under the [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/legalcode) license.

### Citation Information

Cite this dataset as well as the original GSM8K dataset.

```
@misc{vendrow2025largelanguagemodelbenchmarks,
      title={Do Large Language Model Benchmarks Test Reliability?},
      author={Joshua Vendrow and Edward Vendrow and Sara Beery and Aleksander Madry},
      year={2025},
      eprint={2502.03461},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2502.03461},
}
```

```
@article{cobbe2021gsm8k,
  title={Training Verifiers to Solve Math Word Problems},
  author={Cobbe, Karl and Kosaraju, Vineet and Bavarian, Mohammad and Chen, Mark and Jun, Heewoo and Kaiser, Lukasz and Plappert, Matthias and Tworek, Jerry and Hilton, Jacob and Nakano, Reiichiro and Hesse, Christopher and Schulman, John},
  journal={arXiv preprint arXiv:2110.14168},
  year={2021}
}
```
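
As a usage note for the `answer` field described under Data Fields, the following is a minimal sketch of how the final numeric answer and the reasoning text might be separated. It assumes, based only on the data instance shown in this card, that the final answer follows a `####` marker and that calculator annotations are wrapped in `<<...>>`. The helper names `split_answer` and `strip_calculator_annotations` are hypothetical and are not part of the dataset or of any library.

```python
import re

# Assumption: calculator annotations look like <<2/2=1>>, as in the card's example.
CALC_ANNOTATION = re.compile(r"<<[^<>]*>>")


def split_answer(answer: str) -> tuple[str, str]:
    """Split a GSM8K-style answer string into (reasoning, final_answer).

    Assumes the final answer follows the last '####' marker.
    """
    reasoning, marker, final = answer.rpartition("####")
    if not marker:
        raise ValueError("answer has no '####' final-answer marker")
    return reasoning.strip(), final.strip()


def strip_calculator_annotations(reasoning: str) -> str:
    """Remove <<...>> calculator annotations, leaving readable reasoning text."""
    return CALC_ANNOTATION.sub("", reasoning)


# The data instance shown in this card.
example = {
    "question": "A robe takes 2 bolts of blue fiber and half that much white fiber. "
                "How many bolts in total does it take?",
    "answer": "It takes 2/2=<<2/2=1>>1 bolt of white fiber\n"
              "So the total amount of fabric is 2+1=<<2+1=3>>3 bolts of fabric\n"
              "#### 3",
    "cleaning_status": "consensus",
}

reasoning, final = split_answer(example["answer"])
print(final)
print(strip_calculator_annotations(reasoning))
```

The final answer is kept as a string here so that the caller decides how to compare it with a model's output.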

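The zero-shot prompt described under Prompt Example could be assembled and its response parsed roughly as follows. This is a sketch under stated assumptions, not the authors' evaluation code: the exact whitespace of the template, the wording used for reasoning models once "think step-by-step" is dropped, and the names `build_prompt` and `extract_answer` are all illustrative guesses.

```python
import re
from typing import Optional

# Assumption: blank lines separate the instruction, the question, and the
# output-format request; the card shows only the flattened prompt text.
PROMPT_TEMPLATE = (
    "Solve the following math word problem.\n\n"
    "{question}\n\n"
    "{cot}Then, provide the final answer as a single integer "
    'in the format "Answer: XXX" with no extra formatting.'
)

# Matches the requested "Answer: XXX" format with an integer value.
ANSWER_PATTERN = re.compile(r"Answer:\s*(-?\d+)")


def build_prompt(question: str, reasoning_model: bool = False) -> str:
    """Fill the template; omit the step-by-step instruction for reasoning models."""
    cot = "" if reasoning_model else "Think step-by-step. "
    return PROMPT_TEMPLATE.format(question=question, cot=cot)


def extract_answer(response: str) -> Optional[int]:
    """Return the last integer given in 'Answer: XXX' form, or None if absent."""
    matches = ANSWER_PATTERN.findall(response)
    return int(matches[-1]) if matches else None


question = (
    "A robe takes 2 bolts of blue fiber and half that much white fiber. "
    "How many bolts in total does it take?"
)
print(build_prompt(question))
print(extract_answer("White fiber: 2/2 = 1. Total: 2 + 1 = 3.\nAnswer: 3"))
```

Taking the last match is a design choice for responses that restate intermediate answers before the final one; a stricter evaluator might instead reject any response with more than one match.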
Provider:
maas
Created:
2025-03-13