platinum-bench

Name: platinum-bench
Creator: maas
Published: 2025-11-07 16:22:49
License: No description provided

ModelScope Community. Updated 2025-11-07; indexed 2025-02-15.

Download link:

https://modelscope.cn/datasets/madrylab/platinum-bench


Resource description:

# Dataset Card for PlatinumBench

[**🏆 Leaderboard**](http://platinum-bench.csail.mit.edu/)  |  [**🖥️ Code**](https://github.com/MadryLab/platinum-benchmarks/)  |  [**📖 Paper**](https://arxiv.org/abs/2502.03461)  |  [**🔍 Error Viewer**](http://platinum-bench.csail.mit.edu/inspect)  |  [**🛠 Changelog**](https://github.com/MadryLab/platinum-benchmarks/blob/main/CHANGELOG.md)

## Dataset Description

- **Homepage:** http://platinum-bench.csail.mit.edu/
- **Repository:** https://github.com/MadryLab/platinum-benchmarks/
- **Paper:** https://arxiv.org/abs/2502.03461
- **Leaderboard:** http://platinum-bench.csail.mit.edu/
- **Point of Contact:** [Joshua Vendrow](mailto:jvendrow@mit.edu), [Edward Vendrow](mailto:evendrow@mit.edu)

### Dataset Summary

_**Platinum Benchmarks**_ are benchmarks that are carefully curated to minimize label errors and ambiguity, allowing us to measure the reliability of models.

This dataset contains fifteen platinum benchmarks created by manually revising questions from existing datasets (see the GitHub repo for details on accessing our revised subset of VQA). To revise each benchmark, we ran a variety of frontier models on individual examples and manually re-annotated any example for which at least one model made an error. See the paper for further details on the revision process.

### Load the Dataset

To load the dataset using HuggingFace `datasets`, first run `pip install datasets`, then run the following code:

```python
from datasets import load_dataset

ds = load_dataset("madrylab/platinum-bench", name="gsm8k", split="test")  # or another subset
ds = ds.filter(lambda x: x['cleaning_status'] != 'rejected')  # filter out rejected questions
```

## Dataset structure

### Dataset Subsets & Cleaning Statistics

Below we list each of the platinum benchmarks with the number of examples in each benchmark that we kept via consensus, revised, verified, or rejected. See "Data Fields" for a description of what each cleaning status means.
| Dataset | **# Included** | Consensus (included) | Revised (included) | Verified (included) | Rejected (excluded) |
| ----- | ----- | ----- | ----- | ----- | ----- |
| SingleOp (Platinum) | **150** | 142 | 0 | 8 | 9 |
| SingleEq (Platinum) | **100** | 87 | 0 | 13 | 9 |
| MultiArith (Platinum) | **170** | 164 | 3 | 3 | 4 |
| SVAMP (Platinum) | **265** | 220 | 3 | 42 | 35 |
| GSM8K (Platinum) | **268** | 221 | 1 | 46 | 32 |
| MMLU High-School Math (Platinum) | **267** | 105 | 0 | 162 | 3 |
| Logic Ded. 3-Obj (Platinum) | **200** | 159 | 0 | 41 | 0 |
| Object Counting (Platinum) | **190** | 57 | 0 | 133 | 10 |
| Navigate (Platinum) | **200** | 118 | 0 | 82 | 0 |
| TabFact (Platinum) | **169** | 56 | 3 | 110 | 31 |
| HotPotQA (Platinum) | **181** | 48 | 88 | 45 | 69 |
| SQuAD2.0 (Platinum) | **161** | 69 | 43 | 49 | 89 |
| DROP (Platinum) | **209** | 27 | 179 | 3 | 41 |
| Winograd WSC (Platinum) | **195** | 77 | 0 | 118 | 5 |
| VQA (Platinum) | **242** | 0 | 242 | 0 | 358 |

### Data Instances

We accessed each of the fourteen original natural language benchmarks that we revised from their respective HuggingFace repositories, and each benchmark had its own per-instance data fields/columns. We have standardized these benchmarks by providing pre-constructed prompts for each dataset (under 'platinum_prompt'). Each prompt template automatically formats the relevant dataset columns into a consistent structure. You can use these standardized prompts directly, but we include the original dataset columns for those interested in their own prompting, or to seamlessly substitute our revised benchmarks for the original versions.

For VQA, we source images and annotations from their [official website](https://visualqa.org/download.html), and reference images by their image path in the original downloaded directory format (see our GitHub repository for additional details).
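Returning to the cleaning statistics table above, its counts can be sanity-checked with a few lines of Python. The sketch below copies the table rows by hand into a dictionary and confirms that each benchmark's included count equals consensus + revised + verified. The names `CLEANING_STATS` and `check_rows` are illustrative assumptions and are not part of the dataset or its code.

```python
# Rows copied by hand from the cleaning statistics table:
# name -> (included, consensus, revised, verified, rejected).
# CLEANING_STATS is a hypothetical name used only for this sketch.
CLEANING_STATS = {
    "SingleOp": (150, 142, 0, 8, 9),
    "SingleEq": (100, 87, 0, 13, 9),
    "MultiArith": (170, 164, 3, 3, 4),
    "SVAMP": (265, 220, 3, 42, 35),
    "GSM8K": (268, 221, 1, 46, 32),
    "MMLU High-School Math": (267, 105, 0, 162, 3),
    "Logic Ded. 3-Obj": (200, 159, 0, 41, 0),
    "Object Counting": (190, 57, 0, 133, 10),
    "Navigate": (200, 118, 0, 82, 0),
    "TabFact": (169, 56, 3, 110, 31),
    "HotPotQA": (181, 48, 88, 45, 69),
    "SQuAD2.0": (161, 69, 43, 49, 89),
    "DROP": (209, 27, 179, 3, 41),
    "Winograd WSC": (195, 77, 0, 118, 5),
    "VQA": (242, 0, 242, 0, 358),
}


def check_rows(stats):
    """Return names of rows where included != consensus + revised + verified."""
    return [
        name
        for name, (included, consensus, revised, verified, _rejected) in stats.items()
        if included != consensus + revised + verified
    ]


inconsistent = check_rows(CLEANING_STATS)
total_included = sum(row[0] for row in CLEANING_STATS.values())
total_rejected = sum(row[4] for row in CLEANING_STATS.values())

print("inconsistent rows:", inconsistent)
print("total included:", total_included)
print("total rejected:", total_rejected)
```

The same tally could be run against the loaded dataset by counting `cleaning_status` values per subset, which would check the table against the data itself rather than against its own arithmetic.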
An example from the PlatinumBench GSM8K subset looks as follows:

```
{'cleaning_status': 'consensus',
 'platinum_prompt': 'Solve the following math word problem.\n\nA robe takes 2 bolts of blue fiber and half that much white fiber. How many bolts in total does it take?\n\nThink step-by-step. Then, provide the final answer as a single integer in the format "Answer: XXX" with no extra formatting.',
 'platinum_prompt_no_cot': 'Solve the following math word problem.\n\nA robe takes 2 bolts of blue fiber and half that much white fiber. How many bolts in total does it take?\n\nThen, provide the final answer as a single integer in the format "Answer: XXX" with no extra formatting.',
 'platinum_target': ['3'],
 'platinum_parsing_strategy': 'math',
 'original_target': ['3'],
 'question': 'A robe takes 2 bolts of blue fiber and half that much white fiber. How many bolts in total does it take?',
 'answer': 'It takes 2/2=<<2/2=1>>1 bolt of white fiber\nSo the total amount of fabric is 2+1=<<2+1=3>>3 bolts of fabric\n#### 3'}
```

### Data Fields

- **cleaning_status** (`str`): One of:
  1. *consensus*: all LLMs agreed with the label, so the example was not manually reviewed (`platinum_target` == `original_target` by default).
  2. *verified*: the original target was manually verified to be correct (`platinum_target` == `original_target`).
  3. *revised*: the label is updated from the original label (`platinum_target` != `original_target`).
  4. *rejected*: the example is removed due to issues such as ambiguity.
- **platinum_prompt** (`str`): A chain-of-thought question prompt that can be directly asked to a language model. This is constructed from fields in the original dataset.
- **platinum_prompt_no_cot** (`str`): The same prompt, but without explicit chain-of-thought instructions. This is used for models like `o1` that don't need chain-of-thought prompting.
- **platinum_target** (`List[str]`): The list of all correct answers for the question. In most cases there is just one correct answer.
- **original_target** (`str`): The original target provided in the dataset. This can be different from the platinum target if it is incorrect.
- **platinum_parsing_strategy** (`str`): The parser that should be used to parse the LLM answer. Refer to the provided code.
- **image_path** (`str`): Only included for VQA. The image path from which to source the relevant image, such as: `'val2014/COCO_val2014_000000304481.jpg'`.
- We also include all the original dataset columns after these ones.

> [!NOTE]
> This HuggingFace dataset includes rejected questions that are not used for evaluation. To use only questions that we include in our platinum benchmarks, make sure to filter these out:
>
>`ds = ds.filter(lambda x: x['cleaning_status'] != 'rejected')`

### Prompt Example

Here is an example of the standardized prompt we provide for a question from MultiArith:

```
Solve the following math word problem.

At the schools book fair Sam bought 13 adventure books and 17 mystery books. If 15 of the books were used, how many new books did he buy?

Think step-by-step. Then, provide the final answer as a single number in the format "Answer: XXX" with no extra formatting.
```

The specific prompt template and parsing strategy depend on the model, although many of them are common between datasets.

## Dataset Creation

### Curation Rationale

Many current LLM benchmarks are riddled with label noise such as mislabeled or ambiguous questions. Due to this label noise, progress in these benchmarks often stalls before models actually achieve reliable performance on them. As a result, the community often considers these benchmarks to be "saturated" and discards them too early, discouraging machine learning practitioners from ever striving to achieve proper reliability. As a first step towards addressing this gap in benchmarking practices, we revise samples from fifteen "saturated" benchmarks to minimize label noise.
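To make the fields described under "Data Fields" and the `Answer: XXX` format from "Prompt Example" concrete, here is a minimal sketch of scoring one model response against `platinum_target`. It is an illustration under stated assumptions only: the helper names `extract_answer` and `is_correct` are hypothetical, the comparison is a plain string match, and none of this reproduces the parsers selected by `platinum_parsing_strategy` in the official code. The sample response is hand-written for the GSM8K example shown earlier.

```python
import re


def extract_answer(response: str):
    """Return the text after the last 'Answer:' marker, or None if absent.

    Hypothetical helper: assumes the model followed the 'Answer: XXX'
    format requested by the standardized prompt.
    """
    matches = re.findall(r"Answer:\s*(.+)", response)
    return matches[-1].strip() if matches else None


def is_correct(response: str, platinum_target: list) -> bool:
    """Exact string match against any accepted answer in platinum_target."""
    answer = extract_answer(response)
    return answer is not None and answer in platinum_target


# A row reduced to the fields this sketch needs (values from the GSM8K example).
row = {"cleaning_status": "consensus", "platinum_target": ["3"]}

# Hand-written stand-in for a model response.
response = "White fiber: 2 / 2 = 1 bolt. Total: 2 + 1 = 3 bolts.\nAnswer: 3"

if row["cleaning_status"] != "rejected":
    print(is_correct(response, row["platinum_target"]))
```

Because `platinum_target` is a list, membership testing covers the cases where more than one answer is accepted. Exact string matching is a deliberate simplification, so answers such as `3.0` would need the dataset's own parsing strategy to be scored properly.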
### Source Data and Attribution

Each of the fifteen benchmarks that we revise was sourced from the following HuggingFace repositories (VQA is sourced from its official website):

| Dataset | Type | URL | Subset | Split | License |
| ----- | ------ | ----- | ---- | ---- | ---- |
| SingleOp | Math | https://huggingface.co/datasets/allenai/lila | singleop | test | [CC BY 4.0](https://github.com/allenai/Lila/blob/main/LICENSE.txt) |
| SingleEq | Math | https://huggingface.co/datasets/allenai/lila | singleeq | test | [CC BY 4.0](https://github.com/allenai/Lila/blob/main/LICENSE.txt) |
| MultiArith | Math | https://huggingface.co/datasets/allenai/lila | multiarith | test | [CC BY 4.0](https://github.com/allenai/Lila/blob/main/LICENSE.txt) |
| SVAMP | Math | https://huggingface.co/datasets/ChilleD/svamp | default | test | [MIT](https://github.com/arkilpatel/SVAMP/blob/main/LICENSE) |
| GSM8K | Math | https://huggingface.co/datasets/openai/gsm8k | main | test | [MIT](https://github.com/openai/grade-school-math/blob/master/LICENSE) |
| MMLU High-School Math | Math | https://huggingface.co/datasets/cais/mmlu | high_school_mathematics | test | [MIT](https://github.com/hendrycks/test/blob/master/LICENSE) |
| Logic Ded. 3-Obj | Logic | https://huggingface.co/datasets/maveriq/bigbenchhard | logical_deduction_three_objects | train | [MIT](https://github.com/suzgunmirac/BIG-Bench-Hard/blob/main/LICENSE) |
| Object Counting | Logic | https://huggingface.co/datasets/maveriq/bigbenchhard | object_counting | train | [MIT](https://github.com/suzgunmirac/BIG-Bench-Hard/blob/main/LICENSE) |
| Navigate | Logic | https://huggingface.co/datasets/maveriq/bigbenchhard | navigate | train | [MIT](https://github.com/suzgunmirac/BIG-Bench-Hard/blob/main/LICENSE) |
| TabFact | Table Understanding | https://huggingface.co/datasets/wenhu/tab_fact | tab_fact | test | [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/legalcode) |
| HotPotQA | Reading Comp. | https://huggingface.co/datasets/hotpotqa/hotpot_qa | distractor | validation | [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/legalcode) |
| SQuAD2.0 | Reading Comp. | https://huggingface.co/datasets/rajpurkar/squad_v2 | squad_v2 | validation | [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/legalcode) |
| DROP | Reading Comp. | https://huggingface.co/datasets/ucinlp/drop | default | validation | [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/legalcode) |
| Winograd WSC | Commonsense | https://huggingface.co/datasets/ErnestSDavis/winograd_wsc | wsc285 | test | [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/legalcode) |
| VQA | Vision | https://visualqa.org/download.html | N/A | validation | [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/legalcode) |

Please refer to the dataset cards of these benchmarks for further details on their collection and annotation process.

## Additional Information

### Licensing Information

See the table above for the licensing information of the original datasets upon which our work is based. The further annotations we provide are licensed under the [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/legalcode) license.

### Citation Information

Cite this dataset and the source datasets (see [sources.bib](https://github.com/MadryLab/platinum-benchmarks/blob/main/sources.bib)).

```
@misc{vendrow2025largelanguagemodelbenchmarks,
      title={Do Large Language Model Benchmarks Test Reliability?},
      author={Joshua Vendrow and Edward Vendrow and Sara Beery and Aleksander Madry},
      year={2025},
      eprint={2502.03461},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2502.03461},
}
```


Provider:

maas

Created:

2025-02-09
