platinum-bench-paper-version

Name: platinum-bench-paper-version
Creator: maas
Published: 2025-11-07 16:22:46
License: No description provided

ModelScope community | Updated 2025-11-07 | Indexed 2025-02-15

Download link:

https://modelscope.cn/datasets/madrylab/platinum-bench-paper-version


Resource description:

# Dataset Card for PlatinumBench (Paper Version)

[**🏆 Leaderboard**](http://platinum-bench.csail.mit.edu/)  |  [**🖥️ Code**](https://github.com/MadryLab/platinum-benchmarks/)  |  [**📖 Paper**](https://arxiv.org/abs/2502.03461)

## Dataset Description

- **Homepage:** http://platinum-bench.csail.mit.edu/
- **Repository:** https://github.com/MadryLab/platinum-benchmarks/
- **Paper:** https://arxiv.org/abs/2502.03461
- **Leaderboard:** http://platinum-bench.csail.mit.edu/
- **Point of Contact:** [Joshua Vendrow](mailto:jvendrow@mit.edu), [Edward Vendrow](mailto:evendrow@mit.edu)

> [!NOTE]
> This HuggingFace dataset contains the _paper version_ of the dataset.
> Unless you are specifically interested in reproducing the results from our paper, we recommend that you use the live version, which we update as we find new issues with questions.
> You can find it [here](https://huggingface.co/datasets/madrylab/platinum-bench).

### Dataset Summary

_**Platinum Benchmarks**_ are benchmarks that are carefully curated to minimize label errors and ambiguity, allowing us to measure the reliability of models.

This dataset contains fifteen platinum benchmarks created by manually revising questions from existing datasets (see the GitHub repo for details on accessing our revised subset of VQA). To revise each benchmark, we ran a variety of frontier models on individual examples and manually re-annotated any example for which at least one model made an error. See the paper for further details on the revision process.
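The selection rule described in the summary (re-annotate any example that at least one model gets wrong) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `flag_for_review`, the dictionary layouts, the exact-string comparison, and the toy data are all assumptions made for illustration, and the paper should be consulted for how answers are actually parsed and compared.

```python
from typing import Dict, List


def flag_for_review(
    labels: Dict[str, str],
    predictions: Dict[str, Dict[str, str]],
) -> List[str]:
    """Return IDs of examples where at least one model disagrees with the label.

    labels: example ID -> reference answer (hypothetical layout).
    predictions: model name -> {example ID -> model answer} (hypothetical layout).
    """
    flagged = []
    for example_id, reference in labels.items():
        # A single wrong (or missing) answer from any model is enough to flag
        # the example. Exact string equality is an assumption of this sketch.
        if any(answers.get(example_id) != reference for answers in predictions.values()):
            flagged.append(example_id)
    return flagged


# Toy data, invented purely for illustration.
labels = {"q1": "4", "q2": "12", "q3": "7"}
predictions = {
    "model_a": {"q1": "4", "q2": "12", "q3": "7"},
    "model_b": {"q1": "4", "q2": "13", "q3": "7"},
}
print(flag_for_review(labels, predictions))  # ['q2']
```

Only the flagged examples would then go to a human annotator. A flagged example does not mean the label is wrong; it may equally be a genuine model error, which is why the manual step follows.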
### Load the Dataset

To load the dataset using HuggingFace `datasets`, you first need to `pip install datasets`, then run the following code:

```python
from datasets import load_dataset

ds = load_dataset("madrylab/platinum-bench-paper-version", name="gsm8k", split="test")  # or another subset
ds = ds.filter(lambda x: x['cleaning_status'] != 'rejected')  # filter out rejected questions
```

**For all additional information including licensing, please refer to the main dataset at [https://huggingface.co/datasets/madrylab/platinum-bench](https://huggingface.co/datasets/madrylab/platinum-bench)**.

### Citation Information

Cite this dataset and the source datasets (see [sources.bib](https://github.com/MadryLab/platinum-benchmarks/blob/main/sources.bib)).

```
@misc{vendrow2025largelanguagemodelbenchmarks,
  title={Do Large Language Model Benchmarks Test Reliability?},
  author={Joshua Vendrow and Edward Vendrow and Sara Beery and Aleksander Madry},
  year={2025},
  eprint={2502.03461},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2502.03461},
}
```


Provider:

maas

Created:

2025-02-08
