platinum-bench-paper-cache

Name: platinum-bench-paper-cache
Creator: maas
Published: 2025-10-09 16:22:39
License: No description provided

Source: ModelScope community. Updated 2025-10-09; indexed 2025-02-15.

Download link:

https://modelscope.cn/datasets/madrylab/platinum-bench-paper-cache


Resource description:

# Dataset Card for PlatinumBench LLM Cache

- **PlatinumBench:** https://huggingface.co/madrylab/platinum-bench
- **GitHub:** https://github.com/MadryLab/platinum-benchmarks
- **Leaderboard:** http://platinum-bench.csail.mit.edu/

## Dataset Description

- **Homepage:** http://platinum-bench.csail.mit.edu/
- **Repository:** https://github.com/MadryLab/platinum-benchmarks
- **Paper:** https://arxiv.org/abs/2502.03461
- **Leaderboard:** http://platinum-bench.csail.mit.edu/
- **Point of Contact:** [Joshua Vendrow](mailto:jvendrow@mit.edu), [Edward Vendrow](mailto:evendrow@mit.edu)

### Dataset Summary

_**Platinum Benchmarks**_ are benchmarks that are carefully curated to minimize label errors and ambiguity, allowing us to measure the reliability of models.

This repository contains a cache of the LLM inferences for the models we test in our paper, which can be used to exactly reproduce our results. We provide a separate cache for each dataset we test.

### Load the Dataset

To download the caches, use the script provided in our GitHub repository:

```
git clone https://github.com/MadryLab/platinum-benchmarks.git
cd platinum-benchmarks
bash scripts/download_paper_cache.sh
```

Then follow the instructions in the [repository](https://github.com/MadryLab/platinum-benchmarks) to reproduce the paper results from these caches.

## Additional Information

### Licensing Information

See [PlatinumBench](https://huggingface.co/datasets/madrylab/platinum-bench) for the licensing information of the original datasets upon which our work is based. The further annotations and cached LLM responses we provide are licensed under the [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/legalcode) license.

### Citation Information

Cite this dataset and the source datasets (see [sources.bib](https://github.com/MadryLab/platinum-benchmarks/blob/main/sources.bib)).

```
@misc{vendrow2025largelanguagemodelbenchmarks,
  title={Do Large Language Model Benchmarks Test Reliability?},
  author={Joshua Vendrow and Edward Vendrow and Sara Beery and Aleksander Madry},
  year={2025},
  eprint={2502.03461},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2502.03461},
}
```
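The download steps under "Load the Dataset" fail on a second run, because `git clone` refuses to write into an existing directory. The following is a minimal sketch of a re-runnable wrapper around the same three commands. The names `run`, `fetch_paper_cache`, and the `DRY_RUN` variable are hypothetical helpers introduced here for illustration; they are not part of the platinum-benchmarks repository. Only the repository URL and the `scripts/download_paper_cache.sh` path come from the card.

```shell
#!/usr/bin/env bash
# Sketch only: wraps the three commands from "Load the Dataset".
# run, fetch_paper_cache and DRY_RUN are hypothetical names, not repository features.
set -eu

REPO_URL="https://github.com/MadryLab/platinum-benchmarks.git"
REPO_DIR="platinum-benchmarks"

run() {
  # Print the command; execute it only when DRY_RUN is not set to 1.
  echo "+ $*"
  if [ "${DRY_RUN:-0}" != "1" ]; then
    "$@"
  fi
}

fetch_paper_cache() {
  # Clone only when no checkout exists yet, so the function can be re-run.
  if [ ! -d "$REPO_DIR/.git" ]; then
    run git clone "$REPO_URL" "$REPO_DIR"
  fi
  # Invoke the download script from the repository root, as the card does.
  run sh -c "cd '$REPO_DIR' && bash scripts/download_paper_cache.sh"
}
```

After sourcing the file, `DRY_RUN=1 fetch_paper_cache` prints the commands without touching the network, and `fetch_paper_cache` alone performs the download.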


Provider:

maas

Created:

2025-02-08
