platinum-bench-paper-cache
Source: ModelScope community. Updated 2025-10-09; indexed 2025-02-15.
Download link:
https://modelscope.cn/datasets/madrylab/platinum-bench-paper-cache
Resource description:
# Dataset Card for PlatinumBench LLM Cache
- **PlatinumBench:** https://huggingface.co/madrylab/platinum-bench
- **GitHub:** https://github.com/MadryLab/platinum-benchmarks
- **Leaderboard:** http://platinum-bench.csail.mit.edu/
## Dataset Description
- **Homepage:** http://platinum-bench.csail.mit.edu/
- **Repository:** https://github.com/MadryLab/platinum-benchmarks
- **Paper:** https://arxiv.org/abs/2502.03461
- **Leaderboard:** http://platinum-bench.csail.mit.edu/
- **Point of Contact:** [Joshua Vendrow](mailto:jvendrow@mit.edu), [Edward Vendrow](mailto:evendrow@mit.edu)
### Dataset Summary
_**Platinum Benchmarks**_ are benchmarks that are carefully curated to minimize label errors and ambiguity, allowing us to measure the reliability of models.
This repository contains a cache of the LLM inferences for the models we test in our paper, which can be used to exactly reproduce our results. We provide a separate cache for each dataset we test.
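To illustrate why stored inferences make results exactly reproducible, here is a minimal sketch of a response cache keyed by model and prompt. It is an illustration only and does not reflect the actual file format or code used in the repository: `InferenceCache`, `cache_key`, the model name, and the prompt are all hypothetical.

```python
import hashlib
import json


def cache_key(model: str, prompt: str) -> str:
    """Build a deterministic key from the model name and the exact prompt."""
    payload = json.dumps({"model": model, "prompt": prompt}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class InferenceCache:
    """Hypothetical in-memory cache mapping (model, prompt) to a stored response."""

    def __init__(self):
        self._store = {}

    def put(self, model, prompt, response):
        """Record a response so later lookups can replay it."""
        self._store[cache_key(model, prompt)] = response

    def get_or_call(self, model, prompt, call_model):
        """Return the cached response, calling the model only on a miss."""
        key = cache_key(model, prompt)
        if key not in self._store:
            self._store[key] = call_model(model, prompt)
        return self._store[key]


def fail_if_called(model, prompt):
    """Stand-in for a live model call; a cache hit must never reach it."""
    raise RuntimeError("model should not be queried on a cache hit")


# A cached answer is replayed without invoking the model again.
cache = InferenceCache()
cache.put("example-model", "What is 2 + 2?", "4")
replayed = cache.get_or_call("example-model", "What is 2 + 2?", fail_if_called)
print(replayed)
```

Because every lookup returns the stored response, re-running an evaluation against such a cache yields the same outputs regardless of sampling randomness or later changes to the model behind an API.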
### Load the Dataset
To download the caches, use the script provided in our GitHub repository:
```
git clone https://github.com/MadryLab/platinum-benchmarks.git
cd platinum-benchmarks
bash scripts/download_paper_cache.sh
```
Then follow the instructions in the [repository](https://github.com/MadryLab/platinum-benchmarks) to reproduce the paper's results using these caches.
## Additional Information
### Licensing Information
See [PlatinumBench](https://huggingface.co/datasets/madrylab/platinum-bench) for the licensing information of the original datasets upon which our work is based. The further annotations and cached LLM responses we provide are licensed under the [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/legalcode) license.
### Citation Information
Cite this dataset and the source datasets (see [sources.bib](https://github.com/MadryLab/platinum-benchmarks/blob/main/sources.bib)).
```
@misc{vendrow2025largelanguagemodelbenchmarks,
title={Do Large Language Model Benchmarks Test Reliability?},
author={Joshua Vendrow and Edward Vendrow and Sara Beery and Aleksander Madry},
year={2025},
eprint={2502.03461},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2502.03461},
}
```
Provider: maas
Created: 2025-02-08



