
platinum-bench-paper-cache

ModelScope community · Updated 2025-10-09 · Indexed 2025-02-15
Download link:
https://modelscope.cn/datasets/madrylab/platinum-bench-paper-cache
Description:
# Dataset Card for PlatinumBench LLM Cache

- **PlatinumBench:** https://huggingface.co/madrylab/platinum-bench
- **GitHub:** https://github.com/MadryLab/platinum-benchmarks
- **Leaderboard:** http://platinum-bench.csail.mit.edu/

## Dataset Description

- **Homepage:** http://platinum-bench.csail.mit.edu/
- **Repository:** https://github.com/MadryLab/platinum-benchmarks
- **Paper:** https://arxiv.org/abs/2502.03461
- **Leaderboard:** http://platinum-bench.csail.mit.edu/
- **Point of Contact:** [Joshua Vendrow](mailto:jvendrow@mit.edu), [Edward Vendrow](mailto:evendrow@mit.edu)

### Dataset Summary

_**Platinum Benchmarks**_ are benchmarks that are carefully curated to minimize label errors and ambiguity, allowing us to measure the reliability of models.

This repository contains a cache of the LLM inferences for the models we test in our paper, which can be used to exactly reproduce our results. We provide a separate cache for each dataset we test.

### Load the Dataset

To download the caches, use the script provided in our GitHub repository:

```
git clone https://github.com/MadryLab/platinum-benchmarks.git
cd platinum-benchmarks
bash scripts/download_paper_cache.sh
```

Then refer to the [instructions in the repository](https://github.com/MadryLab/platinum-benchmarks) to reproduce the paper results using these caches.

## Additional Information

### Licensing Information

See [PlatinumBench](https://huggingface.co/datasets/madrylab/platinum-bench) for the licensing information of the original datasets upon which our work is based. The further annotations and cached LLM responses we provide are licensed under the [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/legalcode) license.

### Citation Information

Cite this dataset and the source datasets (see [sources.bib](https://github.com/MadryLab/platinum-benchmarks/blob/main/sources.bib)).

```
@misc{vendrow2025largelanguagemodelbenchmarks,
  title={Do Large Language Model Benchmarks Test Reliability?},
  author={Joshua Vendrow and Edward Vendrow and Sara Beery and Aleksander Madry},
  year={2025},
  eprint={2502.03461},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2502.03461},
}
```
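
The download steps under "Load the Dataset" can be wrapped so that they are safe to re-run. The following is a minimal sketch rather than part of the repository: the function names `run` and `fetch_paper_cache`, the `DRY_RUN` variable, and the optional target-directory argument are all hypothetical. Only the repository URL and `scripts/download_paper_cache.sh` come from the card itself.

```shell
#!/usr/bin/env bash
# Sketch only: wraps the documented commands so they can be re-run safely.
# DRY_RUN and the directory argument are illustrative, not upstream options.

REPO_URL="https://github.com/MadryLab/platinum-benchmarks.git"

# Execute a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# Clone the repository if it is not already present, then run the cache
# download script from the repository root, as the card instructs.
fetch_paper_cache() {
  local repo_dir="${1:-platinum-benchmarks}"
  if [ ! -d "$repo_dir/.git" ]; then
    run git clone "$REPO_URL" "$repo_dir" || return 1
  fi
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '+ (cd %s && bash scripts/download_paper_cache.sh)\n' "$repo_dir"
  else
    # Subshell keeps the caller's working directory unchanged.
    (cd "$repo_dir" && bash scripts/download_paper_cache.sh)
  fi
}
```

After sourcing the file, `DRY_RUN=1 fetch_paper_cache` prints the commands without executing them, and `fetch_paper_cache` runs them. Skipping the clone when a checkout already exists avoids the error `git clone` raises on a non-empty target directory.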

Provider:
maas
Created:
2025-02-08