fluid-benchmarking

ModelScope community · updated 2025-12-04 · indexed 2025-12-06
Download link:
https://modelscope.cn/datasets/allenai/fluid-benchmarking
Description:
<div align="center">
  <h1>Fluid Language Model Benchmarking</h1>
</div>

<p align="center">
  <a href="https://creativecommons.org/licenses/by/4.0/deed.en">
    <img src="https://img.shields.io/badge/CC_BY-4.0-ED592F?logo=creativecommons&logoColor=white">
  </a>
  <a href="https://github.com/allenai/fluid-benchmarking">
    <img src="https://img.shields.io/badge/GitHub-Code-blue?logo=github&logoColor=white">
  </a>
  <a href="https://arxiv.org/abs/2509.11106">
    <img src="https://img.shields.io/badge/ArXiv-2509.11106-B31B1B?logo=arxiv&logoColor=white">
  </a>
  <a href="https://allenai.org/blog/fluid-benchmarking">
    <img src="https://img.shields.io/badge/Ai2-Blog-F0529C?labelColor=555555">
  </a>
</p>

This dataset provides IRT models for [ARC Challenge](https://huggingface.co/datasets/allenai/ai2_arc), [GSM8K](https://huggingface.co/datasets/openai/gsm8k), [HellaSwag](https://huggingface.co/datasets/Rowan/hellaswag), [MMLU](https://huggingface.co/datasets/cais/mmlu), [TruthfulQA](https://github.com/sylinrl/TruthfulQA), and [WinoGrande](https://huggingface.co/datasets/allenai/winogrande). Furthermore, it contains results for pretraining checkpoints of [Amber-6.7B](https://huggingface.co/LLM360/Amber), [K2-65B](https://huggingface.co/LLM360/K2), [OLMo1-7B](https://huggingface.co/allenai/OLMo-7B-0724-hf), [OLMo2-7B](https://huggingface.co/allenai/OLMo-2-1124-7B), [Pythia-2.8B](https://huggingface.co/EleutherAI/pythia-2.8b), and [Pythia-6.9B](https://huggingface.co/EleutherAI/pythia-6.9b), evaluated on these six benchmarks.

### 🚀 Usage

For utilities to use the dataset and to replicate the results from the paper, please see the corresponding [GitHub repository](https://github.com/allenai/fluid-benchmarking).

The following example demonstrates how to load IRT models and language model evaluation results:

```python
from fluid_benchmarking import datasets

# Load IRT model for specified benchmark
benchmark = "mmlu"
irt_model = datasets.load_irt_model(
    repo_id="allenai/fluid-benchmarking",
    filename=f"data/irt_models/{benchmark}.csv",
)

# Load evaluation results for specified LM
lm = "olmo1-7b"
lm_eval_results = datasets.load_lm_eval_results(
    repo_id="allenai/fluid-benchmarking",
    filename=f"data/lm_eval_results/{lm}.csv",
)
```

The dataset also contains accuracy scores and IRT ability estimates for the 102 language models from the [Open LLM Leaderboard](https://huggingface.co/docs/leaderboards/open_llm_leaderboard/archive) used in the paper ([`data/open_llm_leaderboard_results.json`](https://huggingface.co/datasets/allenai/fluid-benchmarking/blob/main/data/open_llm_leaderboard_results.json)) as well as a mapping from item IDs to question text and answer options ([`data/id_to_item_map.json`](https://huggingface.co/datasets/allenai/fluid-benchmarking/blob/main/data/id_to_item_map.json)).

### 📚 Citation

```
@inproceedings{hofmann2025fluid,
  title={Fluid Language Model Benchmarking},
  author={Valentin Hofmann and David Heineman and Ian Magnusson and Kyle Lo and Jesse Dodge and Maarten Sap and Pang Wei Koh and Chun Wang and Hannaneh Hajishirzi and Noah A. Smith},
  booktitle={Second Conference on Language Modeling},
  year={2025}
}
```

### ⚖️ License

This dataset is licensed under CC BY-4.0. It is intended for research and educational use in accordance with Ai2's [Responsible Use Guidelines](https://allenai.org/responsible-use).
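To give a rough sense of what the IRT models and ability estimates mentioned above represent, here is a minimal, self-contained sketch. It assumes a two-parameter logistic (2PL) model, in which each item has a discrimination `a` and a difficulty `b`; the dataset card does not state the model form, and the item parameters below are made up for illustration rather than taken from the released CSV files. The function names `p_correct` and `estimate_ability` are hypothetical and are not part of the `fluid_benchmarking` package.

```python
import math


def p_correct(theta, a, b):
    """2PL item response function (assumed form): probability that a
    model with ability `theta` answers an item with discrimination `a`
    and difficulty `b` correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def estimate_ability(responses, items, grid=None):
    """Maximum-likelihood ability estimate via a simple grid search.

    responses: list of 0/1 correctness flags, one per item
    items:     list of (a, b) parameter pairs, aligned with responses
    """
    if grid is None:
        # Candidate abilities from -4.00 to 4.00 in steps of 0.01.
        grid = [i / 100.0 for i in range(-400, 401)]

    def log_likelihood(theta):
        total = 0.0
        for y, (a, b) in zip(responses, items):
            p = p_correct(theta, a, b)
            total += math.log(p) if y else math.log(1.0 - p)
        return total

    return max(grid, key=log_likelihood)


# Illustrative (made-up) item parameters: (discrimination, difficulty).
items = [(1.2, -1.0), (0.8, 0.0), (1.5, 0.5), (1.0, 1.5)]

strong = estimate_ability([1, 1, 1, 0], items)  # three of four correct
weak = estimate_ability([1, 0, 0, 0], items)    # only the easiest item correct
print(f"strong: {strong:.2f}, weak: {weak:.2f}")
```

Under this assumed model, a language model that answers more (and more discriminating) items correctly receives a higher ability estimate, which is the kind of quantity stored alongside accuracy in `data/open_llm_leaderboard_results.json`. For the actual estimation procedure, refer to the GitHub repository linked above.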

Provider:
maas
Created:
2025-09-16