autobencher-qa-33k
Source: ModelScope community. Updated 2025-12-03; indexed 2025-12-06.
Download link:
https://modelscope.cn/datasets/allenai/autobencher-qa-33k
Resource description:
These are 33K questions generated using [Autobencher](https://arxiv.org/abs/2407.08351). The questions come from randomly sampled Wikipedia articles, which are further filtered and transformed into questions by GPT-4o.
This benchmark is used in the [signal and noise](https://huggingface.co/datasets/allenai/signal-and-noise) project to demonstrate the impact of a large sample size on the modeling noise of a benchmark.
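As a rough illustration of why sample size matters, the sketch below computes the binomial standard error of a measured accuracy. This is a simplifying assumption, not the noise measure used by the signal and noise project: it treats each question as an independent pass/fail trial, and the function name `accuracy_standard_error` and the 1,000-question comparison size are hypothetical choices made only for this example.

```python
import math


def accuracy_standard_error(accuracy: float, num_questions: int) -> float:
    """Binomial standard error of an accuracy measured on independent questions.

    Illustrative assumption only: real benchmark noise also depends on
    factors such as training randomness, which this formula ignores.
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be in [0, 1]")
    if num_questions <= 0:
        raise ValueError("num_questions must be positive")
    return math.sqrt(accuracy * (1.0 - accuracy) / num_questions)


if __name__ == "__main__":
    # Compare a hypothetical 1,000-question benchmark with a 33,000-question one.
    for n in (1_000, 33_000):
        se = accuracy_standard_error(0.5, n)
        print(f"n={n:>6}: standard error of accuracy = {se:.4f}")
```

Under this assumption the standard error shrinks with the square root of the number of questions, so going from 1,000 to 33,000 questions reduces it by a factor of about 5.7.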
### Citation
Please cite the original authors of Autobencher, as well as our work, which generated this particular evaluation set:
```
@article{li2024autobencher,
title={Autobencher: Towards declarative benchmark construction},
author={Li, Xiang Lisa and Kaiyom, Farzaan and Liu, Evan Zheran and Mai, Yifan and Liang, Percy and Hashimoto, Tatsunori},
journal={arXiv preprint arXiv:2407.08351},
year={2024}
}
```
```
@article{heineman2025signal,
title={Signal and Noise: A Framework for Reducing Uncertainty in Language Model Evaluation},
author={Heineman, David and Hofmann, Valentin and Magnusson, Ian and Gu, Yuling and Smith, Noah A and Hajishirzi, Hannaneh and Lo, Kyle and Dodge, Jesse},
journal={arXiv preprint arXiv:2508.13144},
year={2025}
}
```
### Dataset Description
- **Developed by:** Allen Institute for AI (Ai2)
- **Language(s) (NLP):** English
- **License:** This dataset contains model outputs generated from GPT-4o, which are subject to OpenAI's [Terms of Use](https://openai.com/policies/row-terms-of-use/). This dataset is licensed under CC BY 4.0. It is intended for research and educational use in accordance with Ai2's [Responsible Use Guidelines](https://allenai.org/responsible-use).
- **Contact:** Technical inquiries: `davidh@allenai.org`. Press: `press@allenai.org`
Provider: maas
Created: 2025-08-25



