AttributionBench

Name: AttributionBench
Creator: maas
Published: 2025-12-05 16:40:33
License: 暂无描述

魔搭社区2025-12-05 更新2025-07-05 收录

下载链接：

https://modelscope.cn/datasets/osunlp/AttributionBench

下载链接

链接失效反馈

官方服务：

资源简介：

# Dataset Card for AttributionBench - Github repository: <a href="https://github.com/OSU-NLP-Group/AttributionBench/">[Github]</a> - Paper: <a href="https://arxiv.org/abs/2402.15089">AttributionBench: How Hard is Automatic Attribution Evaluation?</a> - Point of Contact: <a href="mailto:li.14042@osu.edu">Yifei Li</a> # Dataset Overview We constructed this dataset from multiple existing data sources in a unified format, in order to create a unified and diverse testbed for evaluating advanced attribution evaluation systems. The dataset contains both in-domain training set and id-domain and out-of-domain test set. ## Usage ```python import datasets features = datasets.Features({ 'question': datasets.Value('string'), 'claim': datasets.Value('string'), 'claim_raw_string': datasets.Value('string'), 'response': datasets.Value('string'), 'references': datasets.Sequence(datasets.Value("string")), 'citation_links': datasets.Sequence(datasets.Value("string")), 'webpage_references': datasets.Sequence(datasets.Value("string")), 'attribution_label': datasets.Value('string'), 'src_dataset': datasets.Value('string'), 'id': datasets.Value('string'), }) # in-domain train (subset-balanced) # possible values for 'name' field: ["subset_balanced", "overall_balanced", "not_balanced", "full_data"] dataset = datasets.load_dataset("osunlp/AttributionBench", name="subset_balanced", split="train", features=features) # in-domain eval/test (subset-balanced) # dataset = datasets.load_dataset("osunlp/AttributionBench", name="subset_balanced", split="test", features=features) dataset = datasets.load_dataset("osunlp/AttributionBench", name="subset_balanced", split="test", features=features) # out-of-domain test (subset-balanced) dataset = datasets.load_dataset("osunlp/AttributionBench", name="subset_balanced", split="test_ood", features=features) ``` ## Dataset Structure ### Data Instances ```json { "question":"Is the number of horses living on Easter Island twice the number of people?", "claim":"According to James Grant-Peterkin in his book \u201cA Companion to Easter Island\u201d, there are almost 3,000 horses on Easter Island. However, locals often claim that they have more horses than people. The population of Easter Island is about 6,000 inhabitants. So it seems that the number of horses living on Easter Island is not twice the number of people.", "claim_raw_string":"According to James Grant-Peterkin in his book \u201cA Companion to Easter Island\u201d, there are almost 3,000 horses on Easter Island. However, locals often claim that they have more horses than people. The population of Easter Island is about 6,000 inhabitants. So it seems that the number of horses living on Easter Island is not twice the number of people.", "response":"According to James Grant-Peterkin in his book \u201cA Companion to Easter Island\u201d, there are almost 3,000 horses on Easter Island. However, locals often claim that they have more horses than people. The population of Easter Island is about 6,000 inhabitants. So it seems that the number of horses living on Easter Island is not twice the number of people.", "references":[ "It is worth mentioning the huge population of horses (about 6,000) that already outnumber people and roam free on the island." ], "citation_links":[], "webpage_references":[], "attribution_label":"not attributable", "src_dataset":"AttrScore-GenSearch", "id":"AttrScore-GenSearch_7234d6e9-1f51-4203-9587-f539e34d34f4" } ``` ### Data Fields - ```question```: ```str``` The question proposed by the user. - ```claim```: ```str``` Part of the response to the question. Could be one single sentence or multiple sentences. - ```claim_raw_string```: ```str``` The raw string of the claim from the original datasets before being processed. - ```response```: ```str``` The response to the question generated by LMs or generative search engines. - ```references```: ```List[str]``` A list of documents or paragraphs which could support the claim. - ```citation_links```: ```Optional[List[str]]``` Reserved field for citation links. - ```webpage_references```: ```Optional[List[str]]``` Reserved field for the webpage contents of the reference links. - ```attribution_label```: ```str``` "attributable" or "not attributable". - ```src_dataset```: ```str``` The source dataset of the data item. - ```id```: ```str``` The unique id for the data item in AttributionBench. ## Citation <section class="section" id="BibTeX"> <div class="container is-max-desktop content"> <h2 class="title">Reference</h2> Please kindly cite our paper if you use our code, data, or results: <pre><code>@misc{li2024attributionbench, title={AttributionBench: How Hard is Automatic Attribution Evaluation?}, author={Yifei Li and Xiang Yue and Zeyi Liao and Huan Sun}, year={2024}, eprint={2402.15089}, archivePrefix={arXiv}, primaryClass={cs.CL} } </code></pre> If used, please also cite the original datasets accordingly: <pre><code>@misc{malaviya2023expertqa, title={ExpertQA: Expert-Curated Questions and Attributed Answers}, author={Chaitanya Malaviya and Subin Lee and Sihao Chen and Elizabeth Sieber and Mark Yatskar and Dan Roth}, year={2023}, eprint={2309.07852}, archivePrefix={arXiv}, primaryClass={cs.CL} } </code></pre> <pre><code>@inproceedings{liu-etal-2023-evaluating, title = "Evaluating Verifiability in Generative Search Engines", author = "Liu, Nelson and Zhang, Tianyi and Liang, Percy", editor = "Bouamor, Houda and Pino, Juan and Bali, Kalika", booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2023", month = dec, year = "2023", address = "Singapore", publisher = "Association for Computational Linguistics", url = "https://aclanthology.org/2023.findings-emnlp.467", doi = "10.18653/v1/2023.findings-emnlp.467", pages = "7001--7025", abstract = "Generative search engines directly generate responses to user queries, along with in-line citations. A prerequisite trait of a trustworthy generative search engine is verifiability, i.e., systems should cite comprehensively (high citation recall; all statements are fully supported by citations) and accurately (high citation precision; every cite supports its associated statement). We conduct human evaluation to audit four popular generative search engines{---}Bing Chat, NeevaAI, perplexity.ai, and YouChat{---}across a diverse set of queries from a variety of sources (e.g., historical Google user queries, dynamically-collected open-ended questions on Reddit, etc.). We find that responses from existing generative search engines are fluent and appear informative, but frequently contain unsupported statements and inaccurate citations: on average, a mere 51.5{\%} of generated sentences are fully supported by citations and only 74.5{\%} of citations support their associated sentence. We believe that these results are concerningly low for systems that may serve as a primary tool for information-seeking users, especially given their facade of trustworthiness. We hope that our results further motivate the development of trustworthy generative search engines and help researchers and users better understand the shortcomings of existing commercial systems.", } </code></pre> <pre><code>@misc{bohnet2023attributed, title={Attributed Question Answering: Evaluation and Modeling for Attributed Large Language Models}, author={Bernd Bohnet and Vinh Q. Tran and Pat Verga and Roee Aharoni and Daniel Andor and Livio Baldini Soares and Massimiliano Ciaramita and Jacob Eisenstein and Kuzman Ganchev and Jonathan Herzig and Kai Hui and Tom Kwiatkowski and Ji Ma and Jianmo Ni and Lierni Sestorain Saralegui and Tal Schuster and William W. Cohen and Michael Collins and Dipanjan Das and Donald Metzler and Slav Petrov and Kellie Webster}, year={2023}, eprint={2212.08037}, archivePrefix={arXiv}, primaryClass={cs.CL} } </code></pre> <pre><code>@misc{chen2023understanding, title={Understanding Retrieval Augmentation for Long-Form Question Answering}, author={Hung-Ting Chen and Fangyuan Xu and Shane Arora and Eunsol Choi}, year={2023}, eprint={2310.12150}, archivePrefix={arXiv}, primaryClass={cs.CL} } </code></pre> <pre><code>@article{dziri-etal-2022-evaluating, title = "Evaluating Attribution in Dialogue Systems: The {BEGIN} Benchmark", author = "Dziri, Nouha and Rashkin, Hannah and Linzen, Tal and Reitter, David", editor = "Roark, Brian and Nenkova, Ani", journal = "Transactions of the Association for Computational Linguistics", volume = "10", year = "2022", address = "Cambridge, MA", publisher = "MIT Press", url = "https://aclanthology.org/2022.tacl-1.62", doi = "10.1162/tacl_a_00506", pages = "1066--1083", abstract = "Knowledge-grounded dialogue systems powered by large language models often generate responses that, while fluent, are not attributable to a relevant source of information. Progress towards models that do not exhibit this issue requires evaluation metrics that can quantify its prevalence. To this end, we introduce the Benchmark for Evaluation of Grounded INteraction (Begin), comprising 12k dialogue turns generated by neural dialogue systems trained on three knowledge-grounded dialogue corpora. We collect human annotations assessing the extent to which the models{'} responses can be attributed to the given background information. We then use Begin to analyze eight evaluation metrics. We find that these metrics rely on spurious correlations, do not reliably distinguish attributable abstractive responses from unattributable ones, and perform substantially worse when the knowledge source is longer. Our findings underscore the need for more sophisticated and robust evaluation metrics for knowledge-grounded dialogue. We make Begin publicly available at \url{https://github.com/google/BEGIN-dataset}.", } </code></pre> <pre><code>@inproceedings{yue-etal-2023-automatic, title = "Automatic Evaluation of Attribution by Large Language Models", author = "Yue, Xiang and Wang, Boshi and Chen, Ziru and Zhang, Kai and Su, Yu and Sun, Huan", editor = "Bouamor, Houda and Pino, Juan and Bali, Kalika", booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2023", month = dec, year = "2023", address = "Singapore", publisher = "Association for Computational Linguistics", url = "https://aclanthology.org/2023.findings-emnlp.307", doi = "10.18653/v1/2023.findings-emnlp.307", pages = "4615--4635", abstract = "A recent focus of large language model (LLM) development, as exemplified by generative search engines, is to incorporate external references to generate and support its claims. However, evaluating the attribution, i.e., verifying whether the generated statement is fully supported by the cited reference, remains an open problem. Although human evaluation is common practice, it is costly and time-consuming. In this paper, we investigate automatic evaluation of attribution given by LLMs. We begin by defining different types of attribution errors, and then explore two approaches for automatic evaluation: prompting LLMs and fine-tuning smaller LMs. The fine-tuning data is repurposed from related tasks such as question answering, fact-checking, natural language inference, and summarization. We manually curate a set of test examples covering 12 domains from a generative search engine, New Bing. Our results on this curated test set and simulated examples from existing benchmarks highlight both promising signals and challenges. We hope our problem formulation, testbeds, and findings will help lay the foundation for future studies on this important problem.", } </code></pre> </code></pre> <pre><code>@misc{kamalloo2023hagrid, title={HAGRID: A Human-LLM Collaborative Dataset for Generative Information-Seeking with Attribution}, author={Ehsan Kamalloo and Aref Jafari and Xinyu Zhang and Nandan Thakur and Jimmy Lin}, year={2023}, eprint={2307.16883}, archivePrefix={arXiv}, primaryClass={cs.CL} } </code></pre>

# AttributionBench 数据集卡片 - GitHub 仓库：<a href="https://github.com/OSU-NLP-Group/AttributionBench/">[GitHub]</a> - 相关论文：<a href="https://arxiv.org/abs/2402.15089">AttributionBench：自动归因评估难度几何？</a> - 联系人：<a href="mailto:li.14042@osu.edu">李亦飞</a> # 数据集概览本数据集从多个现有数据源以统一格式构建，旨在打造一个统一且多样化的测试平台，用于评估先进的归因评估系统。数据集包含域内训练集，以及域内与域外测试集。 ## 使用方法 python import datasets features = datasets.Features({ 'question': datasets.Value('string'), 'claim': datasets.Value('string'), 'claim_raw_string': datasets.Value('string'), 'response': datasets.Value('string'), 'references': datasets.Sequence(datasets.Value("string")), 'citation_links': datasets.Sequence(datasets.Value("string")), 'webpage_references': datasets.Sequence(datasets.Value("string")), 'attribution_label': datasets.Value('string'), 'src_dataset': datasets.Value('string'), 'id': datasets.Value('string'), }) # 域内训练集（子集平衡版） # 'name' 字段可选值：["subset_balanced", "overall_balanced", "not_balanced", "full_data"] dataset = datasets.load_dataset("osunlp/AttributionBench", name="subset_balanced", split="train", features=features) # 域内评估/测试集（子集平衡版） # dataset = datasets.load_dataset("osunlp/AttributionBench", name="subset_balanced", split="test", features=features) dataset = datasets.load_dataset("osunlp/AttributionBench", name="subset_balanced", split="test", features=features) # 域外测试集（子集平衡版） dataset = datasets.load_dataset("osunlp/AttributionBench", name="subset_balanced", split="test_ood", features=features) ## 数据集结构 ### 数据样例 json { "question":"复活节岛上的马匹数量是否是人类数量的两倍？", "claim":"根据詹姆斯·格兰特-彼得金在其著作《复活节岛指南》中的记载，复活节岛上现存马匹约3000匹。然而，当地居民常称其拥有的马匹数量超过人类。复活节岛的总人口约为6000人。因此，复活节岛上的马匹数量似乎并未达到人类数量的两倍。", "claim_raw_string":"According to James Grant-Peterkin in his book “A Companion to Easter Island”, there are almost 3,000 horses on Easter Island. However, locals often claim that they have more horses than people. The population of Easter Island is about 6,000 inhabitants. So it seems that the number of horses living on Easter Island is not twice the number of people.", "response":"According to James Grant-Peterkin in his book “A Companion to Easter Island”, there are almost 3,000 horses on Easter Island. However, locals often claim that they have more horses than people. The population of Easter Island is about 6,000 inhabitants. So it seems that the number of horses living on Easter Island is not twice the number of people.", "references":[ "值得一提的是，岛上马匹数量庞大（约6000匹），数量超过人类且自由漫游。" ], "citation_links":[], "webpage_references":[], "attribution_label":"not attributable", "src_dataset":"AttrScore-GenSearch", "id":"AttrScore-GenSearch_7234d6e9-1f51-4203-9587-f539e34d34f4" } ### 数据字段说明 - question: str 用户提出的查询问题。 - claim: str 模型响应中针对该问题的断言内容，可单句或多句。 - claim_raw_string: str 处理前来自原始数据集的断言原始字符串。 - response: str 由大语言模型（Large Language Model）或生成式搜索引擎生成的针对用户查询的完整响应。 - references: List[str] 可用于佐证该断言的文档或段落列表。 - citation_links: Optional[List[str]] 用于存储引用链接的预留字段。 - webpage_references: Optional[List[str]] 用于存储引用链接对应网页内容的预留字段。 - attribution_label: str 取值为"attributable"（可归因）或"not attributable"（不可归因）。 - src_dataset: str 该数据项的来源数据集名称。 - id: str AttributionBench中该数据项的唯一标识符。 ## 参考文献 <section class="section" id="BibTeX"> <div class="container is-max-desktop content"> <h2 class="title">引用说明</h2> 若您使用了本项目的代码、数据或研究结果，请引用我们的论文： <pre><code>@misc{li2024attributionbench, title={AttributionBench: How Hard is Automatic Attribution Evaluation?}, author={Yifei Li and Xiang Yue and Zeyi Liao and Huan Sun}, year={2024}, eprint={2402.15089}, archivePrefix={arXiv}, primaryClass={cs.CL} } </code></pre> 若使用了本数据集包含的原始数据，请同时引用对应的原始数据集文献： <pre><code>@misc{malaviya2023expertqa, title={ExpertQA: Expert-Curated Questions and Attributed Answers}, author={Chaitanya Malaviya and Subin Lee and Sihao Chen and Elizabeth Sieber and Mark Yatskar and Dan Roth}, year={2023}, eprint={2309.07852}, archivePrefix={arXiv}, primaryClass={cs.CL} } </code></pre> <pre><code>@inproceedings{liu-etal-2023-evaluating, title = "Evaluating Verifiability in Generative Search Engines", author = "Liu, Nelson and Zhang, Tianyi and Liang, Percy", editor = "Bouamor, Houda and Pino, Juan and Bali, Kalika", booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2023", month = dec, year = "2023", address = "Singapore", publisher = "Association for Computational Linguistics", url = "https://aclanthology.org/2023.findings-emnlp.467", doi = "10.18653/v1/2023.findings-emnlp.467", pages = "7001--7025", abstract = "Generative search engines directly generate responses to user queries, along with in-line citations. A prerequisite trait of a trustworthy generative search engine is verifiability, i.e., systems should cite comprehensively (high citation recall; all statements are fully supported by citations) and accurately (high citation precision; every cite supports its associated statement). We conduct human evaluation to audit four popular generative search engines---Bing Chat, NeevaAI, perplexity.ai, and YouChat---across a diverse set of queries from a variety of sources (e.g., historical Google user queries, dynamically-collected open-ended questions on Reddit, etc.). We find that responses from existing generative search engines are fluent and appear informative, but frequently contain unsupported statements and inaccurate citations: on average, a mere 51.5% of generated sentences are fully supported by citations and only 74.5% of citations support their associated sentence. We believe that these results are concerningly low for systems that may serve as a primary tool for information-seeking users, especially given their facade of trustworthiness. We hope that our results further motivate the development of trustworthy generative search engines and help researchers and users better understand the shortcomings of existing commercial systems.", } </code></pre> <pre><code>@misc{bohnet2023attributed, title={Attributed Question Answering: Evaluation and Modeling for Attributed Large Language Models}, author={Bernd Bohnet and Vinh Q. Tran and Pat Verga and Roee Aharoni and Daniel Andor and Livio Baldini Soares and Massimiliano Ciaramita and Jacob Eisenstein and Kuzman Ganchev and Jonathan Herzig and Kai Hui and Tom Kwiatkowski and Ji Ma and Jianmo Ni and Lierni Sestorain Saralegui and Tal Schuster and William W. Cohen and Michael Collins and Dipanjan Das and Donald Metzler and Slav Petrov and Kellie Webster}, year={2023}, eprint={2212.08037}, archivePrefix={arXiv}, primaryClass={cs.CL} } </code></pre> <pre><code>@misc{chen2023understanding, title={Understanding Retrieval Augmentation for Long-Form Question Answering}, author={Hung-Ting Chen and Fangyuan Xu and Shane Arora and Eunsol Choi}, year={2023}, eprint={2310.12150}, archivePrefix={arXiv}, primaryClass={cs.CL} } </code></pre> <pre><code>@article{dziri-etal-2022-evaluating, title = "Evaluating Attribution in Dialogue Systems: The {BEGIN} Benchmark", author = "Dziri, Nouha and Rashkin, Hannah and Linzen, Tal and Reitter, David", editor = "Roark, Brian and Nenkova, Ani", journal = "Transactions of the Association for Computational Linguistics", volume = "10", year = "2022", address = "Cambridge, MA", publisher = "MIT Press", url = "https://aclanthology.org/2022.tacl-1.62", doi = "10.1162/tacl_a_00506", pages = "1066--1083", abstract = "Knowledge-grounded dialogue systems powered by large language models often generate responses that, while fluent, are not attributable to a relevant source of information. Progress towards models that do not exhibit this issue requires evaluation metrics that can quantify its prevalence. To this end, we introduce the Benchmark for Evaluation of Grounded INteraction (Begin), comprising 12k dialogue turns generated by neural dialogue systems trained on three knowledge-grounded dialogue corpora. We collect human annotations assessing the extent to which the models' responses can be attributed to the given background information. We then use Begin to analyze eight evaluation metrics. We find that these metrics rely on spurious correlations, do not reliably distinguish attributable abstractive responses from unattributable ones, and perform substantially worse when the knowledge source is longer. Our findings underscore the need for more sophisticated and robust evaluation metrics for knowledge-grounded dialogue. We make Begin publicly available at url{https://github.com/google/BEGIN-dataset}.", } </code></pre> <pre><code>@inproceedings{yue-etal-2023-automatic, title = "Automatic Evaluation of Attribution by Large Language Models", author = "Yue, Xiang and Wang, Boshi and Chen, Ziru and Zhang, Kai and Su, Yu and Sun, Huan", editor = "Bouamor, Houda and Pino, Juan and Bali, Kalika", booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2023", month = dec, year = "2023", address = "Singapore", publisher = "Association for Computational Linguistics", url = "https://aclanthology.org/2023.findings-emnlp.307", doi = "10.18653/v1/2023.findings-emnlp.307", pages = "4615--4635", abstract = "A recent focus of large language model (LLM) development, as exemplified by generative search engines, is to incorporate external references to generate and support its claims. However, evaluating the attribution, i.e., verifying whether the generated statement is fully supported by the cited reference, remains an open problem. Although human evaluation is common practice, it is costly and time-consuming. In this paper, we investigate automatic evaluation of attribution given by LLMs. We begin by defining different types of attribution errors, and then explore two approaches for automatic evaluation: prompting LLMs and fine-tuning smaller LMs. The fine-tuning data is repurposed from related tasks such as question answering, fact-checking, natural language inference, and summarization. We manually curate a set of test examples covering 12 domains from a generative search engine, New Bing. Our results on this curated test set and simulated examples from existing benchmarks highlight both promising signals and challenges. We hope our problem formulation, testbeds, and findings will help lay the foundation for future studies on this important problem.", } </code></pre> <pre><code>@misc{kamalloo2023hagrid, title={HAGRID: A Human-LLM Collaborative Dataset for Generative Information-Seeking with Attribution}, author={Ehsan Kamalloo and Aref Jafari and Xinyu Zhang and Nandan Thakur and Jimmy Lin}, year={2023}, eprint={2307.16883}, archivePrefix={arXiv}, primaryClass={cs.CL} } </code></pre>

提供机构：

maas

创建时间：

2025-07-04

5,000+

优质数据集

54 个

任务类型

进入经典数据集