
fangyuan/lfqa_discourse

Source: Hugging Face. Updated 2024-02-28; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/fangyuan/lfqa_discourse
Resource description:
---
license: cc
language:
- en
size_categories:
- 1K<n<10K
---

# Dataset Card for LFQA Discourse

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Repository:** [Repo](https://github.com/utcsnlp/lfqa_discourse)
- **Paper:** [How Do We Answer Complex Questions: Discourse Structure of Long-form Answers](https://arxiv.org/abs/2203.11048)
- **Point of Contact:** fangyuan[at]utexas.edu

### Dataset Summary

This dataset contains discourse annotations of long-form answers. There are two types of annotation:

* **Validity:** whether a <question, answer> pair is valid, judged against a defined set of invalid reasons.
* **Role:** sentence-level annotation of functional roles for long-form answers.

### Languages

The dataset contains data in English.

## Dataset Structure

### Data Instances

Each instance is a (question, long-form answer) pair from one of four data sources: ELI5, WebGPT, NQ, and model-generated answers (denoted as ELI5-model). Each instance also carries our discourse annotation, which consists of a QA-pair-level validity label and sentence-level functional role labels.

We provide all validity and role annotations here. For the train/val/test split, please refer to our [github repository](https://github.com/utcsnlp/lfqa_discourse).

### Data Fields

For validity annotations, each instance contains the following fields:

* `dataset`: The dataset this QA pair belongs to, one of [`NQ`, `ELI5`, `Web-GPT`]. Note that `ELI5` contains both human-written answers and model-generated answers, with model-generated answers distinguished by the `a_id` field mentioned below.
* `q_id`: The question id, same as in the original NQ or ELI5 dataset.
* `a_id`: The answer id, same as in the original ELI5 dataset. For NQ, we populate a dummy `a_id` (1). For machine-generated answers, this field corresponds to the name of the model.
* `question`: The question.
* `answer_paragraph`: The answer paragraph.
* `answer_sentences`: The list of answer sentences, tokenized from the answer paragraph.
* `is_valid`: A boolean value indicating whether the QA pair is valid, values: [`True`, `False`].
* `invalid_reason`: A list of lists; each inner list contains the invalid reasons an annotator selected. Each invalid reason is one of [`no_valid_answer`, `nonsensical_question`, `assumptions_rejected`, `multiple_questions`].

For role annotations, each instance contains the following fields:

* `dataset`: The dataset this QA pair belongs to, one of [`NQ`, `ELI5`, `Web-GPT`]. Note that `ELI5` contains both human-written answers and model-generated answers, with model-generated answers distinguished by the `a_id` field mentioned below.
* `q_id`: The question id, same as in the original NQ or ELI5 dataset.
* `a_id`: The answer id, same as in the original ELI5 dataset. For NQ, we populate a dummy `a_id` (1). For machine-generated answers, this field corresponds to the name of the model.
* `question`: The question.
* `answer_paragraph`: The answer paragraph.
* `answer_sentences`: The list of answer sentences, tokenized from the answer paragraph.
* `role_annotation`: The list of majority roles (or adjudicated roles, where they exist) for the sentences in `answer_sentences`. Each role is one of [`Answer`, `Answer - Example`, `Answer (Summary)`, `Auxiliary Information`, `Answer - Organizational sentence`, `Miscellaneous`].
* `raw_role_annotation`: A list of lists; each inner list contains raw role annotations for the sentences in `answer_sentences`.

### Data Splits

For train/validation/test splits, please refer to our [repository](https://github.com/utcsnlp/lfqa_discourse).

## Dataset Creation

Please refer to our [paper](https://arxiv.org/abs/2203.11048) and datasheet for details on dataset creation, the annotation process, and a discussion of limitations.

## Additional Information

### Licensing Information

https://creativecommons.org/licenses/by-sa/4.0/legalcode

### Citation Information

```
@inproceedings{xu2022lfqadiscourse,
  title     = {How Do We Answer Complex Questions: Discourse Structure of Long-form Answers},
  author    = {Xu, Fangyuan and Li, Junyi Jessy and Choi, Eunsol},
  year      = 2022,
  booktitle = {Proceedings of the Annual Meeting of the Association for Computational Linguistics},
  note      = {Long paper}
}
```

### Contributions

Thanks to [@carriex](https://github.com/carriex) for adding this dataset.

Provider:
fangyuan
Summary of original information

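Dataset overview

Dataset name

LFQA Discourse

Dataset summary

This dataset contains discourse annotations of long-form answers. There are two types of annotation:

  • Validity: whether a <question, answer> pair is valid, judged against a defined set of invalid reasons.
  • Role: sentence-level functional role annotation of long-form answers.

Language

The dataset contains English data.

Data structure

Data instances

Each instance is a (question, long-form answer) pair from one of four data sources (ELI5, WebGPT, NQ, and model-generated answers, denoted ELI5-model), together with our discourse annotation, which consists of a QA-pair-level validity label and sentence-level functional role labels.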

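Data fields
  • Validity annotations

    • dataset: Source dataset, one of NQ, ELI5, Web-GPT.
    • q_id: Question ID.
    • a_id: Answer ID.
    • question: Question text.
    • answer_paragraph: Answer paragraph.
    • answer_sentences: List of answer sentences.
    • is_valid: Boolean indicating whether the QA pair is valid.
    • invalid_reason: List of invalid reasons.
  • Role annotations

    • dataset: Source dataset.
    • q_id: Question ID.
    • a_id: Answer ID.
    • question: Question text.
    • answer_paragraph: Answer paragraph.
    • answer_sentences: List of answer sentences.
    • role_annotation: List of the majority or adjudicated role for each sentence in answer_sentences.
    • raw_role_annotation: List of raw role annotations.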
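
The two annotation types can be explored with a few lines of Python. The following is a minimal sketch rather than the authors' code: the two records are invented for illustration and only reuse the documented field names, it assumes that each inner list of raw_role_annotation holds the labels given to one sentence, and the strict-majority rule in `majority_role` is an assumption that may differ from the adjudication used to build role_annotation.

```python
from collections import Counter
from typing import Optional

# Hypothetical records: field names follow the dataset card,
# values are invented and do not come from the dataset.
validity_example = {
    "dataset": "ELI5",
    "question": "Why is the sky blue?",
    "is_valid": False,
    # One inner list per annotator, holding the reasons that annotator selected.
    "invalid_reason": [
        ["multiple_questions"],
        ["multiple_questions", "no_valid_answer"],
        [],
    ],
}

role_example = {
    "answer_sentences": [
        "Sunlight scatters in the air.",
        "Blue light scatters the most.",
        "Hope that helps!",
    ],
    # ASSUMPTION: one inner list per sentence, one label per annotator.
    "raw_role_annotation": [
        ["Answer (Summary)", "Answer (Summary)", "Answer"],
        ["Answer", "Answer", "Answer"],
        ["Miscellaneous", "Auxiliary Information", "Answer"],
    ],
}


def tally_invalid_reasons(record: dict) -> Counter:
    """Count how many annotators selected each invalid reason."""
    return Counter(
        reason for selected in record["invalid_reason"] for reason in selected
    )


def majority_role(labels: list) -> Optional[str]:
    """Return the label chosen by a strict majority, or None if there is none."""
    if not labels:
        return None
    label, count = Counter(labels).most_common(1)[0]
    return label if 2 * count > len(labels) else None


reason_counts = tally_invalid_reasons(validity_example)
roles = [majority_role(labels) for labels in role_example["raw_role_annotation"]]
print(dict(reason_counts))
print(roles)
```

A `None` entry in `roles` marks a sentence on which the annotators did not reach a majority; in the dataset itself such cases are presumably the ones resolved by adjudication.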

Data splits

For details on the train/validation/test splits, please refer to the GitHub repository.

License information

The dataset is released under the Creative Commons Attribution-ShareAlike 4.0 International License.

Citation information

@inproceedings{xu2022lfqadiscourse,
  title     = {How Do We Answer Complex Questions: Discourse Structure of Long-form Answers},
  author    = {Xu, Fangyuan and Li, Junyi Jessy and Choi, Eunsol},
  year      = 2022,
  booktitle = {Proceedings of the Annual Meeting of the Association for Computational Linguistics},
  note      = {Long paper}
}
