fangyuan/lfqa_discourse

Name: fangyuan/lfqa_discourse
Creator: fangyuan
Published: 2024-02-28 16:20:37
License: No description available

Hugging Face · Updated 2024-02-28 · Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/fangyuan/lfqa_discourse

Resource description:

---
license: cc
language:
- en
size_categories:
- 1K<n<10K
---

# Dataset Card for LFQA Discourse

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Repository:** [Repo](https://github.com/utcsnlp/lfqa_discourse)
- **Paper:** [How Do We Answer Complex Questions: Discourse Structure of Long-form Answers](https://arxiv.org/abs/2203.11048)
- **Point of Contact:** fangyuan[at]utexas.edu

### Dataset Summary

This dataset contains discourse annotations of long-form answers. There are two types of annotation:

* **Validity:** whether a <question, answer> pair is valid, judged against a defined set of invalid reasons.
* **Role:** sentence-level annotation of functional roles for long-form answers.

### Languages

The dataset contains data in English.

## Dataset Structure

### Data Instances

Each instance is a (question, long-form answer) pair from one of four data sources: ELI5, WebGPT, NQ, and model-generated answers (denoted ELI5-model). Each pair carries our discourse annotation, which consists of a QA-pair-level validity label and sentence-level functional role labels.

We provide all validity and role annotations here. For the train/val/test split, please refer to our [GitHub repository](https://github.com/utcsnlp/lfqa_discourse).

### Data Fields

For validity annotations, each instance contains the following fields:

* `dataset`: The dataset this QA pair belongs to, one of [`NQ`, `ELI5`, `Web-GPT`]. Note that `ELI5` contains both human-written and model-generated answers; model-generated answers are distinguished by the `a_id` field described below.
* `q_id`: The question id, same as in the original NQ or ELI5 dataset.
* `a_id`: The answer id, same as in the original ELI5 dataset. For NQ, we populate a dummy `a_id` (1). For machine-generated answers, this field holds the name of the model.
* `question`: The question.
* `answer_paragraph`: The answer paragraph.
* `answer_sentences`: The list of answer sentences, tokenized from the answer paragraph.
* `is_valid`: A boolean value indicating whether the QA pair is valid, values: [`True`, `False`].
* `invalid_reason`: A list of lists; each inner list contains the invalid reasons an annotator selected. Each invalid reason is one of [`no_valid_answer`, `nonsensical_question`, `assumptions_rejected`, `multiple_questions`].

For role annotations, each instance contains the following fields:

* `dataset`: The dataset this QA pair belongs to, one of [`NQ`, `ELI5`, `Web-GPT`]. Note that `ELI5` contains both human-written and model-generated answers; model-generated answers are distinguished by the `a_id` field described below.
* `q_id`: The question id, same as in the original NQ or ELI5 dataset.
* `a_id`: The answer id, same as in the original ELI5 dataset. For NQ, we populate a dummy `a_id` (1). For machine-generated answers, this field holds the name of the model.
* `question`: The question.
* `answer_paragraph`: The answer paragraph.
* `answer_sentences`: The list of answer sentences, tokenized from the answer paragraph.
* `role_annotation`: The list of majority roles (or adjudicated roles, where they exist) for the sentences in `answer_sentences`. Each role is one of [`Answer`, `Answer - Example`, `Answer (Summary)`, `Auxiliary Information`, `Answer - Organizational sentence`, `Miscellaneous`].
* `raw_role_annotation`: A list of lists; each inner list contains raw role annotations for the sentences in `answer_sentences`.

### Data Splits

For train/validation/test splits, please refer to our [repository](https://github.com/utcsnlp/lfqa_discourse).

## Dataset Creation

Please refer to our [paper](https://arxiv.org/abs/2203.11048) and datasheet for details on dataset creation, the annotation process, and a discussion of limitations.

## Additional Information

### Licensing Information

https://creativecommons.org/licenses/by-sa/4.0/legalcode

### Citation Information

```
@inproceedings{xu2022lfqadiscourse,
  title     = {How Do We Answer Complex Questions: Discourse Structure of Long-form Answers},
  author    = {Xu, Fangyuan and Li, Junyi Jessy and Choi, Eunsol},
  year      = 2022,
  booktitle = {Proceedings of the Annual Meeting of the Association for Computational Linguistics},
  note      = {Long paper}
}
```

### Contributions

Thanks to [@carriex](https://github.com/carriex) for adding this dataset.

Provider:

fangyuan

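Summary of Original Information

Dataset Overview

Dataset Name

LFQA Discourse

Dataset Summary

This dataset contains discourse annotations of long-form answers. There are two types of annotation:

Validity: whether a <question, answer> pair is valid, judged against a defined set of invalid reasons.
Role: sentence-level functional role annotation for long-form answers.

Languages

The dataset contains English data.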

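Data Structure

Data Instances

Each instance is a (question, long-form answer) pair from one of four data sources: ELI5, WebGPT, NQ, and model-generated answers (labeled ELI5-model). Each pair carries a discourse annotation consisting of a QA-pair-level validity label and sentence-level functional role labels.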

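Data Fields

Validity annotations:
- dataset: source dataset, one of NQ, ELI5, Web-GPT.
- q_id: question ID.
- a_id: answer ID.
- question: question text.
- answer_paragraph: answer paragraph.
- answer_sentences: list of answer sentences.
- is_valid: boolean indicating whether the QA pair is valid.
- invalid_reason: list of invalid reasons.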
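The validity fields listed above can be worked with as in the following minimal sketch. The two records are hypothetical and invented purely for illustration, and the sketch assumes that each inner list of `invalid_reason` holds one annotator's selections and is empty when that annotator saw no problem; the dataset card does not spell out that detail. The helper names `valid_pairs` and `invalid_reason_counts` are likewise illustrative, not part of any official tooling.

```python
from collections import Counter
from typing import Any, Dict, List

# Hypothetical validity records, invented for illustration only.
# Assumption: each inner list of `invalid_reason` holds one annotator's
# selected reasons and is empty when that annotator found no problem.
records: List[Dict[str, Any]] = [
    {
        "dataset": "ELI5",
        "q_id": "q-001",
        "a_id": "a-001",
        "question": "Why is the sky blue?",
        "answer_paragraph": "Air scatters blue light more than red light.",
        "answer_sentences": ["Air scatters blue light more than red light."],
        "is_valid": True,
        "invalid_reason": [[], [], []],
    },
    {
        "dataset": "ELI5",
        "q_id": "q-002",
        "a_id": "a-002",
        "question": "What is it and why and how?",
        "answer_paragraph": "I am not sure.",
        "answer_sentences": ["I am not sure."],
        "is_valid": False,
        "invalid_reason": [
            ["no_valid_answer"],
            ["no_valid_answer", "multiple_questions"],
            [],
        ],
    },
]


def valid_pairs(rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep only the QA pairs whose `is_valid` flag is true."""
    return [row for row in rows if row["is_valid"]]


def invalid_reason_counts(rows: List[Dict[str, Any]]) -> Counter:
    """Tally every invalid reason across all inner lists of all rows."""
    counts: Counter = Counter()
    for row in rows:
        for reasons in row["invalid_reason"]:
            counts.update(reasons)
    return counts


if __name__ == "__main__":
    print(len(valid_pairs(records)))
    print(invalid_reason_counts(records))
```

Filtering on `is_valid` before any role-level analysis keeps invalid pairs out of downstream statistics, while the tally gives a quick view of why pairs were rejected.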
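Role annotations:
- dataset: source dataset.
- q_id: question ID.
- a_id: answer ID.
- question: question text.
- answer_paragraph: answer paragraph.
- answer_sentences: list of answer sentences.
- role_annotation: list of the majority (or adjudicated, where one exists) role for each sentence in answer_sentences.
- raw_role_annotation: list of raw role annotations.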
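Since `role_annotation` is described as aligned with `answer_sentences`, the two lists can be zipped to get (sentence, role) pairs. The sketch below shows that, along with a simple majority-vote helper for the votes cast on a single sentence. The record is hypothetical, the function names are illustrative, and returning `None` on a tie is an assumption standing in for the adjudication step the card mentions. How to extract one sentence's votes from `raw_role_annotation` depends on its nesting, which the card does not specify, so the helper takes a flat list of labels.

```python
from collections import Counter
from typing import Any, Dict, List, Optional, Tuple

# Role labels as listed in the dataset card.
ROLES = {
    "Answer",
    "Answer - Example",
    "Answer (Summary)",
    "Auxiliary Information",
    "Answer - Organizational sentence",
    "Miscellaneous",
}

# Hypothetical role-annotation record, invented for illustration only.
record: Dict[str, Any] = {
    "answer_sentences": [
        "Rain forms when water vapor condenses into droplets.",
        "For example, moist air rising over mountains often produces showers.",
        "Hope this helps!",
    ],
    "role_annotation": ["Answer (Summary)", "Answer - Example", "Miscellaneous"],
}


def sentences_with_roles(rec: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Pair each answer sentence with its final role label."""
    sentences = rec["answer_sentences"]
    roles = rec["role_annotation"]
    if len(sentences) != len(roles):
        raise ValueError("answer_sentences and role_annotation differ in length")
    return list(zip(sentences, roles))


def majority_label(votes: List[str]) -> Optional[str]:
    """Return the most frequent label, or None when empty or tied at the top.

    Assumption: a tie is left unresolved here; the dataset itself resolves
    such cases through adjudication.
    """
    if not votes:
        return None
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


if __name__ == "__main__":
    for sentence, role in sentences_with_roles(record):
        print(f"[{role}] {sentence}")
```

Checking the lengths before zipping guards against silently dropping sentences, since `zip` truncates to the shorter list.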

Data Splits

For details on the train/validation/test splits, please refer to the GitHub repository.

License Information

The dataset is released under the Creative Commons Attribution-ShareAlike 4.0 International License.

Citation Information

@inproceedings{xu2022lfqadiscourse,
  title     = {How Do We Answer Complex Questions: Discourse Structure of Long-form Answers},
  author    = {Xu, Fangyuan and Li, Junyi Jessy and Choi, Eunsol},
  year      = 2022,
  booktitle = {Proceedings of the Annual Meeting of the Association for Computational Linguistics},
  note      = {Long paper}
}

The content above was collected and summarized by 遇见数据集.
