fangyuan/lfqa_discourse
Source: Hugging Face. Updated 2024-02-28; indexed 2024-03-04.
Download link: https://hf-mirror.com/datasets/fangyuan/lfqa_discourse
---
license: cc
language:
- en
size_categories:
- 1K<n<10K
---
# Dataset Card for LFQA Discourse
## Table of Contents
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
- [Dataset Summary](#dataset-summary)
- [Languages](#languages)
- [Dataset Structure](#dataset-structure)
- [Data Instances](#data-instances)
- [Data Fields](#data-fields)
- [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
- [Additional Information](#additional-information)
- [Dataset Curators](#dataset-curators)
- [Licensing Information](#licensing-information)
- [Citation Information](#citation-information)
- [Contributions](#contributions)
## Dataset Description
- **Repository:** [Repo](https://github.com/utcsnlp/lfqa_discourse)
- **Paper:** [How Do We Answer Complex Questions: Discourse Structure of Long-form Answers](https://arxiv.org/abs/2203.11048)
- **Point of Contact:** fangyuan[at]utexas.edu
### Dataset Summary
This dataset contains discourse annotations of long-form answers. There are two types of annotation:
* **Validity:** whether a (question, answer) pair is valid, judged against a predefined set of invalidity reasons.
* **Role:** sentence-level annotation of the functional role each sentence plays in a long-form answer.
### Languages
The dataset contains data in English.
## Dataset Structure
### Data Instances
Each instance is a (question, long-form answer) pair drawn from one of four data sources (ELI5, WebGPT, NQ, and model-generated answers, denoted ELI5-model), together with our discourse annotation, which consists of a QA-pair-level validity label and sentence-level functional role labels.
We provide all validity and role annotations here. For the train/validation/test splits, please refer to our [GitHub repository](https://github.com/utcsnlp/lfqa_discourse).
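As a rough illustration of how the four sources can be told apart, the sketch below maps the `dataset` and `a_id` fields (described under Data Fields) to a source label. The function name `source_label` and the `model_names` argument are assumptions made for illustration: this card does not list the model names stored in `a_id`, so the caller has to supply them, and `"some-model"` is a placeholder rather than a real value.

```python
from typing import Collection


def source_label(dataset: str, a_id: str, model_names: Collection[str]) -> str:
    """Return a source label for a QA pair.

    `model_names` is a caller-supplied collection of `a_id` values that
    denote model-generated answers (an assumption: the card does not
    enumerate them).
    """
    if dataset == "ELI5" and a_id in model_names:
        return "ELI5-model"
    # For every other case the `dataset` field is used unchanged.
    return dataset


# Hypothetical values, not taken from the dataset.
print(source_label("ELI5", "some-model", {"some-model"}))
print(source_label("ELI5", "abc123", {"some-model"}))
print(source_label("NQ", "1", {"some-model"}))
```

Passing the model names in explicitly keeps the sketch from guessing at values that only the data itself can confirm.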
### Data Fields
For validity annotations, each instance contains the following fields:
* `dataset`: The dataset this QA pair belongs to, one of [`NQ`, `ELI5`, `Web-GPT`]. Note that `ELI5` contains both human-written and model-generated answers; model-generated answers are distinguished by the `a_id` field described below.
* `q_id`: The question id, same as in the original NQ or ELI5 dataset.
* `a_id`: The answer id, same as in the original ELI5 dataset. For NQ, we populate a dummy `a_id` (1). For machine-generated answers, this field holds the name of the model.
* `question`: The question.
* `answer_paragraph`: The answer paragraph.
* `answer_sentences`: The list of answer sentences, tokenized from the answer paragraph.
* `is_valid`: A boolean value indicating whether the QA pair is valid, values: [`True`, `False`].
* `invalid_reason`: A list of lists; each inner list contains the invalid reasons one annotator selected. Each invalid reason is one of [`no_valid_answer`, `nonsensical_question`, `assumptions_rejected`, `multiple_questions`].
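To make the shape of a validity instance concrete, here is a minimal sketch in Python. The `ValidityInstance` type, the `count_invalid_reasons` helper, the field types (for example, string ids), and the example record are all assumptions made for illustration; only the field names and the reason labels come from the list above.

```python
from collections import Counter
from typing import List, TypedDict


class ValidityInstance(TypedDict):
    """Assumed shape of one validity record; types are a guess from the card."""
    dataset: str
    q_id: str
    a_id: str
    question: str
    answer_paragraph: str
    answer_sentences: List[str]
    is_valid: bool
    invalid_reason: List[List[str]]


def count_invalid_reasons(instance: ValidityInstance) -> Counter:
    """Count how many annotators selected each invalid reason."""
    counts: Counter = Counter()
    for annotator_reasons in instance["invalid_reason"]:
        # A set ensures one annotator contributes at most once per reason.
        counts.update(set(annotator_reasons))
    return counts


# Made-up record for illustration only; not taken from the dataset.
example: ValidityInstance = {
    "dataset": "NQ",
    "q_id": "q-0",
    "a_id": "1",
    "question": "placeholder question",
    "answer_paragraph": "First sentence. Second sentence.",
    "answer_sentences": ["First sentence.", "Second sentence."],
    "is_valid": False,
    "invalid_reason": [
        ["no_valid_answer"],
        ["no_valid_answer", "multiple_questions"],
        [],
    ],
}

print(count_invalid_reasons(example))
```

Counting per annotator rather than per label occurrence is a design choice of this sketch: it reads each inner list as one annotator's selection, as the field description suggests.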
For role annotations, each instance contains the following fields:
* `dataset`: The dataset this QA pair belongs to, one of [`NQ`, `ELI5`, `Web-GPT`]. Note that `ELI5` contains both human-written and model-generated answers; model-generated answers are distinguished by the `a_id` field described below.
* `q_id`: The question id, same as in the original NQ or ELI5 dataset.
* `a_id`: The answer id, same as in the original ELI5 dataset. For NQ, we populate a dummy `a_id` (1). For machine-generated answers, this field holds the name of the model.
* `question`: The question.
* `answer_paragraph`: The answer paragraph.
* `answer_sentences`: The list of answer sentences, tokenized from the answer paragraph.
* `role_annotation`: The list of majority roles (or adjudicated roles, where they exist) for the sentences in `answer_sentences`. Each role is one of [`Answer`, `Answer - Example`, `Answer (Summary)`, `Auxiliary Information`, `Answer - Organizational sentence`, `Miscellaneous`].
* `raw_role_annotation`: A list of lists; each inner list contains raw role annotations for the sentences in `answer_sentences`.
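The role fields can be read by aligning `role_annotation` with `answer_sentences`. The sketch below assumes the two lists have the same length and order, which the field description implies but does not state outright. The helper names and the example record are hypothetical, and the sketch does not treat sentences without a majority or adjudicated role specially.

```python
from collections import Counter
from typing import Any, Dict, List, Tuple


def pair_sentences_with_roles(instance: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Pair each answer sentence with its majority (or adjudicated) role."""
    sentences = instance["answer_sentences"]
    roles = instance["role_annotation"]
    if len(sentences) != len(roles):
        # Assumed invariant: one role label per sentence.
        raise ValueError("answer_sentences and role_annotation differ in length")
    return list(zip(sentences, roles))


def role_distribution(instance: Dict[str, Any]) -> Counter:
    """Count how often each role occurs within one answer."""
    return Counter(role for _, role in pair_sentences_with_roles(instance))


# Made-up record for illustration only; not taken from the dataset.
example = {
    "answer_sentences": [
        "It depends on the context.",
        "For instance, consider X.",
        "So in short, yes.",
    ],
    "role_annotation": ["Answer", "Answer - Example", "Answer (Summary)"],
}

for sentence, role in pair_sentences_with_roles(example):
    print(f"[{role}] {sentence}")
```

The explicit length check is there so that a misaligned record fails loudly instead of being silently truncated by `zip`.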
### Data Splits
For train/validation/test splits, please refer to our [repository](https://github.com/utcsnlp/lfqa_discourse).
## Dataset Creation
Please refer to our [paper](https://arxiv.org/abs/2203.11048) and datasheet for details on dataset creation, the annotation process, and a discussion of limitations.
## Additional Information
### Licensing Information
CC BY-SA 4.0: https://creativecommons.org/licenses/by-sa/4.0/legalcode
### Citation Information
```
@inproceedings{xu2022lfqadiscourse,
title = {How Do We Answer Complex Questions: Discourse Structure of Long-form Answers},
author = {Xu, Fangyuan and Li, Junyi Jessy and Choi, Eunsol},
year = 2022,
booktitle = {Proceedings of the Annual Meeting of the Association for Computational Linguistics},
note = {Long paper}
}
```
### Contributions
Thanks to [@carriex](https://github.com/carriex) for adding this dataset.
Provider: fangyuan