lmqg/qg_squad
Hugging Face, updated 2022-12-02, indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/lmqg/qg_squad
Resource description:
---
license: cc-by-4.0
pretty_name: SQuAD for question generation
language: en
multilinguality: monolingual
size_categories: 10K<n<100K
source_datasets: squad
task_categories:
- text-generation
task_ids:
- language-modeling
tags:
- question-generation
---
# Dataset Card for "lmqg/qg_squad"
## Dataset Description
- **Repository:** [https://github.com/asahi417/lm-question-generation](https://github.com/asahi417/lm-question-generation)
- **Paper:** [https://arxiv.org/abs/2210.03992](https://arxiv.org/abs/2210.03992)
- **Point of Contact:** [Asahi Ushio](http://asahiushio.com/)
### Dataset Summary
This is a subset of [QG-Bench](https://github.com/asahi417/lm-question-generation/blob/master/QG_BENCH.md#datasets), a unified question generation benchmark proposed in
["Generative Language Models for Paragraph-Level Question Generation: A Unified Benchmark and Evaluation"](https://arxiv.org/abs/2210.03992) (EMNLP 2022 main conference).
It is the [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) dataset prepared for the question generation (QG) task. The
train/development/test split follows the ["Neural Question Generation"](https://arxiv.org/abs/1705.00106) work and is
compatible with the [leaderboard](https://paperswithcode.com/sota/question-generation-on-squad11).
### Supported Tasks and Leaderboards
* `question-generation`: The dataset is intended for training question generation models.
Success on this task is typically measured by BLEU4/METEOR/ROUGE-L/BERTScore/MoverScore, where higher is better (see our paper for more detail).
This task has an active leaderboard, which can be found [here](https://paperswithcode.com/sota/question-generation-on-squad11).
### Languages
English (en)
## Dataset Structure
An example of 'train' looks as follows.
```
{
"question": "What is heresy mainly at odds with?",
"paragraph": "Heresy is any provocative belief or theory that is strongly at variance with established beliefs or customs. A heretic is a proponent of such claims or beliefs. Heresy is distinct from both apostasy, which is the explicit renunciation of one's religion, principles or cause, and blasphemy, which is an impious utterance or action concerning God or sacred things.",
"answer": "established beliefs or customs",
"sentence": "Heresy is any provocative belief or theory that is strongly at variance with established beliefs or customs .",
"paragraph_sentence": "<hl> Heresy is any provocative belief or theory that is strongly at variance with established beliefs or customs . <hl> A heretic is a proponent of such claims or beliefs. Heresy is distinct from both apostasy, which is the explicit renunciation of one's religion, principles or cause, and blasphemy, which is an impious utterance or action concerning God or sacred things.",
"paragraph_answer": "Heresy is any provocative belief or theory that is strongly at variance with <hl> established beliefs or customs <hl>. A heretic is a proponent of such claims or beliefs. Heresy is distinct from both apostasy, which is the explicit renunciation of one's religion, principles or cause, and blasphemy, which is an impious utterance or action concerning God or sacred things.",
"sentence_answer": "Heresy is any provocative belief or theory that is strongly at variance with <hl> established beliefs or customs <hl> ."
}
```
The data fields are the same among all splits.
- `question`: a `string` feature.
- `paragraph`: a `string` feature.
- `answer`: a `string` feature.
- `sentence`: a `string` feature.
- `paragraph_answer`: a `string` feature, identical to the paragraph except that the answer is highlighted by the special token `<hl>`.
- `paragraph_sentence`: a `string` feature, identical to the paragraph except that the sentence containing the answer is highlighted by the special token `<hl>`.
- `sentence_answer`: a `string` feature, identical to the sentence except that the answer is highlighted by the special token `<hl>`.
The `paragraph_answer`, `paragraph_sentence`, and `sentence_answer` features are each intended as input for training a question generation model,
but they expose different information. The `paragraph_answer` and `sentence_answer` features are for answer-aware question generation, and
the `paragraph_sentence` feature is for sentence-aware question generation.
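The answer-highlighted fields can be reproduced from the raw fields with a few lines of Python. The sketch below is illustrative only: `highlight` is a hypothetical helper, not part of the lmqg codebase, and it assumes that the answer occurs verbatim in the text and that its first occurrence is the intended one (the original preprocessing may rely on character offsets instead).

```python
HL = "<hl>"  # special highlight token used by the dataset


def highlight(text: str, span: str, token: str = HL) -> str:
    """Wrap the first occurrence of `span` in `text` with the highlight token.

    Hypothetical helper for illustration; assumes `span` appears verbatim.
    """
    start = text.find(span)
    if start == -1:
        raise ValueError("span not found in text")
    end = start + len(span)
    # One space separates the token from the span on each side,
    # matching the layout of the example record.
    return f"{text[:start]}{token} {span} {token}{text[end:]}"


# Values copied from the example record.
sentence = (
    "Heresy is any provocative belief or theory that is strongly at "
    "variance with established beliefs or customs ."
)
answer = "established beliefs or customs"

sentence_answer = highlight(sentence, answer)
print(sentence_answer)
```

Applied to the `paragraph` field instead of `sentence`, the same call yields the layout of `paragraph_answer`. It does not directly produce `paragraph_sentence` for this record, because the `sentence` field is tokenized differently from the paragraph (note the space before the final period).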
## Data Splits
|train|validation|test |
|----:|---------:|----:|
|75722| 10570|11877|
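
As a quick sanity check, the split sizes can be summed and compared against the `size_categories` value declared in the card metadata. This is a minimal sketch using only the numbers from the table; the variable names are illustrative.

```python
# Split sizes copied from the table above.
split_sizes = {"train": 75722, "validation": 10570, "test": 11877}

total = sum(split_sizes.values())
fractions = {name: size / total for name, size in split_sizes.items()}

# The card metadata declares the size category 10K<n<100K.
in_declared_range = 10_000 < total < 100_000

for name, size in split_sizes.items():
    print(f"{name:>10}: {size:>6} ({fractions[name]:.1%})")
print(f"{'total':>10}: {total:>6}")
```

The total is 98169 examples, which falls inside the declared `10K<n<100K` range.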
## Citation Information
```
@inproceedings{ushio-etal-2022-generative,
title = "{G}enerative {L}anguage {M}odels for {P}aragraph-{L}evel {Q}uestion {G}eneration",
author = "Ushio, Asahi and
Alva-Manchego, Fernando and
Camacho-Collados, Jose",
booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing",
month = dec,
year = "2022",
address = "Abu Dhabi, U.A.E.",
publisher = "Association for Computational Linguistics",
}
```
Provider:
lmqg
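Summary of original information
Dataset overview
Basic information
- Name: SQuAD for question generation
- License: cc-by-4.0
- Language: English (en)
- Multilinguality: monolingual
- Size: 10K<n<100K
- Source dataset: squad
- Task category: text generation
- Task ID: language modeling
- Tag: question generation
Dataset description
- Summary: This dataset is part of QG-Bench and targets paragraph-level question generation. It is derived from the SQuAD dataset and prepared specifically for the question generation task. The train/development/test split follows the "Neural Question Generation" work and is compatible with the leaderboard.
- Supported tasks and leaderboards:
  - Task: question generation
  - Evaluation metrics: BLEU4/METEOR/ROUGE-L/BERTScore/MoverScore
  - Leaderboard: https://paperswithcode.com/sota/question-generation-on-squad11
Dataset structure
- Data fields:
  - question: string
  - paragraph: string
  - answer: string
  - sentence: string
  - paragraph_answer: string, with the answer marked by `<hl>`
  - paragraph_sentence: string, with the sentence containing the answer marked by `<hl>`
  - sentence_answer: string, with the answer marked by `<hl>`
- Data splits:
  - train: 75722
  - validation: 10570
  - test: 11877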



