lmqg/qg_squad

Name: lmqg/qg_squad
Creator: lmqg
Published: 2022-12-02 18:51:10
License: cc-by-4.0

Source: Hugging Face | Updated: 2022-12-02 | Indexed: 2024-03-04

Download link:

https://hf-mirror.com/datasets/lmqg/qg_squad

Resource description:

---
license: cc-by-4.0
pretty_name: SQuAD for question generation
language: en
multilinguality: monolingual
size_categories: 10K<n<100K
source_datasets: squad
task_categories:
- text-generation
task_ids:
- language-modeling
tags:
- question-generation
---

# Dataset Card for "lmqg/qg_squad"

## Dataset Description

- **Repository:** [https://github.com/asahi417/lm-question-generation](https://github.com/asahi417/lm-question-generation)
- **Paper:** [https://arxiv.org/abs/2210.03992](https://arxiv.org/abs/2210.03992)
- **Point of Contact:** [Asahi Ushio](http://asahiushio.com/)

### Dataset Summary

This is a subset of [QG-Bench](https://github.com/asahi417/lm-question-generation/blob/master/QG_BENCH.md#datasets), a unified question generation benchmark proposed in ["Generative Language Models for Paragraph-Level Question Generation: A Unified Benchmark and Evaluation, EMNLP 2022 main conference"](https://arxiv.org/abs/2210.03992). It is the [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) dataset prepared for the question generation (QG) task. The train/development/test split follows the ["Neural Question Generation"](https://arxiv.org/abs/1705.00106) work and is compatible with the [leaderboard](https://paperswithcode.com/sota/question-generation-on-squad11).

### Supported Tasks and Leaderboards

* `question-generation`: The dataset is intended for training question generation models. Success on this task is typically measured by BLEU4/METEOR/ROUGE-L/BERTScore/MoverScore, where higher is better (see our paper for details). This task has an active leaderboard, which can be found [here](https://paperswithcode.com/sota/question-generation-on-squad11).

### Languages

English (en)

## Dataset Structure

An example of 'train' looks as follows.

```
{
  "question": "What is heresy mainly at odds with?",
  "paragraph": "Heresy is any provocative belief or theory that is strongly at variance with established beliefs or customs. A heretic is a proponent of such claims or beliefs. Heresy is distinct from both apostasy, which is the explicit renunciation of one's religion, principles or cause, and blasphemy, which is an impious utterance or action concerning God or sacred things.",
  "answer": "established beliefs or customs",
  "sentence": "Heresy is any provocative belief or theory that is strongly at variance with established beliefs or customs .",
  "paragraph_sentence": "<hl> Heresy is any provocative belief or theory that is strongly at variance with established beliefs or customs . <hl> A heretic is a proponent of such claims or beliefs. Heresy is distinct from both apostasy, which is the explicit renunciation of one's religion, principles or cause, and blasphemy, which is an impious utterance or action concerning God or sacred things.",
  "paragraph_answer": "Heresy is any provocative belief or theory that is strongly at variance with <hl> established beliefs or customs <hl>. A heretic is a proponent of such claims or beliefs. Heresy is distinct from both apostasy, which is the explicit renunciation of one's religion, principles or cause, and blasphemy, which is an impious utterance or action concerning God or sacred things.",
  "sentence_answer": "Heresy is any provocative belief or theory that is strongly at variance with <hl> established beliefs or customs <hl> ."
}
```

The data fields are the same among all splits.

- `question`: a `string` feature.
- `paragraph`: a `string` feature.
- `answer`: a `string` feature.
- `sentence`: a `string` feature.
- `paragraph_answer`: a `string` feature, which is the same as the paragraph but with the answer highlighted by a special token `<hl>`.
- `paragraph_sentence`: a `string` feature, which is the same as the paragraph but with the sentence containing the answer highlighted by a special token `<hl>`.
- `sentence_answer`: a `string` feature, which is the same as the sentence but with the answer highlighted by a special token `<hl>`.

Each of the `paragraph_answer`, `paragraph_sentence`, and `sentence_answer` features is intended for training a question generation model, but each carries different information. The `paragraph_answer` and `sentence_answer` features are for answer-aware question generation, and the `paragraph_sentence` feature is for sentence-aware question generation.

## Data Splits

|train|validation|test |
|----:|---------:|----:|
|75722|     10570|11877|

## Citation Information

```
@inproceedings{ushio-etal-2022-generative,
    title = "{G}enerative {L}anguage {M}odels for {P}aragraph-{L}evel {Q}uestion {G}eneration",
    author = "Ushio, Asahi and Alva-Manchego, Fernando and Camacho-Collados, Jose",
    booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, U.A.E.",
    publisher = "Association for Computational Linguistics",
}
```
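
As a rough illustration of how the three highlighted features can feed a sequence-to-sequence model, the sketch below maps one record to an (input, target) pair and recovers the highlighted span from the `<hl>` markers. The helper names `to_pair` and `highlighted_span` are hypothetical and not part of the dataset or of the lmqg code base; the record is a shortened copy of the example above.

```python
from typing import Dict, Tuple

HL = "<hl>"
# The three input features named in the dataset card.
INPUT_FIELDS = ("paragraph_answer", "paragraph_sentence", "sentence_answer")


def highlighted_span(text: str) -> str:
    """Return the text between the two <hl> markers, without surrounding spaces."""
    parts = text.split(HL)
    if len(parts) != 3:
        raise ValueError(f"expected exactly two {HL} markers, found {len(parts) - 1}")
    return parts[1].strip()


def to_pair(record: Dict[str, str], input_field: str = "paragraph_answer") -> Tuple[str, str]:
    """Map one record to a (model input, target question) pair (hypothetical helper)."""
    if input_field not in INPUT_FIELDS:
        raise ValueError(f"input_field must be one of {INPUT_FIELDS}")
    return record[input_field], record["question"]


# Shortened copy of the 'train' example from the card.
record = {
    "question": "What is heresy mainly at odds with?",
    "answer": "established beliefs or customs",
    "sentence_answer": (
        "Heresy is any provocative belief or theory that is strongly at "
        "variance with <hl> established beliefs or customs <hl> ."
    ),
}

source, target = to_pair(record, "sentence_answer")
print(highlighted_span(source))
print(target)
```

Choosing `paragraph_sentence` instead would give the sentence-aware setting, where the highlighted span is the whole sentence rather than the answer.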

Provider:

lmqg

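Summary of original information

Dataset overview

Basic information

Name: SQuAD for question generation
License: cc-by-4.0
Language: English (en)
Multilinguality: monolingual
Size: 10K<n<100K
Source dataset: squad
Task category: text generation
Task ID: language modeling
Tags: question generation

Dataset description

Summary: This dataset is part of QG-Bench and targets paragraph-level question generation. It is the SQuAD dataset prepared for the question generation task. The train/development/test split follows the "Neural Question Generation" work and is compatible with the leaderboard.
Supported tasks and leaderboards:
- Task: question generation
- Evaluation metrics: BLEU4/METEOR/ROUGE-L/BERTScore/MoverScore
- Leaderboard: https://paperswithcode.com/sota/question-generation-on-squad11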
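
To give a feel for what one of the listed metrics measures, here is a minimal sketch of a ROUGE-L style score: the balanced F-measure over the longest common subsequence of whitespace tokens. It is a simplification under stated assumptions (lowercasing, whitespace tokenization, equal weight on precision and recall) and is not the scorer used for the paper or the leaderboard; the function names are illustrative.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """Simplified ROUGE-L: F1 of LCS-based precision and recall."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1(
    "What is heresy mainly at odds with?",  # reference question
    "What is heresy at odds with?",         # hypothetical model output
)
print(f"{score:.3f}")
```

Production scorers typically differ in tokenization and stemming, so values from this sketch should not be compared with published numbers.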

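Dataset structure

Data fields:
- question: string
- paragraph: string
- answer: string
- sentence: string
- paragraph_answer: string, with the answer marked by <hl>
- paragraph_sentence: string, with the sentence containing the answer marked by <hl>
- sentence_answer: string, with the answer marked by <hl>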
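
The highlighted fields above can be approximated from the plain fields. The sketch below wraps the first occurrence of the answer in `<hl>` tokens, which reproduces the `paragraph_answer` value of the published example. Matching the first occurrence is an assumption made here for brevity; the dataset's own preprocessing may locate the answer differently (for instance by character offset), and `add_highlight` is a hypothetical name.

```python
def add_highlight(text: str, span: str, token: str = "<hl>") -> str:
    """Wrap the first occurrence of `span` in `text` with highlight tokens."""
    start = text.find(span)
    if start < 0:
        raise ValueError("span not found in text")
    end = start + len(span)
    return f"{text[:start]}{token} {span} {token}{text[end:]}"


# First two sentences of the example paragraph from the dataset card.
paragraph = (
    "Heresy is any provocative belief or theory that is strongly at variance "
    "with established beliefs or customs. A heretic is a proponent of such "
    "claims or beliefs."
)
answer = "established beliefs or customs"

paragraph_answer = add_highlight(paragraph, answer)
print(paragraph_answer)
```
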
Data splits:
- train: 75722
- validation: 10570
- test: 11877
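
The split sizes sum to 98,169 examples, which is consistent with the 10K<n<100K size category. A short sketch of that arithmetic, with each split's share of the total:

```python
# Split sizes as listed for lmqg/qg_squad.
splits = {"train": 75722, "validation": 10570, "test": 11877}

total = sum(splits.values())
shares = {name: count / total for name, count in splits.items()}

for name, count in splits.items():
    print(f"{name:<10} {count:>6} {shares[name]:6.1%}")
print(f"{'total':<10} {total:>6}")
```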

