Orange/webnlg-qa

Name: Orange/webnlg-qa
Creator: Orange
Published: 2024-01-11 13:19:10
License: cc-by-sa-4.0

Source: Hugging Face. Updated 2024-01-11; indexed 2024-06-22.

Download link:

https://hf-mirror.com/datasets/Orange/webnlg-qa

Resource description:

---
license: cc-by-sa-4.0
dataset_info:
  features:
  - name: category
    dtype: string
  - name: size
    dtype: int32
  - name: id
    dtype: string
  - name: eid
    dtype: string
  - name: original_triple_sets
    list:
    - name: subject
      dtype: string
    - name: property
      dtype: string
    - name: object
      dtype: string
  - name: modified_triple_sets
    list:
    - name: subject
      dtype: string
    - name: property
      dtype: string
    - name: object
      dtype: string
  - name: shape
    dtype: string
  - name: shape_type
    dtype: string
  - name: lex
    sequence:
    - name: comment
      dtype: string
    - name: lid
      dtype: string
    - name: text
      dtype: string
    - name: lang
      dtype: string
  - name: test_category
    dtype: string
  - name: dbpedia_links
    sequence: string
  - name: links
    sequence: string
  - name: graph
    list:
      list: string
  - name: main_entity
    dtype: string
  - name: mappings
    list:
    - name: modified
      dtype: string
    - name: readable
      dtype: string
    - name: graph
      dtype: string
  - name: dialogue
    list:
    - name: question
      list:
      - name: source
        dtype: string
      - name: text
        dtype: string
    - name: graph_query
      dtype: string
    - name: readable_query
      dtype: string
    - name: graph_answer
      list: string
    - name: readable_answer
      list: string
    - name: type
      list: string
  splits:
  - name: train
    num_bytes: 33200723
    num_examples: 10016
  - name: validation
    num_bytes: 4196972
    num_examples: 1264
  - name: test
    num_bytes: 4990595
    num_examples: 1417
  - name: challenge
    num_bytes: 420551
    num_examples: 100
  download_size: 9637685
  dataset_size: 42808841
task_categories:
- conversational
- question-answering
- text-generation
tags:
- qa
- knowledge-graph
- sparql
language:
- en
---

# Dataset Card for WEBNLG-QA

## Dataset Description

- **Paper:** [SPARQL-to-Text Question Generation for Knowledge-Based Conversational Applications (AACL-IJCNLP 2022)](https://aclanthology.org/2022.aacl-main.11/)
- **Point of Contact:** Gwénolé Lecorvé

### Dataset Summary

WEBNLG-QA is a conversational question answering dataset grounded on WEBNLG. It consists of a set of question-answering dialogues (sequences of follow-up question-answer pairs) based on short paragraphs of text. Each paragraph is associated with a knowledge graph (from WEBNLG). The questions are associated with SPARQL queries.

### Supported tasks

* Knowledge-based question-answering
* SPARQL-to-Text conversion

#### Knowledge-based question-answering

Below is an example of a dialogue:

- Q1: What is used as an instrument in Sludge Metal or in Post-metal?
- A1: Singing, Synthesizer
- Q2: And what about Sludge Metal in particular?
- A2: Singing
- Q3: Does the Year of No Light album Nord belong to this genre?
- A3: Yes.

#### SPARQL-to-Text Question Generation

SPARQL-to-Text question generation refers to the task of converting a SPARQL query into a natural language question, e.g.:

```sparql
SELECT (COUNT(?country) as ?answer)
WHERE {
  ?country property:member_of resource:Europe .
  ?country property:population ?n .
  FILTER ( ?n > 10000000 )
}
```

could be converted into:

```txt
How many European countries have more than 10 million inhabitants?
```

## Dataset Structure

### Types of questions

Comparison of the question types covered by WEBNLG-QA and by related datasets:

| | | [SimpleQuestions](https://huggingface.co/datasets/OrangeInnov/simplequestions-sparqltotext) | [ParaQA](https://huggingface.co/datasets/OrangeInnov/paraqa-sparqltotext) | [LC-QuAD 2.0](https://huggingface.co/datasets/OrangeInnov/lcquad_2.0-sparqltotext) | [CSQA](https://huggingface.co/datasets/OrangeInnov/csqa-sparqltotext) | [WebNLG-QA](https://huggingface.co/datasets/OrangeInnov/webnlg-qa) |
|--------------------------|-----------------|:---------------:|:------:|:-----------:|:----:|:---------:|
| **Number of triplets in query** | 1 | ✓ | ✓ | ✓ | ✓ | ✓ |
| | 2 | | ✓ | ✓ | ✓ | ✓ |
| | More | | | ✓ | ✓ | ✓ |
| **Logical connector between triplets** | Conjunction | ✓ | ✓ | ✓ | ✓ | ✓ |
| | Disjunction | | | | ✓ | ✓ |
| | Exclusion | | | | ✓ | ✓ |
| **Topology of the query graph** | Direct | ✓ | ✓ | ✓ | ✓ | ✓ |
| | Sibling | | ✓ | ✓ | ✓ | ✓ |
| | Chain | | ✓ | ✓ | ✓ | ✓ |
| | Mixed | | | ✓ | | ✓ |
| | Other | | ✓ | ✓ | ✓ | ✓ |
| **Variable typing in the query** | None | ✓ | ✓ | ✓ | ✓ | ✓ |
| | Target variable | | ✓ | ✓ | ✓ | ✓ |
| | Internal variable | | ✓ | ✓ | ✓ | ✓ |
| **Comparison clauses** | None | ✓ | ✓ | ✓ | ✓ | ✓ |
| | String | | | ✓ | | ✓ |
| | Number | | | ✓ | ✓ | ✓ |
| | Date | | | ✓ | | ✓ |
| **Superlative clauses** | No | ✓ | ✓ | ✓ | ✓ | ✓ |
| | Yes | | | | ✓ | |
| **Answer type** | Entity (open) | ✓ | ✓ | ✓ | ✓ | ✓ |
| | Entity (closed) | | | | ✓ | ✓ |
| | Number | | | ✓ | ✓ | ✓ |
| | Boolean | | ✓ | ✓ | ✓ | ✓ |
| **Answer cardinality** | 0 (unanswerable) | | | ✓ | | ✓ |
| | 1 | ✓ | ✓ | ✓ | ✓ | ✓ |
| | More | | ✓ | ✓ | ✓ | ✓ |
| **Number of target variables** | 0 (⇒ ASK verb) | | ✓ | ✓ | ✓ | ✓ |
| | 1 | ✓ | ✓ | ✓ | ✓ | ✓ |
| | 2 | | | ✓ | | ✓ |
| **Dialogue context** | Self-sufficient | ✓ | ✓ | ✓ | ✓ | ✓ |
| | Coreference | | | | ✓ | ✓ |
| | Ellipsis | | | | ✓ | ✓ |
| **Meaning** | Meaningful | ✓ | ✓ | ✓ | ✓ | ✓ |
| | Nonsense | | | | | ✓ |

### Data splits

Text verbalization is only available for a subset of the test set, referred to as the *challenge set*. The other samples contain only dialogues in the form of follow-up SPARQL queries.

| | Train | Validation | Test | Challenge |
| --------------------- | ---------- | ---------- | ---------- | ------------ |
| Questions | 27727 | 3485 | 4179 | 332 |
| Dialogues | 10016 | 1264 | 1417 | 100 |
| NL questions per query | 0 | 0 | 0 | 2 |
| Characters per query | 129 (± 43) | 131 (± 45) | 122 (± 45) | 113 (± 38) |
| Tokens per question | - | - | - | 8.4 (± 4.5) |

## Additional information

### Related datasets

This corpus is part of a set of 5 datasets released for SPARQL-to-Text generation, namely:

- Non-conversational datasets
  - [SimpleQuestions](https://huggingface.co/datasets/OrangeInnov/simplequestions-sparqltotext) (from https://github.com/askplatypus/wikidata-simplequestions)
  - [ParaQA](https://huggingface.co/datasets/OrangeInnov/paraqa-sparqltotext) (from https://github.com/barshana-banerjee/ParaQA)
  - [LC-QuAD 2.0](https://huggingface.co/datasets/OrangeInnov/lcquad_2.0-sparqltotext) (from http://lc-quad.sda.tech/)
- Conversational datasets
  - [CSQA](https://huggingface.co/datasets/OrangeInnov/csqa-sparqltotext) (from https://amritasaha1812.github.io/CSQA/)
  - [WebNLG-QA](https://huggingface.co/datasets/OrangeInnov/webnlg-qa) (derived from https://gitlab.com/shimorina/webnlg-dataset/-/tree/master/release_v3.0)

### Licensing information

* Content from original dataset: CC BY-SA 4.0
* New content: CC BY-SA 4.0

### Citation information

#### This dataset

```bibtex
@inproceedings{lecorve2022sparql2text,
  title={SPARQL-to-Text Question Generation for Knowledge-Based Conversational Applications},
  author={Lecorv\'e, Gw\'enol\'e and Veyret, Morgan and Brabant, Quentin and Rojas-Barahona, Lina M.},
  booktitle={Proceedings of the Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the International Joint Conference on Natural Language Processing (AACL-IJCNLP)},
  year={2022}
}
```

#### The underlying corpus WEBNLG 3.0

```bibtex
@inproceedings{castro-ferreira-etal-2020-2020,
  title = "The 2020 Bilingual, Bi-Directional {W}eb{NLG}+ Shared Task: Overview and Evaluation Results ({W}eb{NLG}+ 2020)",
  author = "Castro Ferreira, Thiago and Gardent, Claire and Ilinykh, Nikolai and van der Lee, Chris and Mille, Simon and Moussallem, Diego and Shimorina, Anastasia",
  booktitle = "Proceedings of the 3rd International Workshop on Natural Language Generation from the Semantic Web (WebNLG+)",
  year = "2020",
  pages = "55--76"
}
```

Provider:

Orange

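Summary of the original information

Dataset overview

Dataset description

Dataset summary

WEBNLG-QA is a conversational question answering dataset grounded on WEBNLG. It contains question-answering dialogues (follow-up question-answer pairs) based on short paragraphs of text. Each paragraph is associated with a knowledge graph (from WEBNLG), and each question is associated with a SPARQL query.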

Supported tasks

Knowledge-based question answering
SPARQL-to-Text conversion
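
As a rough illustration of the SPARQL-to-Text task, the sketch below collects (query, question) pairs from the turns of one dialogue. The field names `readable_query`, `question` and `text` come from the dataset card; the function name `sparql_to_text_pairs`, the `source` value and the sample dialogue are assumptions made for this example, and the query and question reuse the illustration from the card rather than an actual dataset record.

```python
from typing import Dict, List, Tuple

# Query and question taken from the card's illustration of the task.
QUERY = (
    "SELECT (COUNT(?country) as ?answer) WHERE { "
    "?country property:member_of resource:Europe . "
    "?country property:population ?n . "
    "FILTER ( ?n > 10000000 ) }"
)
QUESTION = "How many European countries have more than 10 million inhabitants?"

# Hypothetical dialogue: the first turn has one verbalized question,
# the second has none (as in splits without text verbalization).
SAMPLE_DIALOGUE: List[Dict] = [
    {"readable_query": QUERY, "question": [{"source": "example", "text": QUESTION}]},
    {"readable_query": QUERY, "question": []},
]


def sparql_to_text_pairs(dialogue: List[Dict]) -> List[Tuple[str, str]]:
    """Return one (SPARQL query, natural language question) pair per verbalization."""
    return [
        (turn["readable_query"], question["text"])
        for turn in dialogue
        for question in turn["question"]
    ]


pairs = sparql_to_text_pairs(SAMPLE_DIALOGUE)
print(len(pairs))
```

Turns without any verbalized question contribute no pair, so under this sketch only records from a split with natural language questions would yield training or evaluation examples.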

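Dataset structure

Feature description

category: category, string.
size: size of the entry, 32-bit integer (int32).
id: identifier, string.
eid: entry identifier, string.
original_triple_sets: original triple sets, a list of triples with subject, property and object, all strings.
modified_triple_sets: modified triple sets, a list of triples with subject, property and object, all strings.
shape: shape, string.
shape_type: shape type, string.
lex: lexicalizations, a sequence with comment, lid (lexicalization identifier), text and lang (language), all strings.
test_category: test category, string.
dbpedia_links: DBpedia links, sequence of strings.
links: links, sequence of strings.
graph: graph, a list of lists of strings.
main_entity: main entity, string.
mappings: mappings, a list of entries with modified, readable and graph, all strings.
dialogue: dialogue, a list of turns. Each turn contains question (a list of entries with source and text, both strings), graph_query and readable_query (strings), and graph_answer, readable_answer and type (lists of strings).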
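
To make the nesting concrete, here is a minimal sketch of how one record might look in Python and how it could be summarized. It assumes that `list` features are materialized as lists of dicts and that the `sequence` feature `lex` is materialized as a dict of parallel lists, which is how the Hugging Face `datasets` library usually exposes these types. All values in `SAMPLE_RECORD` are invented placeholders, and `lex_rows` and `summarize` are hypothetical helper names.

```python
from typing import Any, Dict, List

# Invented sample mirroring the feature names above; values are placeholders.
SAMPLE_RECORD: Dict[str, Any] = {
    "id": "example-0",
    "main_entity": "Example_entity",
    "modified_triple_sets": [
        {"subject": "Example_entity", "property": "genre", "object": "Example_genre"},
    ],
    # Assumed layout for a `sequence` of named fields: one list per sub-field.
    "lex": {
        "comment": ["good", "good"],
        "lid": ["Id1", "Id2"],
        "text": ["First placeholder text.", "Second placeholder text."],
        "lang": ["en", "en"],
    },
    # Assumed layout for a `list` of structs: one dict per dialogue turn.
    "dialogue": [
        {
            "question": [{"source": "example", "text": "Placeholder question?"}],
            "graph_query": "",
            "readable_query": "",
            "graph_answer": [],
            "readable_answer": [],
            "type": [],
        },
        {
            "question": [],
            "graph_query": "",
            "readable_query": "",
            "graph_answer": [],
            "readable_answer": [],
            "type": [],
        },
    ],
}


def lex_rows(lex: Dict[str, List[str]]) -> List[Dict[str, str]]:
    """Convert the column-oriented `lex` field into one dict per lexicalization."""
    keys = ["comment", "lid", "text", "lang"]
    return [dict(zip(keys, values)) for values in zip(*(lex[key] for key in keys))]


def summarize(record: Dict[str, Any]) -> Dict[str, int]:
    """Count triples, dialogue turns and verbalized questions in one record."""
    return {
        "triples": len(record["modified_triple_sets"]),
        "turns": len(record["dialogue"]),
        "questions": sum(len(turn["question"]) for turn in record["dialogue"]),
    }


rows = lex_rows(SAMPLE_RECORD["lex"])
summary = summarize(SAMPLE_RECORD)
print(rows[0]["text"], summary)
```

With the `datasets` library installed, a real record could presumably be obtained with `load_dataset("Orange/webnlg-qa", split="challenge")[0]` and passed to the same helpers, provided the layout assumptions above hold.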

Data splits

train: training set, 10016 examples, 33200723 bytes.
validation: validation set, 1264 examples, 4196972 bytes.
test: test set, 1417 examples, 4990595 bytes.
challenge: challenge set, 100 examples, 420551 bytes.
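
Combining these example counts (one example is one dialogue) with the question counts reported in the dataset card gives the average number of questions per dialogue. The short computation below is only a sanity check on the published figures; the variable names are chosen for this sketch.

```python
# Question counts per split, as reported in the dataset card.
QUESTIONS = {"train": 27727, "validation": 3485, "test": 4179, "challenge": 332}
# Dialogue (example) counts per split, as listed above.
DIALOGUES = {"train": 10016, "validation": 1264, "test": 1417, "challenge": 100}

# Average number of questions per dialogue, rounded to two decimals.
questions_per_dialogue = {
    split: round(QUESTIONS[split] / DIALOGUES[split], 2) for split in QUESTIONS
}

for split, average in questions_per_dialogue.items():
    print(f"{split}: {average} questions per dialogue")
```

Every split averages roughly three questions per dialogue, with the challenge set slightly higher at 3.32.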

Dataset size

Download size: 9637685 bytes
Dataset size: 42808841 bytes
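
The reported sizes can be cross-checked with a few lines of arithmetic: the per-split byte counts should add up to the dataset size, and the ratio between dataset size and download size indicates how much smaller the downloaded files are. This is a sketch over the published numbers only; the variable names are illustrative.

```python
# Byte counts per split, from the metadata above.
SPLIT_BYTES = {
    "train": 33200723,
    "validation": 4196972,
    "test": 4990595,
    "challenge": 420551,
}
DATASET_SIZE = 42808841   # bytes
DOWNLOAD_SIZE = 9637685   # bytes

# Sum of the split sizes, to compare against the reported dataset size.
total_bytes = sum(SPLIT_BYTES.values())

# Ratio of the dataset size to the download size.
size_ratio = DATASET_SIZE / DOWNLOAD_SIZE

print(total_bytes == DATASET_SIZE, round(size_ratio, 2))
```

The split sizes sum exactly to the reported dataset size, and the dataset is about 4.44 times larger than the download.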

Task categories

Conversational
Question answering
Text generation

Language

English
