
Orange/webnlg-qa

Hugging Face, updated 2024-01-11, indexed 2024-06-22
Download link:
https://hf-mirror.com/datasets/Orange/webnlg-qa
Resource overview:
---
license: cc-by-sa-4.0
dataset_info:
  features:
  - name: category
    dtype: string
  - name: size
    dtype: int32
  - name: id
    dtype: string
  - name: eid
    dtype: string
  - name: original_triple_sets
    list:
    - name: subject
      dtype: string
    - name: property
      dtype: string
    - name: object
      dtype: string
  - name: modified_triple_sets
    list:
    - name: subject
      dtype: string
    - name: property
      dtype: string
    - name: object
      dtype: string
  - name: shape
    dtype: string
  - name: shape_type
    dtype: string
  - name: lex
    sequence:
    - name: comment
      dtype: string
    - name: lid
      dtype: string
    - name: text
      dtype: string
    - name: lang
      dtype: string
  - name: test_category
    dtype: string
  - name: dbpedia_links
    sequence: string
  - name: links
    sequence: string
  - name: graph
    list:
      list: string
  - name: main_entity
    dtype: string
  - name: mappings
    list:
    - name: modified
      dtype: string
    - name: readable
      dtype: string
    - name: graph
      dtype: string
  - name: dialogue
    list:
    - name: question
      list:
      - name: source
        dtype: string
      - name: text
        dtype: string
    - name: graph_query
      dtype: string
    - name: readable_query
      dtype: string
    - name: graph_answer
      list: string
    - name: readable_answer
      list: string
    - name: type
      list: string
  splits:
  - name: train
    num_bytes: 33200723
    num_examples: 10016
  - name: validation
    num_bytes: 4196972
    num_examples: 1264
  - name: test
    num_bytes: 4990595
    num_examples: 1417
  - name: challenge
    num_bytes: 420551
    num_examples: 100
  download_size: 9637685
  dataset_size: 42808841
task_categories:
- conversational
- question-answering
- text-generation
tags:
- qa
- knowledge-graph
- sparql
language:
- en
---

# Dataset Card for WEBNLG-QA

## Dataset Description

- **Paper:** [SPARQL-to-Text Question Generation for Knowledge-Based Conversational Applications (AACL-IJCNLP 2022)](https://aclanthology.org/2022.aacl-main.11/)
- **Point of Contact:** Gwénolé Lecorvé

### Dataset Summary

WEBNLG-QA is a conversational question answering dataset grounded on WEBNLG. It consists of a set of question-answering dialogues (follow-up question-answer pairs) based on short paragraphs of text. Each paragraph is associated with a knowledge graph (from WEBNLG). The questions are associated with SPARQL queries.

### Supported tasks

* Knowledge-based question-answering
* SPARQL-to-Text conversion

#### Knowledge-based question-answering

Below is an example of a dialogue:

- Q1: What is used as an instrument is Sludge Metal or in Post-metal?
- A1: Singing, Synthesizer
- Q2: And what about Sludge Metal in particular?
- A2: Singing
- Q3: Does the Year of No Light album Nord belong to this genre?
- A3: Yes.

#### SPARQL-to-Text Question Generation

SPARQL-to-Text question generation refers to the task of converting a SPARQL query into a natural language question, e.g.:

```sparql
SELECT (COUNT(?country) as ?answer)
WHERE {
  ?country property:member_of resource:Europe .
  ?country property:population ?n .
  FILTER ( ?n > 10000000 )
}
```

could be converted into:

```txt
How many European countries have more than 10 million inhabitants?
```

## Dataset Structure

### Types of questions

Comparison of question types with related datasets:

| | | [SimpleQuestions](https://huggingface.co/datasets/OrangeInnov/simplequestions-sparqltotext) | [ParaQA](https://huggingface.co/datasets/OrangeInnov/paraqa-sparqltotext) | [LC-QuAD 2.0](https://huggingface.co/datasets/OrangeInnov/lcquad_2.0-sparqltotext) | [CSQA](https://huggingface.co/datasets/OrangeInnov/csqa-sparqltotext) | [WebNLG-QA](https://huggingface.co/datasets/OrangeInnov/webnlg-qa) |
|--------------------------|-----------------|:---------------:|:------:|:-----------:|:----:|:---------:|
| **Number of triplets in query** | 1 | ✓ | ✓ | ✓ | ✓ | ✓ |
| | 2 | | ✓ | ✓ | ✓ | ✓ |
| | More | | | ✓ | ✓ | ✓ |
| **Logical connector between triplets** | Conjunction | ✓ | ✓ | ✓ | ✓ | ✓ |
| | Disjunction | | | | ✓ | ✓ |
| | Exclusion | | | | ✓ | ✓ |
| **Topology of the query graph** | Direct | ✓ | ✓ | ✓ | ✓ | ✓ |
| | Sibling | | ✓ | ✓ | ✓ | ✓ |
| | Chain | | ✓ | ✓ | ✓ | ✓ |
| | Mixed | | | ✓ | | ✓ |
| | Other | | ✓ | ✓ | ✓ | ✓ |
| **Variable typing in the query** | None | ✓ | ✓ | ✓ | ✓ | ✓ |
| | Target variable | | ✓ | ✓ | ✓ | ✓ |
| | Internal variable | | ✓ | ✓ | ✓ | ✓ |
| **Comparison clauses** | None | ✓ | ✓ | ✓ | ✓ | ✓ |
| | String | | | ✓ | | ✓ |
| | Number | | | ✓ | ✓ | ✓ |
| | Date | | | ✓ | | ✓ |
| **Superlative clauses** | No | ✓ | ✓ | ✓ | ✓ | ✓ |
| | Yes | | | | ✓ | |
| **Answer type** | Entity (open) | ✓ | ✓ | ✓ | ✓ | ✓ |
| | Entity (closed) | | | | ✓ | ✓ |
| | Number | | | ✓ | ✓ | ✓ |
| | Boolean | | ✓ | ✓ | ✓ | ✓ |
| **Answer cardinality** | 0 (unanswerable) | | | ✓ | | ✓ |
| | 1 | ✓ | ✓ | ✓ | ✓ | ✓ |
| | More | | ✓ | ✓ | ✓ | ✓ |
| **Number of target variables** | 0 (⇒ ASK verb) | | ✓ | ✓ | ✓ | ✓ |
| | 1 | ✓ | ✓ | ✓ | ✓ | ✓ |
| | 2 | | | ✓ | | ✓ |
| **Dialogue context** | Self-sufficient | ✓ | ✓ | ✓ | ✓ | ✓ |
| | Coreference | | | | ✓ | ✓ |
| | Ellipsis | | | | ✓ | ✓ |
| **Meaning** | Meaningful | ✓ | ✓ | ✓ | ✓ | ✓ |
| | Non-sense | | | | | ✓ |

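
Some of the dimensions in the table above (query verb, number of triplets, presence of a comparison clause) can be roughly estimated from the query string alone. The following is a minimal sketch, not the authors' annotation procedure: `query_features` is a hypothetical helper, and it assumes a single `{ ... }` block written with prefixed names (as in the card's example), no `PREFIX` header, and no nested parentheses inside `FILTER`. The `ASK` query used in the test is invented for illustration.

```python
import re


def query_features(sparql: str) -> dict:
    """Return coarse, regex-based features of a single-block SPARQL query."""
    text = " ".join(sparql.split())      # collapse whitespace and newlines
    head, _, body = text.partition("{")  # head holds the verb and projection
    body = body.rsplit("}", 1)[0]        # keep only what is inside the braces
    filter_re = re.compile(r"FILTER\s*\(([^()]*)\)", flags=re.IGNORECASE)
    filters = filter_re.findall(body)
    # Drop FILTER clauses, then split the remaining triple patterns on " . "
    patterns = filter_re.sub(" ", body)
    triples = [t for t in re.split(r"\s\.(?:\s|$)", patterns) if t.strip()]
    words = head.split()
    return {
        "verb": words[0].upper() if words else "",
        "triple_patterns": len(triples),
        "has_comparison": any(re.search(r"[<>]=?|!=", f) for f in filters),
    }


# The example query from the dataset card.
EXAMPLE = """
SELECT (COUNT(?country) as ?answer)
WHERE {
  ?country property:member_of resource:Europe .
  ?country property:population ?n .
  FILTER ( ?n > 10000000 )
}
"""

if __name__ == "__main__":
    print(query_features(EXAMPLE))
```

A real classifier would need a proper SPARQL parser; full IRIs in angle brackets, for instance, would confuse the comparison check used here.
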
### Data splits

Text verbalization is only available for a subset of the test set, referred to as the *challenge set*. Other samples only contain dialogues in the form of follow-up SPARQL queries.

| | Train | Validation | Test | Challenge |
| --------------------- | ---------- | ---------- | ---------- | ------------ |
| Questions | 27727 | 3485 | 4179 | 332 |
| Dialogues | 10016 | 1264 | 1417 | 100 |
| NL questions per query | 0 | 0 | 0 | 2 |
| Characters per query | 129 (± 43) | 131 (± 45) | 122 (± 45) | 113 (± 38) |
| Tokens per question | - | - | - | 8.4 (± 4.5) |

## Additional information

### Related datasets

This corpus is part of a set of 5 datasets released for SPARQL-to-Text generation, namely:

- Non conversational datasets
  - [SimpleQuestions](https://huggingface.co/datasets/OrangeInnov/simplequestions-sparqltotext) (from https://github.com/askplatypus/wikidata-simplequestions)
  - [ParaQA](https://huggingface.co/datasets/OrangeInnov/paraqa-sparqltotext) (from https://github.com/barshana-banerjee/ParaQA)
  - [LC-QuAD 2.0](https://huggingface.co/datasets/OrangeInnov/lcquad_2.0-sparqltotext) (from http://lc-quad.sda.tech/)
- Conversational datasets
  - [CSQA](https://huggingface.co/datasets/OrangeInnov/csqa-sparqltotext) (from https://amritasaha1812.github.io/CSQA/)
  - [WebNLG-QA](https://huggingface.co/datasets/OrangeInnov/webnlg-qa) (derived from https://gitlab.com/shimorina/webnlg-dataset/-/tree/master/release_v3.0)

### Licensing information

* Content from original dataset: CC BY-SA 4.0
* New content: CC BY-SA 4.0

### Citation information

#### This dataset

```bibtex
@inproceedings{lecorve2022sparql2text,
  title={SPARQL-to-Text Question Generation for Knowledge-Based Conversational Applications},
  author={Lecorv\'e, Gw\'enol\'e and Veyret, Morgan and Brabant, Quentin and Rojas-Barahona, Lina M.},
  booktitle={Proceedings of the Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the International Joint Conference on Natural Language Processing (AACL-IJCNLP)},
  year={2022}
}
```

#### The underlying corpus WEBNLG 3.0

```bibtex
@inproceedings{castro-ferreira-etal-2020-2020,
  title = "The 2020 Bilingual, Bi-Directional {W}eb{NLG}+ Shared Task: Overview and Evaluation Results ({W}eb{NLG}+ 2020)",
  author = "Castro Ferreira, Thiago and Gardent, Claire and Ilinykh, Nikolai and van der Lee, Chris and Mille, Simon and Moussallem, Diego and Shimorina, Anastasia",
  booktitle = "Proceedings of the 3rd International Workshop on Natural Language Generation from the Semantic Web (WebNLG+)",
  year = "2020",
  pages = "55--76"
}
```
Provider:
Orange
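Summary of original information

Dataset overview

Dataset description

Dataset summary

WEBNLG-QA is a conversational question answering dataset based on WEBNLG. It contains a set of question-answering dialogues (follow-up question-answer pairs) built on short paragraphs of text; each paragraph is associated with a knowledge graph (from WEBNLG). The questions are associated with SPARQL queries.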

Supported tasks

  • Knowledge-based question answering
  • SPARQL-to-Text conversion

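Dataset structure

Feature description

  • category: category (string).
  • size: size (32-bit integer, int32).
  • id: identifier (string).
  • eid: entry identifier (string).
  • original_triple_sets: original triple sets; a list of triples with subject, property, and object fields, all strings.
  • modified_triple_sets: modified triple sets; a list of triples with subject, property, and object fields, all strings.
  • shape: shape (string).
  • shape_type: shape type (string).
  • lex: lexicalizations; a sequence with comment, lid, text, and lang fields, all strings.
  • test_category: test category (string).
  • dbpedia_links: DBpedia links (sequence of strings).
  • links: links (sequence of strings).
  • graph: graph (list of lists of strings).
  • main_entity: main entity (string).
  • mappings: mappings; a list of entries with modified, readable, and graph fields, all strings.
  • dialogue: dialogue; a list of turns, each with question (a list of entries with source and text strings), graph_query and readable_query (strings), and graph_answer, readable_answer, and type (lists of strings).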
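
To make the nesting of the `dialogue` feature concrete, here is a minimal sketch that flattens one record into (query, question) pairs, the form a SPARQL-to-Text model would train on. The field names come from the schema above; the helper name `query_question_pairs` and every value in `sample` (including the `source` label and the second query) are invented for illustration and are not taken from the dataset.

```python
from typing import Iterator, Tuple


def query_question_pairs(record: dict) -> Iterator[Tuple[str, str]]:
    """Yield (readable_query, question text) pairs from one dataset record."""
    for turn in record["dialogue"]:
        # A turn may carry several verbalizations of the same query, or none.
        for question in turn["question"]:
            yield turn["readable_query"], question["text"]


# Hypothetical record that only mimics the schema; values are made up.
sample = {
    "id": "hypothetical-1",
    "dialogue": [
        {
            "question": [
                {
                    "source": "hypothetical-annotator",
                    "text": "How many European countries have more than 10 million inhabitants?",
                }
            ],
            "readable_query": (
                "SELECT (COUNT(?country) as ?answer) WHERE { "
                "?country property:member_of resource:Europe . "
                "?country property:population ?n . FILTER ( ?n > 10000000 ) }"
            ),
            "type": [],
        },
        {
            # Turns without a natural-language question contribute no pair.
            "question": [],
            "readable_query": "ASK WHERE { resource:A property:p resource:B }",
            "type": [],
        },
    ],
}

if __name__ == "__main__":
    for query, text in query_question_pairs(sample):
        print(text, "<=", query)
```

Turns with an empty `question` list are skipped, which matches the card's note that natural-language questions are only available for the challenge set.
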

Data splits

  • train: training set, 10016 examples, 33200723 bytes.
  • validation: validation set, 1264 examples, 4196972 bytes.
  • test: test set, 1417 examples, 4990595 bytes.
  • challenge: challenge set, 100 examples, 420551 bytes.
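
A single split can presumably be fetched with the Hugging Face `datasets` library. The sketch below assumes that the library is installed and that the repository id `Orange/webnlg-qa` shown on this page resolves on the Hub; `load_split` and `SPLITS` are hypothetical names introduced for illustration.

```python
# Split names as listed above.
SPLITS = ("train", "validation", "test", "challenge")


def load_split(name: str):
    """Load one split of the dataset after checking the split name."""
    if name not in SPLITS:
        raise ValueError(f"unknown split {name!r}; expected one of {SPLITS}")
    # Imported lazily so the name check works without the third-party package.
    from datasets import load_dataset  # pip install datasets

    # Assumption: the repository id matches the one shown on this page.
    return load_dataset("Orange/webnlg-qa", split=name)
```

For example, `load_split("challenge")` would return the 100 challenge dialogues, the only split with natural-language questions; this call needs network access.
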

Dataset size

  • Download size: 9637685 bytes
  • Dataset size: 42808841 bytes
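
As a quick sanity check on the figures above, the per-split byte counts should add up to the reported dataset size. The sketch below uses only numbers quoted on this page; the variable names are illustrative.

```python
# Byte counts per split, as reported in the metadata.
SPLIT_BYTES = {
    "train": 33200723,
    "validation": 4196972,
    "test": 4990595,
    "challenge": 420551,
}
DATASET_SIZE = 42808841   # reported uncompressed size in bytes
DOWNLOAD_SIZE = 9637685   # reported download size in bytes

total = sum(SPLIT_BYTES.values())
# Fraction of the uncompressed size that actually has to be downloaded.
download_ratio = DOWNLOAD_SIZE / DATASET_SIZE

if __name__ == "__main__":
    print(f"sum of splits: {total} bytes")
    print(f"matches dataset size: {total == DATASET_SIZE}")
    print(f"download / dataset size: {download_ratio:.3f}")
```

The split sizes sum exactly to the dataset size, and the download is a little under a quarter of the uncompressed data.
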

Task categories

  • Conversational
  • Question answering
  • Text generation

Tags

  • qa
  • knowledge-graph
  • sparql

Language

  • English