Orange/csqa-sparqltotext

Name: Orange/csqa-sparqltotext
Creator: Orange
Published: 2024-01-11 13:15:33
License: CC BY-SA 4.0

Source: Hugging Face. Updated 2024-01-11; indexed 2024-06-22.

Download link:

https://hf-mirror.com/datasets/Orange/csqa-sparqltotext

Resource description:

---
license: cc-by-sa-4.0
dataset_info:
  features:
  - name: id
    dtype: string
  - name: turns
    list:
    - name: id
      dtype: int64
    - name: ques_type_id
      dtype: int64
    - name: question-type
      dtype: string
    - name: description
      dtype: string
    - name: entities_in_utterance
      list: string
    - name: relations
      list: string
    - name: type_list
      list: string
    - name: speaker
      dtype: string
    - name: utterance
      dtype: string
    - name: all_entities
      list: string
    - name: active_set
      list: string
    - name: sec_ques_sub_type
      dtype: int64
    - name: sec_ques_type
      dtype: int64
    - name: set_op_choice
      dtype: int64
    - name: is_inc
      dtype: int64
    - name: count_ques_sub_type
      dtype: int64
    - name: count_ques_type
      dtype: int64
    - name: is_incomplete
      dtype: int64
    - name: inc_ques_type
      dtype: int64
    - name: set_op
      dtype: int64
    - name: bool_ques_type
      dtype: int64
    - name: entities
      list: string
    - name: clarification_step
      dtype: int64
    - name: gold_actions
      list:
        list: string
    - name: is_spurious
      dtype: bool
    - name: masked_verbalized_answer
      dtype: string
    - name: parsed_active_set
      list: string
    - name: sparql_query
      dtype: string
    - name: verbalized_all_entities
      list: string
    - name: verbalized_answer
      dtype: string
    - name: verbalized_entities_in_utterance
      list: string
    - name: verbalized_gold_actions
      list:
        list: string
    - name: verbalized_parsed_active_set
      list: string
    - name: verbalized_sparql_query
      dtype: string
    - name: verbalized_triple
      dtype: string
    - name: verbalized_type_list
      list: string
  splits:
  - name: train
    num_bytes: 6815016095
    num_examples: 152391
  - name: test
    num_bytes: 1007873839
    num_examples: 27797
  - name: validation
    num_bytes: 692344634
    num_examples: 16813
  download_size: 2406342185
  dataset_size: 8515234568
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
  - split: validation
    path: data/validation-*
task_categories:
- conversational
- question-answering
tags:
- qa
- knowledge-graph
- sparql
- multi-hop
language:
- en
---

# Dataset Card for CSQA-SPARQLtoText

## Table of Contents

- [Dataset Card for CSQA-SPARQLtoText](#dataset-card-for-csqa-sparqltotext)
  - [Table of Contents](#table-of-contents)
  - [Dataset Description](#dataset-description)
    - [Dataset Summary](#dataset-summary)
    - [Supported tasks](#supported-tasks)
      - [Knowledge based question-answering](#knowledge-based-question-answering)
      - [SPARQL queries and natural language questions](#sparql-queries-and-natural-language-questions)
    - [Languages](#languages)
  - [Dataset Structure](#dataset-structure)
    - [Types of questions](#types-of-questions)
    - [Data splits](#data-splits)
    - [JSON fields](#json-fields)
      - [Original fields](#original-fields)
      - [New fields](#new-fields)
      - [Verbalized fields](#verbalized-fields)
    - [Format of the SPARQL queries](#format-of-the-sparql-queries)
  - [Additional Information](#additional-information)
    - [Licensing Information](#licensing-information)
    - [Citation Information](#citation-information)
      - [This version of the corpus (with SPARQL queries)](#this-version-of-the-corpus-with-sparql-queries)
      - [Original corpus (CSQA)](#original-corpus-csqa)
      - [CARTON](#carton)

## Dataset Description

- **Paper:** [SPARQL-to-Text Question Generation for Knowledge-Based Conversational Applications (AACL-IJCNLP 2022)](https://aclanthology.org/2022.aacl-main.11/)
- **Point of Contact:** Gwénolé Lecorvé

### Dataset Summary

The CSQA corpus (Complex Sequential Question-Answering, see https://amritasaha1812.github.io/CSQA/) is a large corpus for conversational knowledge-based question answering. The version here is augmented with various fields to make it easier to run specific tasks, especially SPARQL-to-text conversion.

The original data has been post-processed as follows:

1. Verbalization templates were applied to the answers, and their entities were verbalized (replaced by their label in Wikidata)
2. Questions were parsed using the CARTON algorithm to produce a sequence of actions in a specific grammar
3. Sequences of actions were mapped to SPARQL queries, and entities were verbalized (replaced by their label in Wikidata)

### Supported tasks

- Knowledge-based question-answering
- Text-to-SPARQL conversion

#### Knowledge based question-answering

Below is an example of dialogue:

- Q1: Which occupation is the profession of Edmond Yernaux ?
- A1: politician
- Q2: Which collectable has that occupation as its principal topic ?
- A2: Notitia Parliamentaria, An History of the Counties, etc.

#### SPARQL queries and natural language questions

```SQL
SELECT DISTINCT ?x WHERE { ?x rdf:type ontology:occupation . resource:Edmond_Yernaux property:occupation ?x }
```

is equivalent to:

```txt
Which occupation is the profession of Edmond Yernaux ?
```

### Languages

- English

## Dataset Structure

The corpus follows the global architecture of the original version of CSQA (https://amritasaha1812.github.io/CSQA/).

There is one directory for each of the train, dev, and test sets. Dialogues are stored in separate directories, 100 dialogues per directory. Finally, each dialogue is stored in a JSON file as a list of turns.
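As a rough illustration of the list-of-turns layout, the sketch below pairs each `USER` question with the `SYSTEM` turn that follows it. The sample record is hypothetical: it is built by hand from the example dialogue in this card, fills in only the `speaker` and `utterance` fields, and uses a made-up `id`. Real records carry many more fields, and dialogues containing clarification turns may not alternate this simply.

```python
from typing import Any, Dict, Iterator, Tuple

# Hypothetical dialogue assembled from the example in the card.
# Only `speaker` and `utterance` are filled in; the `id` value is made up.
SAMPLE_DIALOGUE: Dict[str, Any] = {
    "id": "example-dialogue",
    "turns": [
        {
            "speaker": "USER",
            "utterance": "Which occupation is the profession of Edmond Yernaux ?",
        },
        {"speaker": "SYSTEM", "utterance": "politician"},
        {
            "speaker": "USER",
            "utterance": "Which collectable has that occupation as its principal topic ?",
        },
        {
            "speaker": "SYSTEM",
            "utterance": "Notitia Parliamentaria, An History of the Counties, etc.",
        },
    ],
}


def question_answer_pairs(dialogue: Dict[str, Any]) -> Iterator[Tuple[str, str]]:
    """Yield (question, answer) pairs from consecutive USER/SYSTEM turns."""
    pending_question = None
    for turn in dialogue["turns"]:
        if turn["speaker"] == "USER":
            # Remember the latest user utterance until a system turn arrives.
            pending_question = turn["utterance"]
        elif turn["speaker"] == "SYSTEM" and pending_question is not None:
            yield pending_question, turn["utterance"]
            pending_question = None


if __name__ == "__main__":
    for question, answer in question_answer_pairs(SAMPLE_DIALOGUE):
        print(f"Q: {question}\nA: {answer}")
```

Keeping the pairing logic in a generator means the same function can be applied lazily to each dialogue of a large split without materializing all pairs at once.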

### Types of questions

Comparison of question types with related datasets:

| | | [SimpleQuestions](https://huggingface.co/datasets/OrangeInnov/simplequestions-sparqltotext) | [ParaQA](https://huggingface.co/datasets/OrangeInnov/paraqa-sparqltotext) | [LC-QuAD 2.0](https://huggingface.co/datasets/OrangeInnov/lcquad_2.0-sparqltotext) | [CSQA](https://huggingface.co/datasets/OrangeInnov/csqa-sparqltotext) | [WebNLG-QA](https://huggingface.co/datasets/OrangeInnov/webnlg-qa) |
|--------------------------|-----------------|:---------------:|:------:|:-----------:|:----:|:---------:|
| **Number of triplets in query** | 1 | ✓ | ✓ | ✓ | ✓ | ✓ |
| | 2 | | ✓ | ✓ | ✓ | ✓ |
| | More | | | ✓ | ✓ | ✓ |
| **Logical connector between triplets** | Conjunction | ✓ | ✓ | ✓ | ✓ | ✓ |
| | Disjunction | | | | ✓ | ✓ |
| | Exclusion | | | | ✓ | ✓ |
| **Topology of the query graph** | Direct | ✓ | ✓ | ✓ | ✓ | ✓ |
| | Sibling | | ✓ | ✓ | ✓ | ✓ |
| | Chain | | ✓ | ✓ | ✓ | ✓ |
| | Mixed | | | ✓ | | ✓ |
| | Other | | ✓ | ✓ | ✓ | ✓ |
| **Variable typing in the query** | None | ✓ | ✓ | ✓ | ✓ | ✓ |
| | Target variable | | ✓ | ✓ | ✓ | ✓ |
| | Internal variable | | ✓ | ✓ | ✓ | ✓ |
| **Comparison clauses** | None | ✓ | ✓ | ✓ | ✓ | ✓ |
| | String | | | ✓ | | ✓ |
| | Number | | | ✓ | ✓ | ✓ |
| | Date | | | ✓ | | ✓ |
| **Superlative clauses** | No | ✓ | ✓ | ✓ | ✓ | ✓ |
| | Yes | | | | ✓ | |
| **Answer type** | Entity (open) | ✓ | ✓ | ✓ | ✓ | ✓ |
| | Entity (closed) | | | | ✓ | ✓ |
| | Number | | | ✓ | ✓ | ✓ |
| | Boolean | | ✓ | ✓ | ✓ | ✓ |
| **Answer cardinality** | 0 (unanswerable) | | | ✓ | | ✓ |
| | 1 | ✓ | ✓ | ✓ | ✓ | ✓ |
| | More | | ✓ | ✓ | ✓ | ✓ |
| **Number of target variables** | 0 (⇒ ASK verb) | | ✓ | ✓ | ✓ | ✓ |
| | 1 | ✓ | ✓ | ✓ | ✓ | ✓ |
| | 2 | | | ✓ | | ✓ |
| **Dialogue context** | Self-sufficient | ✓ | ✓ | ✓ | ✓ | ✓ |
| | Coreference | | | | ✓ | ✓ |
| | Ellipsis | | | | ✓ | ✓ |
| **Meaning** | Meaningful | ✓ | ✓ | ✓ | ✓ | ✓ |
| | Non-sense | | | | | ✓ |

### Data splits

Text verbalization is only available for a subset of the test set, referred to as the *challenge set*. Other samples only contain dialogues in the form of follow-up SPARQL queries.

| | Train | Validation | Test |
| --------------------- | ---------- | ---------- | ---------- |
| Questions | 1.5M | 167K | 260K |
| Dialogues | 152K | 17K | 28K |
| NL question per query | 1 |
| Characters per query | 163 (± 100) |
| Tokens per question | 10 (± 4) |

### JSON fields

Each turn of a dialogue contains the following fields:

#### Original fields

* `ques_type_id`: ID corresponding to the question utterance
* `description`: Description of the type of question
* `relations`: IDs of the predicates used in the utterance
* `entities_in_utterance`: IDs of the entities used in the question
* `speaker`: The nature of the speaker: `SYSTEM` or `USER`
* `utterance`: The utterance: either the question, clarification or response
* `active_set`: A regular expression which identifies the entity set of the answer list
* `all_entities`: List of ALL entities which constitute the answer to the question
* `question-type`: Type of question (broad types used for evaluation as given in the original authors' paper)
* `type_list`: List containing entity IDs of all entity parents used in the question

#### New fields

* `is_spurious`: introduced by CARTON
* `is_incomplete`: whether the question is self-sufficient (complete) or relies on information given by the previous turns (incomplete)
* `parsed_active_set`:
* `gold_actions`: sequence of ACTIONs as returned by CARTON
* `sparql_query`: SPARQL query

#### Verbalized fields

Fields with `verbalized` in their name are verbalized versions of other fields, i.e. IDs were replaced by actual words/labels.

### Format of the SPARQL queries

* Clauses are in random order
* Variable names are represented as random letters. The letters change from one turn to another.
* Delimiters are spaced

## Additional Information

### Licensing Information

* Content from original dataset: CC BY-SA 4.0
* New content: CC BY-SA 4.0

### Citation Information

#### This version of the corpus (with SPARQL queries)

```bibtex
@inproceedings{lecorve2022sparql2text,
  title={SPARQL-to-Text Question Generation for Knowledge-Based Conversational Applications},
  author={Lecorv\'e, Gw\'enol\'e and Veyret, Morgan and Brabant, Quentin and Rojas-Barahona, Lina M.},
  journal={Proceedings of the Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the International Joint Conference on Natural Language Processing (AACL-IJCNLP)},
  year={2022}
}
```

#### Original corpus (CSQA)

```bibtex
@InProceedings{saha2018complex,
  title = {Complex {Sequential} {Question} {Answering}: {Towards} {Learning} to {Converse} {Over} {Linked} {Question} {Answer} {Pairs} with a {Knowledge} {Graph}},
  volume = {32},
  issn = {2374-3468},
  url = {https://ojs.aaai.org/index.php/AAAI/article/view/11332},
  booktitle = {Proceedings of the AAAI Conference on Artificial Intelligence},
  author = {Saha, Amrita and Pahuja, Vardaan and Khapra, Mitesh and Sankaranarayanan, Karthik and Chandar, Sarath},
  month = apr,
  year = {2018}
}
```

#### CARTON

```bibtex
@InProceedings{plepi2021context,
  author="Plepi, Joan and Kacupaj, Endri and Singh, Kuldeep and Thakkar, Harsh and Lehmann, Jens",
  editor="Verborgh, Ruben and Hose, Katja and Paulheim, Heiko and Champin, Pierre-Antoine and Maleshkova, Maria and Corcho, Oscar and Ristoski, Petar and Alam, Mehwish",
  title="Context Transformer with Stacked Pointer Networks for Conversational Question Answering over Knowledge Graphs",
  booktitle="Proceedings of The Semantic Web",
  year="2021",
  publisher="Springer International Publishing",
  pages="356--371",
  isbn="978-3-030-77385-4"
}
```
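The conventions listed under "Format of the SPARQL queries" (spaced delimiters in particular) make simple whitespace tokenization workable for flat queries. The sketch below is an assumption-laden illustration, not the authors' tooling: `triple_patterns` is a hypothetical helper that handles only a single brace-delimited body made of three-token patterns separated by ` . `, and it would mis-split literals containing spaces, `FILTER` clauses, or nested groups. The example query is the one quoted in this card.

```python
from typing import List, Set, Tuple

# Query quoted in the card's "SPARQL queries and natural language questions" section.
EXAMPLE_QUERY = (
    "SELECT DISTINCT ?x WHERE { ?x rdf:type ontology:occupation . "
    "resource:Edmond_Yernaux property:occupation ?x }"
)


def triple_patterns(query: str) -> List[Tuple[str, ...]]:
    """Split the body between the outermost braces into 3-token patterns."""
    body = query[query.index("{") + 1 : query.rindex("}")]
    triples: List[Tuple[str, ...]] = []
    current: List[str] = []
    for token in body.split():
        if token == ".":
            # A standalone dot closes the current pattern.
            if len(current) == 3:
                triples.append(tuple(current))
            current = []
        else:
            current.append(token)
    if len(current) == 3:  # the last pattern may have no trailing dot
        triples.append(tuple(current))
    return triples


def variables(triples: List[Tuple[str, ...]]) -> Set[str]:
    """Collect every token that starts with '?' across all patterns."""
    return {term for triple in triples for term in triple if term.startswith("?")}


if __name__ == "__main__":
    patterns = triple_patterns(EXAMPLE_QUERY)
    print(len(patterns), "triple patterns, variables:", sorted(variables(patterns)))
```

Because clause order and variable letters are random, comparing two queries by raw string equality is unreliable; comparing sets of patterns after such a split is a more forgiving starting point, though it still does not account for variable renaming.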

Provider:

Orange

Summary of original information

Dataset overview

Dataset information

Features

id: string
turns: list whose items contain the following fields:
- id: integer
- ques_type_id: integer
- question-type: string
- description: string
- entities_in_utterance: list of strings
- relations: list of strings
- type_list: list of strings
- speaker: string
- utterance: string
- all_entities: list of strings
- active_set: list of strings
- sec_ques_sub_type: integer
- sec_ques_type: integer
- set_op_choice: integer
- is_inc: integer
- count_ques_sub_type: integer
- count_ques_type: integer
- is_incomplete: integer
- inc_ques_type: integer
- set_op: integer
- bool_ques_type: integer
- entities: list of strings
- clarification_step: integer
- gold_actions: list of lists of strings
- is_spurious: boolean
- masked_verbalized_answer: string
- parsed_active_set: list of strings
- sparql_query: string
- verbalized_all_entities: list of strings
- verbalized_answer: string
- verbalized_entities_in_utterance: list of strings
- verbalized_gold_actions: list of lists of strings
- verbalized_parsed_active_set: list of strings
- verbalized_sparql_query: string
- verbalized_triple: string
- verbalized_type_list: list of strings
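
A quick sanity check against the field list above can catch schema drift before a training run. The sketch below covers only a subset of the per-turn fields, and the mapping from the listed types to Python types (integer to `int`, string to `str`, list to `list`, boolean to `bool`) is an assumption of this example. `check_turn` is a hypothetical helper; it would also flag `None` values, which real records may legitimately contain.

```python
from typing import Any, Dict, List

# Expected Python types for a subset of the per-turn fields.
# The dtype-to-Python mapping is an assumption of this sketch.
EXPECTED_TURN_TYPES: Dict[str, type] = {
    "id": int,
    "ques_type_id": int,
    "question-type": str,
    "speaker": str,
    "utterance": str,
    "entities_in_utterance": list,
    "relations": list,
    "is_spurious": bool,
    "sparql_query": str,
    "verbalized_sparql_query": str,
}


def check_turn(turn: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; empty if the turn looks fine."""
    problems: List[str] = []
    for field, expected in EXPECTED_TURN_TYPES.items():
        if field not in turn:
            problems.append(f"missing field: {field}")
        elif not isinstance(turn[field], expected):
            actual = type(turn[field]).__name__
            problems.append(f"{field}: expected {expected.__name__}, got {actual}")
    return problems
```

Returning a list of messages rather than raising on the first mismatch makes it easier to report every problem in a record at once.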

Data splits

train: 152391 examples, 6815016095 bytes
test: 27797 examples, 1007873839 bytes
validation: 16813 examples, 692344634 bytes
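
The split figures above can be cross-checked with a few lines of arithmetic. The sketch below sums the per-split example and byte counts and derives each split's share of the examples; the variable and function names are illustrative only. Each example appears to correspond to one dialogue, which matches the 152K / 17K / 28K dialogue counts in the card's own split table.

```python
from typing import Dict

# Per-split figures copied from the listing above.
SPLIT_EXAMPLES: Dict[str, int] = {"train": 152391, "test": 27797, "validation": 16813}
SPLIT_BYTES: Dict[str, int] = {
    "train": 6815016095,
    "test": 1007873839,
    "validation": 692344634,
}

TOTAL_EXAMPLES = sum(SPLIT_EXAMPLES.values())
TOTAL_BYTES = sum(SPLIT_BYTES.values())


def split_shares(counts: Dict[str, int]) -> Dict[str, float]:
    """Return each split's fraction of the total example count."""
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


if __name__ == "__main__":
    print("total examples:", TOTAL_EXAMPLES)
    print("total bytes:", TOTAL_BYTES)
    for name, share in split_shares(SPLIT_EXAMPLES).items():
        print(f"{name}: {share:.1%}")
```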

Dataset size

Download size: 2406342185 bytes
Dataset size: 8515234568 bytes

Configuration

default: data file paths are as follows:
- train: data/train-*
- test: data/test-*
- validation: data/validation-*
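
Given the `default` configuration and the three split names above, one plausible way to read the data is the Hugging Face `datasets` library. The sketch below is an assumption, not an instruction from the dataset authors: it takes the repository ID from this page's title, wraps `datasets.load_dataset` in a hypothetical helper `load_csqa_split`, and defaults to streaming so the roughly 2.4 GB download is not fetched up front. It needs network access and `pip install datasets` to actually return data.

```python
from typing import Tuple

# Repository ID as shown in this page's title; treat it as an assumption if the
# dataset has since moved (the card's own links use the OrangeInnov namespace).
REPO_ID = "Orange/csqa-sparqltotext"
VALID_SPLITS: Tuple[str, ...] = ("train", "test", "validation")


def load_csqa_split(split: str, streaming: bool = True):
    """Load one split of the dataset, streaming it by default."""
    if split not in VALID_SPLITS:
        raise ValueError(f"unknown split {split!r}; expected one of {VALID_SPLITS}")
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, split=split, streaming=streaming)
```

With streaming enabled, `next(iter(load_csqa_split("validation")))` would yield a single dialogue record without downloading the whole split.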

Task categories

Conversational
Question answering

Language

English
