
Orange/KGConv

Hugging Face | Updated 2024-04-09 | Indexed 2024-06-11
Download link:
https://hf-mirror.com/datasets/Orange/KGConv
Resource description:
---
configs:
- config_name: labels
  data_files: data/labels.json
- config_name: templates
  data_files: data/templates.json
- config_name: conversations.country
  data_files:
  - path: data/country/test.json
    split: test
  - path: data/country/dev.json
    split: dev
  - path: data/country/train.json
    split: train
- config_name: conversations.historical_event
  data_files:
  - path: data/historical_event/test.json
    split: test
  - path: data/historical_event/dev.json
    split: dev
  - path: data/historical_event/train.json
    split: train
- config_name: conversations.food
  data_files:
  - path: data/food/test.json
    split: test
  - path: data/food/dev.json
    split: dev
  - path: data/food/train.json
    split: train
- config_name: conversations.space_object
  data_files:
  - path: data/space_object/test.json
    split: test
- config_name: conversations.with_unseen_properties
  data_files:
  - path: data/with_unseen_properties/test.json
    split: test
- config_name: conversations.taxon
  data_files:
  - path: data/taxon/test.json
    split: test
- config_name: conversations.person
  data_files:
  - path: data/person/test.json
    split: test
  - path: data/person/dev.json
    split: dev
  - path: data/person/train.json
    split: train
- config_name: conversations.ideology
  data_files:
  - path: data/ideology/test.json
    split: test
  - path: data/ideology/dev.json
    split: dev
  - path: data/ideology/train.json
    split: train
- config_name: conversations.molecular_entity
  data_files:
  - path: data/molecular_entity/test.json
    split: test
  - path: data/molecular_entity/dev.json
    split: dev
  - path: data/molecular_entity/train.json
    split: train
---

# KGConv, a Conversational Corpus grounded in Wikidata

## Table of Contents

- [KGConv, a Conversational Corpus grounded in Wikidata](#kgconv-a-conversational-corpus-grounded-in-wikidata)
  - [Table of Contents](#table-of-contents)
  - [Dataset Description](#dataset-description)
    - [Dataset Summary](#dataset-summary)
    - [Languages](#languages)
  - [Dataset Structure](#dataset-structure)
    - [Data Instances](#data-instances)
    - [Data Fields](#data-fields)
  - [Additional Information](#additional-information)
    - [Licensing Information](#licensing-information)
    - [Citation Information](#citation-information)

## Dataset Description

- **Repository:** [https://github.com/Orange-OpenSource/KGConv/](https://github.com/Orange-OpenSource/KGConv/)
- **Paper:** [https://arxiv.org/abs/2308.15298](https://arxiv.org/abs/2308.15298)
- **Point of Contact:** <quentin.brabant@orange.com>, <gwenole.lecorve@orange.com>, <linamaria.rojasbarahona@orange.com>, <claire.gardent@loria.fr>

### Dataset Summary

KGConv is a large corpus of 71k English conversations in which each question-answer pair is grounded in a Wikidata fact. The conversations were generated automatically: questions were first created from a collection of 10,355 templates; the naturalness of the conversations was then improved by inserting ellipses and coreference into the questions, using both handcrafted rules and a generative rewriting model. The dataset thus provides several variants of each question (12 on average), organized into 3 levels of conversationality. KGConv can also be used for other generation and analysis tasks, such as single-turn question generation from Wikidata triples, question rewriting, question answering from conversation or from knowledge graphs, and quiz generation.

### Languages

English.

## Dataset Structure

The dataset has three components:

- **conversation configs**, divided into several themes that correspond to configs of the form `conversations.theme`, where `theme` is replaced by one of the following: country, food, historical_event, ideology, molecular_entity, person, space_object, taxon, with_unseen_properties;
- **labels**, a config that contains labels for all entities and properties involved in the conversations;
- **templates**, a config that contains the templates that were used for generating questions.

### Data Instances

Instances from the configs named `conversations.theme` (e.g. `conversations.country`) have the following form:

```
{
  "conversation_id": "69795",
  "root_neighbourhood": [
    [ "Q6138903", "P106", "Q82955" ],
    [ "Q6138903", "P19", "Q3408680" ],
    ...
  ],
  "conversation": [
    {
      "triple": [ "Q691", "P30", "Q538" ],
      "question variants": [
        {
          "out-of-context": "In which continent is Papua New Guinea located?",
          "in-context": "In which continent is Papua New Guinea located?",
          "in-context subject ref": "Papua New Guinea",
          "synthetic-in-context": "In which continent is Papua New Guinea located?"
        },
        {
          "out-of-context": "In what continent is Papua New Guinea in?",
          "in-context": "In what continent is Papua New Guinea in?",
          "in-context subject ref": "Papua New Guinea",
          "synthetic-in-context": "In what continent is Papua New Guinea in?"
        },
        ...
      ],
      "answer": "Oceania"
    },
    {
      "triple": [ "Q691", "P38", "Q200759" ],
      "question variants": [
        {
          "out-of-context": "What is accepted as the currency of Papua New Guinea?",
          "in-context": "What is accepted as the currency of Papua New Guinea?",
          "in-context subject ref": "Papua New Guinea",
          "synthetic-in-context": "What is accepted as the currency?"
        },
        {
          "out-of-context": "What is the currency of Papua New Guinea?",
          "in-context": "What is the currency of Papua New Guinea?",
          "in-context subject ref": "Papua New Guinea",
          "synthetic-in-context": "What is the currency?"
        },
        ...
      ],
      "answer": "kina"
    },
    ...
  ]
}
```

Instances from the `labels` config have the following form:

```
{
  "entity": "Q39",
  "labels": [
    "Swiss Confederation",
    "CHE",
    "Confoederatio Helvetica",
    "Swiss",
    "Schweiz",
    "SUI",
    "Switzerland",
    "CH",
    "Suisse",
    "Svizzera"
  ],
  "preferred_label": "Switzerland"
}
```

Instances from the `templates` config have the following form:

```
{
  "template_key": {
    "p": "P1201",
    "s_types": [ "Q149918" ],
    "o_types": []
  },
  "templates": [
    {
      "left": "what is the space tug of ",
      "right": "?",
      "source": "interface:automatic labeler"
    },
    {
      "left": "what was the space tug of ",
      "right": "?",
      "source": "interface:624dc1cd4432b5035ba082df"
    },
    ...
  ]
}
```

### Data Fields

The fields of the configs named `conversations.theme` (e.g. `conversations.country`) are the following:

- `conversation_id`: string
- `root_neighbourhood`: list of triples (each triple is itself represented by a list of 3 string elements) that constitute the neighbourhood of the conversation root entity in the knowledge graph (see the LREC publication for more details)
- `conversation`: list of dicts; each dict represents one question-answer pair and has the following fields:
  - `triple`: triple on which the question is based (list of three string elements)
  - `question variants`: list of dicts; each dict contains several forms of a question obtained via a given template (see the LREC publication for more details)
    - `out-of-context`: one form of the question variant
    - `in-context`: another form of the question variant
    - `in-context subject ref`: how the subject is referred to in the in-context form
    - `synthetic-in-context`: yet another form of the question variant
  - `answer`: answer to the question (string)

The fields of the `labels` config are the following:

- `entity`: string, id of the entity
- `labels`: list of strings
- `preferred_label`: string

The fields of the `templates` config are the following:

- `template_key`: a dict containing the conditions for using the templates listed in `templates`, with the following fields:
  - `p`: id of the property
  - `s_types`: required types for the subject
  - `o_types`: required types for the object
- `templates`: list of dicts representing templates; each dict has the following fields:
  - `left`: left part of the template
  - `right`: right part of the template
  - `source`: origin of the template (string)

## Additional Information

### Licensing Information

This software is distributed under the Creative Commons Attribution 4.0 International license, the text of which is available at https://spdx.org/licenses/CC-BY-4.0.html; see the "license.txt" file for more details.

### Citation Information

```
@article{brabant2023kgconv,
  title={KGConv, a Conversational Corpus grounded in Wikidata},
  author={Quentin Brabant and Gwenole Lecorve and Lina M. Rojas-Barahona and Claire Gardent},
  year={2023},
  eprint={2308.15298},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```
Provider:
Orange
Summary of original information

KGConv, a Conversational Corpus grounded in Wikidata

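Dataset overview

KGConv is a large corpus of 71k English conversations in which each question-answer pair is grounded in a Wikidata fact. The conversations were generated automatically: questions were created from 10,355 templates, and the conversations were then made more natural by inserting ellipses and coreference into the questions, using both handcrafted rules and a generative rewriting model. The dataset provides several variants of each question (12 on average), organized into three levels of conversationality. KGConv can also be used for other generation and analysis tasks, such as single-turn question generation from Wikidata triples, question rewriting, question answering from conversation or from knowledge graphs, and quiz generation.

Language

English.

Dataset structure

The dataset has three components:

  • Conversation configs, divided into several themes that correspond to configs of the form conversations.theme, where theme is one of country, food, historical_event, ideology, molecular_entity, person, space_object, taxon, with_unseen_properties.
  • Labels, a config containing the labels of all entities and properties involved in the conversations.
  • Templates, a config containing the templates used to generate questions.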
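
The three components above map onto config names that can be passed to the Hugging Face `datasets` library. The following is a minimal sketch under stated assumptions: the `SPLITS` table is transcribed from the `configs` front matter of the dataset card, while the helper names `config_name` and `load_theme` are hypothetical and the download itself needs network access and an installed `datasets` package.

```python
# Splits available per theme, transcribed from the `configs` front matter.
SPLITS = {
    "country": ("train", "dev", "test"),
    "food": ("train", "dev", "test"),
    "historical_event": ("train", "dev", "test"),
    "ideology": ("train", "dev", "test"),
    "molecular_entity": ("train", "dev", "test"),
    "person": ("train", "dev", "test"),
    "space_object": ("test",),
    "taxon": ("test",),
    "with_unseen_properties": ("test",),
}


def config_name(theme: str) -> str:
    """Return the config name for a theme, e.g. 'conversations.country'."""
    if theme not in SPLITS:
        raise ValueError(f"unknown theme: {theme!r}")
    return f"conversations.{theme}"


def load_theme(theme: str, split: str = "test"):
    """Hypothetical helper: load one split of one conversation config."""
    name = config_name(theme)
    if split not in SPLITS[theme]:
        raise ValueError(f"theme {theme!r} has no {split!r} split")
    # Imported lazily so the helpers above work without the library installed.
    from datasets import load_dataset

    return load_dataset("Orange/KGConv", name, split=split)
```

A call such as `load_theme("country", "dev")` would then fetch the dev split of `conversations.country`; three themes (space_object, taxon, with_unseen_properties) only ship a test split, which is why the helper checks the split name before downloading.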

Data instances

Instances from the conversations.theme configs have the following form:

    {
      "conversation_id": "69795",
      "root_neighbourhood": [
        [ "Q6138903", "P106", "Q82955" ],
        [ "Q6138903", "P19", "Q3408680" ],
        ...
      ],
      "conversation": [
        {
          "triple": [ "Q691", "P30", "Q538" ],
          "question variants": [
            {
              "out-of-context": "In which continent is Papua New Guinea located?",
              "in-context": "In which continent is Papua New Guinea located?",
              "in-context subject ref": "Papua New Guinea",
              "synthetic-in-context": "In which continent is Papua New Guinea located?"
            },
            {
              "out-of-context": "In what continent is Papua New Guinea in?",
              "in-context": "In what continent is Papua New Guinea in?",
              "in-context subject ref": "Papua New Guinea",
              "synthetic-in-context": "In what continent is Papua New Guinea in?"
            },
            ...
          ],
          "answer": "Oceania"
        },
        {
          "triple": [ "Q691", "P38", "Q200759" ],
          "question variants": [
            {
              "out-of-context": "What is accepted as the currency of Papua New Guinea?",
              "in-context": "What is accepted as the currency of Papua New Guinea?",
              "in-context subject ref": "Papua New Guinea",
              "synthetic-in-context": "What is accepted as the currency?"
            },
            {
              "out-of-context": "What is the currency of Papua New Guinea?",
              "in-context": "What is the currency of Papua New Guinea?",
              "in-context subject ref": "Papua New Guinea",
              "synthetic-in-context": "What is the currency?"
            },
            ...
          ],
          "answer": "kina"
        },
        ...
      ]
    }
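
To make the nesting of such a record concrete, here is a small sketch that walks a conversation and prints one question per turn in a chosen form. The record is trimmed from the instance above (one variant per turn); the function name `render_dialogue` is hypothetical, and taking the first variant of each turn is an arbitrary choice made for illustration.

```python
from typing import Dict, List

# Record trimmed from the instance shown above (one variant kept per turn).
record = {
    "conversation_id": "69795",
    "conversation": [
        {
            "triple": ["Q691", "P30", "Q538"],
            "question variants": [
                {
                    "out-of-context": "In which continent is Papua New Guinea located?",
                    "in-context": "In which continent is Papua New Guinea located?",
                    "in-context subject ref": "Papua New Guinea",
                    "synthetic-in-context": "In which continent is Papua New Guinea located?",
                }
            ],
            "answer": "Oceania",
        },
        {
            "triple": ["Q691", "P38", "Q200759"],
            "question variants": [
                {
                    "out-of-context": "What is the currency of Papua New Guinea?",
                    "in-context": "What is the currency of Papua New Guinea?",
                    "in-context subject ref": "Papua New Guinea",
                    "synthetic-in-context": "What is the currency?",
                }
            ],
            "answer": "kina",
        },
    ],
}


def render_dialogue(rec: Dict, form: str = "synthetic-in-context") -> List[str]:
    """Return alternating question/answer lines using the given question form."""
    lines = []
    for turn in rec["conversation"]:
        variant = turn["question variants"][0]  # arbitrary: first variant only
        lines.append(f"Q: {variant[form]}")
        lines.append(f"A: {turn['answer']}")
    return lines


dialogue = render_dialogue(record)
print("\n".join(dialogue))
```

Passing `form="out-of-context"` yields the fully specified questions instead, which makes the effect of the ellipsis in the second turn easy to compare.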

Instances from the labels config have the following form:

    {
      "entity": "Q39",
      "labels": [
        "Swiss Confederation", "CHE", "Confoederatio Helvetica", "Swiss", "Schweiz",
        "SUI", "Switzerland", "CH", "Suisse", "Svizzera"
      ],
      "preferred_label": "Switzerland"
    }
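
Records of this shape lend themselves to two lookup tables: one from an id to its preferred label, and one from any alias back to the id. The sketch below assumes nothing beyond the three fields shown; the function names are hypothetical, and the triple passed to `verbalize` is made up for illustration rather than taken from the dataset.

```python
from typing import Dict, Iterable, List

# Single record copied from the `labels` instance above.
label_records = [
    {
        "entity": "Q39",
        "labels": [
            "Swiss Confederation", "CHE", "Confoederatio Helvetica", "Swiss",
            "Schweiz", "SUI", "Switzerland", "CH", "Suisse", "Svizzera",
        ],
        "preferred_label": "Switzerland",
    },
]


def build_label_index(records: Iterable[dict]) -> Dict[str, str]:
    """Map each id to its preferred label."""
    return {r["entity"]: r["preferred_label"] for r in records}


def build_alias_index(records: Iterable[dict]) -> Dict[str, str]:
    """Map each lowercased alias to an id (first record wins on clashes)."""
    alias_to_entity: Dict[str, str] = {}
    for r in records:
        for label in r["labels"]:
            alias_to_entity.setdefault(label.lower(), r["entity"])
    return alias_to_entity


def verbalize(triple: List[str], index: Dict[str, str]) -> List[str]:
    """Replace known ids by their preferred label; leave unknown ids as-is."""
    return [index.get(item, item) for item in triple]


index = build_label_index(label_records)
aliases = build_alias_index(label_records)
# Illustrative triple, not taken from the dataset.
print(verbalize(["Q39", "P30", "Q46"], index))
```

Falling back to the raw id for anything missing from the index keeps the function usable on a partially loaded label table.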

Instances from the templates config have the following form:

    {
      "template_key": {
        "p": "P1201",
        "s_types": [ "Q149918" ],
        "o_types": []
      },
      "templates": [
        {
          "left": "what is the space tug of ",
          "right": "?",
          "source": "interface:automatic labeler"
        },
        {
          "left": "what was the space tug of ",
          "right": "?",
          "source": "interface:624dc1cd4432b5035ba082df"
        },
        ...
      ]
    }
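
One plausible reading of the `left` and `right` fields is that a question is formed by placing the subject's label between them. That reading is an assumption (the card only calls them the left and right parts of the template), and both `fill_template` and the subject label `ExampleSat-1` are hypothetical. A minimal sketch:

```python
from typing import Dict, List

# Record copied from the `templates` instance above (elided entries omitted).
template_record = {
    "template_key": {"p": "P1201", "s_types": ["Q149918"], "o_types": []},
    "templates": [
        {
            "left": "what is the space tug of ",
            "right": "?",
            "source": "interface:automatic labeler",
        },
        {
            "left": "what was the space tug of ",
            "right": "?",
            "source": "interface:624dc1cd4432b5035ba082df",
        },
    ],
}


def fill_template(template: Dict[str, str], subject_label: str) -> str:
    """Assumed usage: subject label inserted between `left` and `right`."""
    return f"{template['left']}{subject_label}{template['right']}"


def all_questions(record: dict, subject_label: str) -> List[str]:
    """Instantiate every template of a record for one subject label."""
    return [fill_template(t, subject_label) for t in record["templates"]]


# "ExampleSat-1" is a made-up label used only to show the string assembly.
questions = all_questions(template_record, "ExampleSat-1")
print(questions)
```

Under this reading, each template yields one surface question per subject, which is consistent with a question having several variants.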

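Data fields

Fields of the conversations.theme configs:

  • conversation_id: string
  • root_neighbourhood: list of triples (each a list of three strings) that form the neighbourhood of the conversation's root entity in the knowledge graph
  • conversation: list of dicts, one per question-answer pair, each with the following fields:
    • triple: the triple on which the question is based (list of three strings)
    • question variants: list of dicts; each dict contains several forms of a question obtained from a given template
      • out-of-context: one form of the question variant
      • in-context: another form of the question variant
      • in-context subject ref: how the subject is referred to in the in-context form
      • synthetic-in-context: yet another form of the question variant
    • answer: answer to the question (string)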
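
Since each variant carries several forms of the same question, the fields above can be paired up for the question rewriting task mentioned in the overview. The sketch below treats `out-of-context` as the source and `synthetic-in-context` as the target; that pairing, the function name `rewriting_pairs`, and the choice to skip pairs whose two forms are identical are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Two turns trimmed from the example conversation (fields not needed here omitted).
sample = {
    "conversation": [
        {
            "question variants": [
                {
                    "out-of-context": "In which continent is Papua New Guinea located?",
                    "synthetic-in-context": "In which continent is Papua New Guinea located?",
                }
            ],
            "answer": "Oceania",
        },
        {
            "question variants": [
                {
                    "out-of-context": "What is accepted as the currency of Papua New Guinea?",
                    "synthetic-in-context": "What is accepted as the currency?",
                },
                {
                    "out-of-context": "What is the currency of Papua New Guinea?",
                    "synthetic-in-context": "What is the currency?",
                },
            ],
            "answer": "kina",
        },
    ]
}


def rewriting_pairs(rec: Dict) -> List[Tuple[str, str]]:
    """Collect (out-of-context, synthetic-in-context) pairs that actually differ."""
    pairs = []
    for turn in rec["conversation"]:
        for variant in turn["question variants"]:
            source = variant["out-of-context"]
            target = variant["synthetic-in-context"]
            if source != target:  # identical forms carry no rewriting signal
                pairs.append((source, target))
    return pairs


pairs = rewriting_pairs(sample)
print(len(pairs))
```

The first turn contributes nothing here because its two forms coincide, as is typical for the opening question of a conversation in the example shown.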

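Fields of the labels config:

  • entity: id of the entity (string)
  • labels: list of labels (list of strings)
  • preferred_label: preferred label (string)

Fields of the templates config:

  • template_key: a dict containing the conditions for using the templates listed in templates
    • p: id of the property
    • s_types: required types for the subject
    • o_types: required types for the object
  • templates: list of dicts representing templates; each dict has the following fields:
    • left: left part of the template
    • right: right part of the template
    • source: origin of the template (string)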
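
The `template_key` fields describe when a group of templates applies. As a rough sketch, one could check a key against a property and the types of its subject and object as below. The matching rule used here (same property, and every listed type must be present, with an empty list meaning no constraint) is an assumption; the card does not say whether the required types are meant conjunctively, and `key_matches` is a hypothetical name.

```python
from typing import Dict, Iterable


def key_matches(
    key: Dict,
    prop: str,
    subject_types: Iterable[str],
    object_types: Iterable[str],
) -> bool:
    """Assumed rule: property equal and all required types present."""
    return (
        key["p"] == prop
        and set(key["s_types"]) <= set(subject_types)
        and set(key["o_types"]) <= set(object_types)
    )


# Key copied from the `templates` instance shown earlier.
key = {"p": "P1201", "s_types": ["Q149918"], "o_types": []}

print(key_matches(key, "P1201", ["Q149918"], []))
print(key_matches(key, "P1201", [], []))
```

If the intended semantics turn out to be "any of the listed types", the subset tests would be replaced by non-empty intersection tests, with the empty list still treated as unconstrained.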