google-research-datasets/kelm

Name: google-research-datasets/kelm
Creator: google-research-datasets
Published: 2024-01-18 11:07:22
License: no description provided

Source: Hugging Face. Updated 2024-01-18. Indexed 2024-06-15.

Download link:

https://hf-mirror.com/datasets/google-research-datasets/kelm


Resource description:

---
annotations_creators:
- found
language_creators:
- found
language:
- en
license:
- cc-by-sa-3.0
multilinguality:
- monolingual
size_categories:
- 1M<n<10M
source_datasets:
- original
task_categories:
- other
task_ids: []
paperswithcode_id: kelm
pretty_name: Corpus for Knowledge-Enhanced Language Model Pre-training (KELM)
tags:
- data-to-text-generation
dataset_info:
  features:
  - name: triple
    dtype: string
  - name: sentence
    dtype: string
  splits:
  - name: train
    num_bytes: 1343187306
    num_examples: 6371131
  - name: validation
    num_bytes: 167790917
    num_examples: 796471
  - name: test
    num_bytes: 167921750
    num_examples: 796493
  download_size: 1631259869
  dataset_size: 1678899973
---

# Dataset Card for Corpus for Knowledge-Enhanced Language Model Pre-training (KELM)

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://github.com/google-research-datasets/KELM-corpus
- **Repository:** https://github.com/google-research-datasets/KELM-corpus
- **Paper:** https://arxiv.org/abs/2010.12688
- **Leaderboard:**
- **Point of Contact:**

### Dataset Summary

Data-to-text generation converts knowledge graph (KG) triples of the form (subject, relation, object) into one or more natural language sentences. This dataset consists of English KG data converted into paired natural language text. The generated corpus consists of ~18M sentences spanning ~45M triples with ~1500 distinct relations.

### Supported Tasks and Leaderboards

The intended task is data-to-text generation: taking in a knowledge graph tuple and generating a natural language representation of it. Specifically, the data is in the format the authors used to train a seq2seq language model, with the tuples concatenated into a single sequence.

### Languages

The dataset is in English.

## Dataset Structure

### Data Instances

Each instance pairs a KG triple (or a group of triples that share a subject) with the corresponding natural language sentence.

### Data Fields

- `triple`: Wikipedia triples of the form `<subject> <relation> <object>`, where some subjects have multiple relations, e.g. `<subject> <relation1> <object1> <relation2> <object2> <relation3> <object3>`. For more details on how these relations are grouped, please refer to the paper.
- `sentence`: The corresponding Wikipedia sentence.

### Data Splits

The dataset includes a pre-determined train, validation, and test split.

## Dataset Creation

### Curation Rationale

The goal of the dataset's curation, and of the associated modeling work discussed in the paper, is to generate natural text from a knowledge graph.

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

The data is sourced from English Wikipedia and its associated knowledge graph.

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

From the paper:

> Wikipedia has documented ideological, gender, and racial biases in its text. While the KELM corpus may still contain some of these biases, certain types of biases may be reduced.

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

This dataset has been released under the [CC BY-SA 2.0 license](https://creativecommons.org/licenses/by-sa/2.0/).

### Citation Information

```
@misc{agarwal2020large,
  title={Large Scale Knowledge Graph Based Synthetic Corpus Generation for Knowledge-Enhanced Language Model Pre-training},
  author={Oshin Agarwal and Heming Ge and Siamak Shakeri and Rami Al-Rfou},
  year={2020},
  eprint={2010.12688},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```

### Contributions

Thanks to [@joeddav](https://github.com/joeddav) for adding this dataset.


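Provider:

google-research-datasets

Summary of original information

Dataset overview

Dataset description

Dataset summary

The dataset targets data-to-text generation: converting knowledge graph (KG) triples of the form (subject, relation, object) into natural language sentences. It contains about 18M sentences covering about 45M triples with about 1,500 distinct relations.

Supported tasks and leaderboards

The dataset is intended for data-to-text generation: the input is a knowledge graph tuple and the output is its natural language representation.

Language

The dataset is in English.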

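Dataset structure

Data instances

Each instance contains one knowledge graph triple and its corresponding natural language sentence.

Data fields

triple: Wikipedia triples of the form <subject> <relation> <object>; some subjects have multiple relations.
sentence: the corresponding Wikipedia sentence.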
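
To make the two fields concrete, here is a minimal sketch of turning one record into a (source, target) pair for a seq2seq model. Only the field names `triple` and `sentence` come from the card. The `KelmExample` type, the `to_seq2seq_pair` helper, the `"verbalize: "` prefix, and the sample record are illustrative assumptions, and real rows may be formatted differently.

```python
from typing import NamedTuple, Tuple


class KelmExample(NamedTuple):
    """One record, mirroring the two string fields listed on the card."""
    triple: str    # linearized KG triple(s) for one subject
    sentence: str  # natural language sentence paired with the triple(s)


def to_seq2seq_pair(example: KelmExample, prefix: str = "verbalize: ") -> Tuple[str, str]:
    """Return (source, target) strings for data-to-text training.

    The prefix is a hypothetical task marker; the card does not specify one.
    """
    source = prefix + " ".join(example.triple.split())  # collapse stray whitespace
    target = example.sentence.strip()
    return source, target


# Made-up record for illustration only; it is not taken from the corpus.
sample = KelmExample(
    triple="Ada Lovelace  occupation mathematician",
    sentence=" Ada Lovelace was a mathematician. ",
)
source, target = to_seq2seq_pair(sample)
print(source)
print(target)
```

Keeping the triple as an opaque string avoids guessing at how multiple relations are delimited, which the card defers to the paper.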

Data splits

The dataset includes predefined training, validation, and test sets.
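
As a quick sanity check on the split sizes, the sketch below computes each split's share of the examples from the `dataset_info` metadata reproduced earlier on this page. The numbers are copied from that metadata; the variable names are illustrative.

```python
# Example and byte counts copied from the dataset_info metadata on this page.
SPLITS = {
    "train": {"num_examples": 6_371_131, "num_bytes": 1_343_187_306},
    "validation": {"num_examples": 796_471, "num_bytes": 167_790_917},
    "test": {"num_examples": 796_493, "num_bytes": 167_921_750},
}
DATASET_SIZE = 1_678_899_973  # dataset_size from the same metadata

total_examples = sum(s["num_examples"] for s in SPLITS.values())
total_bytes = sum(s["num_bytes"] for s in SPLITS.values())

for name, info in SPLITS.items():
    share = info["num_examples"] / total_examples
    avg_bytes = info["num_bytes"] / info["num_examples"]
    print(f"{name:<10} {info['num_examples']:>9,}  {share:6.1%}  ~{avg_bytes:.0f} bytes/example")

print(f"{'total':<10} {total_examples:>9,}")
print("byte counts sum to dataset_size:", total_bytes == DATASET_SIZE)
```

Under these numbers the splits come out at roughly 80/10/10 by example count.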

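Dataset creation

Curation rationale

The goal of the dataset's creation, and of the associated modeling work, is to generate natural text from a knowledge graph.

Source data

The data comes from English Wikipedia and its associated knowledge graph.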

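Considerations for using the data

Social impact of the dataset

[More Information Needed]

Discussion of biases

Wikipedia text has documented ideological, gender, and racial biases. While the KELM corpus may still contain some of these biases, certain types of bias may be reduced.

Other known limitations

[More Information Needed]

Additional information

Dataset curators

[More Information Needed]

Licensing information

The dataset is released under the CC BY-SA 2.0 license.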

Citation information

@misc{agarwal2020large, title={Large Scale Knowledge Graph Based Synthetic Corpus Generation for Knowledge-Enhanced Language Model Pre-training}, author={Oshin Agarwal and Heming Ge and Siamak Shakeri and Rami Al-Rfou}, year={2020}, eprint={2010.12688}, archivePrefix={arXiv}, primaryClass={cs.CL} }

Contributions

Thanks to @joeddav for adding this dataset.

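Collected summary

Dataset introduction

Construction

The dataset was built to support pre-training of knowledge-enhanced language models by converting knowledge graph triples (subject, relation, object) into natural language sentences, i.e. data-to-text generation. It consists of about 18M sentences covering about 45M triples and about 1,500 distinct relations. The data is sourced from English Wikipedia and its associated knowledge graph.

Features

KELM focuses on data-to-text generation and provides a large set of triple and sentence pairs for model training. It ships with fixed training, validation, and test splits, which simplifies model training and evaluation. The paper notes that Wikipedia text has documented biases; the KELM corpus may still contain some of them, although certain types may be reduced.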

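Usage

Researchers can use KELM as training data to improve language-model performance on data-to-text generation. The dataset is provided with predefined training, validation, and test splits for model training and evaluation. Users must comply with the CC BY-SA 2.0 license and cite the associated paper when using the dataset.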
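
One way to get started is to stream a split with the Hugging Face `datasets` library and look at a few pairs. The sketch below assumes `datasets` is installed, that the Hub identifier `google-research-datasets/kelm` from this page resolves, and that no extra loader arguments are needed for your library version. The helper names `load_kelm_split`, `iter_pairs`, and `preview` are illustrative.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List, Tuple


def load_kelm_split(split: str = "validation"):
    """Stream one split from the Hugging Face Hub (needs `datasets` and network access)."""
    # Imported lazily so the helpers below can be used without the library installed.
    from datasets import load_dataset

    return load_dataset("google-research-datasets/kelm", split=split, streaming=True)


def iter_pairs(rows: Iterable[Dict[str, str]]) -> Iterator[Tuple[str, str]]:
    """Yield (triple, sentence) pairs, skipping rows where either field is empty."""
    for row in rows:
        triple, sentence = row["triple"].strip(), row["sentence"].strip()
        if triple and sentence:
            yield triple, sentence


def preview(rows: Iterable[Dict[str, str]], n: int = 3) -> List[Tuple[str, str]]:
    """Return the first n usable pairs without reading the whole split."""
    return list(islice(iter_pairs(rows), n))


# Usage (requires network access):
#   for triple, sentence in preview(load_kelm_split("validation")):
#       print(f"{triple!r} -> {sentence!r}")
```

Streaming avoids downloading the full corpus just to inspect a handful of rows; for actual training, loading the split without `streaming=True` may be more convenient.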

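Background and challenges

Background overview

In natural language processing, combining knowledge graphs with natural language has become an active research topic. The Corpus for Knowledge-Enhanced Language Model Pre-training (KELM) dataset, created by a Google research team in 2020, aims to advance research on pre-training knowledge-enhanced language models. It converts knowledge graph triples (subject, relation, object) into natural language sentences, yielding a corpus of about 18M sentences that covers about 45M triples and about 1,500 distinct relations. The resulting corpus is a useful reference for data-to-text generation tasks.

Current challenges

The main challenges in building and applying the dataset are: converting knowledge graph information into natural language accurately while keeping it both faithful and readable; handling the biases and limitations inherent in the data source (English Wikipedia) during collection and preprocessing; and dealing with personal and sensitive information while ensuring that use of the data is legal and ethical.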

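Common scenarios

Classic usage scenario

In research on converting knowledge graph data into natural language, the classic use of KELM is as a large-scale synthetic corpus for pre-training knowledge-enhanced language models. Converting knowledge graph triples (subject, relation, object) into natural language sentences is meant to help models understand and express knowledge graph content.

Derived related work

Work building on KELM is described as extending into related areas such as knowledge-enhanced machine translation and the use of knowledge graphs in dialogue systems, offering new perspectives and methods for applying knowledge graphs in natural language processing.

Recent research on the dataset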