kelm
Source: ModelScope community. Updated 2025-07-11, indexed 2025-07-12.
Download link: https://modelscope.cn/datasets/google-research-datasets/kelm
# Dataset Card for Corpus for Knowledge-Enhanced Language Model Pre-training (KELM)
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** https://github.com/google-research-datasets/KELM-corpus
- **Repository:** https://github.com/google-research-datasets/KELM-corpus
- **Paper:** https://arxiv.org/abs/2010.12688
- **Leaderboard:**
- **Point of Contact:**
### Dataset Summary
Data-to-text generation involves converting knowledge graph (KG) triples of the form (subject, relation, object) into
one or more natural language sentences. This dataset consists of English KG data converted into paired natural language text.
The generated corpus consists of ∼18M sentences spanning ∼45M triples with ∼1500 distinct relations.
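
As a rough reading of these figures, the short calculation below estimates how many triples an average sentence covers. It uses only the approximate counts quoted above, so the result is an order-of-magnitude estimate, not a statistic reported by the authors.

```python
# Approximate corpus sizes quoted in the summary (rounded figures).
approx_triples = 45_000_000
approx_sentences = 18_000_000

# Average number of KG triples verbalized per generated sentence.
triples_per_sentence = approx_triples / approx_sentences
print(f"about {triples_per_sentence:.1f} triples per sentence")
```

An average above one is consistent with sentences that verbalize several triples at once.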
### Supported Tasks and Leaderboards
The intended task is data-to-text generation, taking in a knowledge graph tuple and generating a natural language
representation from it. Specifically, the data is in the format the authors used to train a seq2seq language model
with the tuples concatenated into a single sequence.
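
The concatenation described above can be sketched as follows. This is a minimal illustration, assuming that triples sharing a subject are flattened into one space-separated string with the subject written once; the function name `linearize` and the example triples are hypothetical and not taken from the dataset or the paper.

```python
from typing import Iterable, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)


def linearize(triples: Iterable[Triple]) -> str:
    """Flatten triples that share one subject into a single input sequence.

    Assumed layout (illustrative only):
    subject relation1 object1 relation2 object2 ...
    """
    triples = list(triples)
    if not triples:
        raise ValueError("need at least one triple")
    subject = triples[0][0]
    if any(s != subject for s, _, _ in triples):
        raise ValueError("all triples must share the same subject")
    parts = [subject]
    for _, relation, obj in triples:
        parts.extend([relation, obj])
    return " ".join(parts)


# Made-up example triples, for illustration only.
sequence = linearize([
    ("Ada Lovelace", "occupation", "mathematician"),
    ("Ada Lovelace", "place of birth", "London"),
])
print(sequence)
```

A plain space-separated layout loses the boundaries between multi-word entities and relations; a real pipeline might add explicit separator tokens, which this sketch leaves out.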
### Languages
The dataset is in English.
## Dataset Structure
### Data Instances
Each instance pairs a KG triple, or a group of triples that share a subject, with a corresponding natural language sentence.
### Data Fields
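- `triple`: Wikipedia triples of the form `<subject> <relation> <object>` where some subjects have multiple
relations, e.g. `<subject> <relation1> <object1> <relation2> <object2> <relation3> <object3>`. For more details on
how these relations are grouped, please refer to the paper.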
- `sentence`: The corresponding Wikipedia sentence.
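
The two fields map naturally onto source and target strings for a seq2seq model. The sketch below assumes each record is a plain dictionary with the `triple` and `sentence` keys listed above; the helper name `to_seq2seq_pairs` and the sample records are hypothetical.

```python
from typing import Dict, List, Tuple


def to_seq2seq_pairs(records: List[Dict[str, str]]) -> List[Tuple[str, str]]:
    """Turn records into (source, target) pairs, skipping empty entries."""
    pairs = []
    for record in records:
        source = record["triple"].strip()    # linearized KG triple(s)
        target = record["sentence"].strip()  # paired natural language text
        if source and target:
            pairs.append((source, target))
    return pairs


# Made-up records, for illustration only.
example_records = [
    {"triple": "Ada Lovelace occupation mathematician",
     "sentence": "Ada Lovelace was a mathematician."},
    {"triple": "   ", "sentence": "This record has no triple and is dropped."},
]
pairs = to_seq2seq_pairs(example_records)
print(pairs)
```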
### Data Splits
The dataset includes a pre-determined train, validation, and test split.
## Dataset Creation
### Curation Rationale
The goal of the dataset's curation and the associated modeling work discussed in the paper is to be able to generate
natural text from a knowledge graph.
### Source Data
#### Initial Data Collection and Normalization
[More Information Needed]
#### Who are the source language producers?
The data is sourced from English Wikipedia and its associated knowledge graph.
### Annotations
#### Annotation process
[More Information Needed]
#### Who are the annotators?
[More Information Needed]
### Personal and Sensitive Information
[More Information Needed]
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed]
### Discussion of Biases
From the paper:
> Wikipedia has documented ideological, gender, and racial biases in its text. While the KELM corpus may still
> contain some of these biases, certain types of biases may be reduced.
### Other Known Limitations
[More Information Needed]
## Additional Information
### Dataset Curators
[More Information Needed]
### Licensing Information
This dataset has been released under the [CC BY-SA 2.0 license](https://creativecommons.org/licenses/by-sa/2.0/).
### Citation Information
```
@misc{agarwal2020large,
title={Large Scale Knowledge Graph Based Synthetic Corpus Generation for Knowledge-Enhanced Language Model Pre-training},
author={Oshin Agarwal and Heming Ge and Siamak Shakeri and Rami Al-Rfou},
year={2020},
eprint={2010.12688},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
### Contributions
Thanks to [@joeddav](https://github.com/joeddav) for adding this dataset.
Provider: maas
Created: 2025-07-07



