google-research-datasets/wiki_split
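Dataset Overview
Dataset Summary
The WikiSplit dataset contains one million English sentences, each split into two sentences that together preserve the meaning of the original. The dataset was extracted automatically from Wikipedia's revision history. Although it contains some inherent noise, it is valuable training data for models that split or merge sentences.
Supported Tasks and Leaderboards
- Sentence splitting and rephrasing
Languages
- English
Dataset Structure
Data Instances
An example from the training set looks like this: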
```json
{
  "complex_sentence": " As she translates from one language to another , she tries to find the appropriate wording and context in English that would correspond to the work in Spanish her poems and stories started to have differing meanings in their respective languages .",
  "simple_sentence_1": " As she translates from one language to another , she tries to find the appropriate wording and context in English that would correspond to the work in Spanish . ",
  "simple_sentence_2": " Ergo , her poems and stories started to have differing meanings in their respective languages ."
}
```
Data Fields
The data fields are the same across all splits:
- `complex_sentence`: a `string`.
- `simple_sentence_1`: a `string`.
- `simple_sentence_2`: a `string`.
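Because each record pairs one complex sentence with two simpler ones, the same three fields can feed either a splitting model or a merging model. The sketch below shows one possible way to turn a record into (source, target) text pairs. The helper names `to_split_pair` and `to_merge_pair`, the whitespace normalization, and joining the two simple sentences with a single space are illustrative assumptions, not something the dataset prescribes. The sample record is hypothetical and only borrows the dataset's field names.

```python
from typing import Dict, Tuple


def _clean(text: str) -> str:
    """Collapse runs of whitespace and trim both ends."""
    return " ".join(text.split())


def to_split_pair(record: Dict[str, str]) -> Tuple[str, str]:
    """Splitting direction: complex sentence in, two simple sentences out."""
    source = _clean(record["complex_sentence"])
    # Assumption: the two simple sentences are joined with one space.
    target = (
        _clean(record["simple_sentence_1"])
        + " "
        + _clean(record["simple_sentence_2"])
    )
    return source, target


def to_merge_pair(record: Dict[str, str]) -> Tuple[str, str]:
    """Merging direction: the same pair with source and target swapped."""
    source, target = to_split_pair(record)
    return target, source


# Hypothetical record (not taken from the dataset) with the same fields.
record = {
    "complex_sentence": " The bridge opened in 1932 and it was rebuilt in 1990 . ",
    "simple_sentence_1": " The bridge opened in 1932 . ",
    "simple_sentence_2": " It was rebuilt in 1990 . ",
}

split_source, split_target = to_split_pair(record)
print(split_source)
print(split_target)
```

Swapping the pair rather than duplicating the cleaning logic keeps both directions consistent, so a change to the normalization affects splitting and merging alike.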
Data Splits
| Name | Train | Validation | Test |
|---|---|---|---|
| default | 989944 | 5000 | 5000 |
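As a quick sanity check, the split sizes in the table can be summed and expressed as shares of the whole. This is a small illustrative calculation using only the counts listed above. The variable names are arbitrary. The total comes to 999,944 rows, which matches the "one million" figure in the summary once rounded.

```python
# Row counts copied from the split table above.
split_sizes = {"train": 989944, "validation": 5000, "test": 5000}

total = sum(split_sizes.values())
shares = {name: count / total for name, count in split_sizes.items()}

for name, count in split_sizes.items():
    print(f"{name}: {count} rows ({shares[name]:.2%} of the dataset)")
print(f"total: {total} rows")
```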
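Dataset Creation
Source Data
The dataset was extracted automatically from Wikipedia's revision history.
Licensing Information
The WikiSplit dataset is derived from Wikipedia content and is therefore licensed under CC BY-SA 4.0.
Citation Information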
```plaintext
@inproceedings{botha-etal-2018-learning,
    title = "Learning To Split and Rephrase From {W}ikipedia Edit History",
    author = "Botha, Jan A. and Faruqui, Manaal and Alex, John and Baldridge, Jason and Das, Dipanjan",
    booktitle = "Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing",
    month = oct # "-" # nov,
    year = "2018",
    address = "Brussels, Belgium",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/D18-1080",
    doi = "10.18653/v1/D18-1080",
    pages = "732--737",
}
```




