
wiki_split

ModelScope community. Updated 2025-07-11; indexed 2025-07-12.
Download link:
https://modelscope.cn/datasets/google-research-datasets/wiki_split
Resource description:
# Dataset Card for "wiki_split"

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [https://dataset-homepage/](https://dataset-homepage/)
- **Repository:** https://github.com/google-research-datasets/wiki-split
- **Paper:** [Learning To Split and Rephrase From Wikipedia Edit History](https://arxiv.org/abs/1808.09468)
- **Point of Contact:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Size of downloaded dataset files:** 100.28 MB
- **Size of the generated dataset:** 388.40 MB
- **Total amount of disk used:** 488.68 MB

### Dataset Summary

One million English sentences, each split into two sentences that together preserve the original meaning, extracted from Wikipedia.

Google's WikiSplit dataset was constructed automatically from the publicly available Wikipedia revision history. Although the dataset contains some inherent noise, it can serve as valuable training data for models that split or merge sentences.

### Supported Tasks and Leaderboards

- Split and Rephrase

### Languages

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Dataset Structure

### Data Instances

#### default

- **Size of downloaded dataset files:** 100.28 MB
- **Size of the generated dataset:** 388.40 MB
- **Total amount of disk used:** 488.68 MB

An example of 'train' looks as follows.

```
{
    "complex_sentence": " '' As she translates from one language to another , she tries to find the appropriate wording and context in English that would correspond to the work in Spanish her poems and stories started to have differing meanings in their respective languages .",
    "simple_sentence_1": "' '' As she translates from one language to another , she tries to find the appropriate wording and context in English that would correspond to the work in Spanish . ",
    "simple_sentence_2": " Ergo , her poems and stories started to have differing meanings in their respective languages ."
}
```

### Data Fields

The data fields are the same among all splits.

#### default

- `complex_sentence`: a `string` feature.
- `simple_sentence_1`: a `string` feature.
- `simple_sentence_2`: a `string` feature.

### Data Splits

| name    |  train | validation | test |
|---------|-------:|-----------:|-----:|
| default | 989944 |       5000 | 5000 |

## Dataset Creation

### Curation Rationale

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the source language producers?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Annotations

#### Annotation process

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the annotators?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Personal and Sensitive Information

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Discussion of Biases

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Other Known Limitations

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Additional Information

### Dataset Curators

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Licensing Information

The WikiSplit dataset is a verbatim copy of certain content from the publicly available Wikipedia revision history. The dataset is therefore licensed under [CC BY-SA 4.0](http://creativecommons.org/licenses/by-sa/4.0/). Any third party content or data is provided "As Is" without any warranty, express or implied.

### Citation Information

```
@inproceedings{botha-etal-2018-learning,
    title = "Learning To Split and Rephrase From {W}ikipedia Edit History",
    author = "Botha, Jan A. and Faruqui, Manaal and Alex, John and Baldridge, Jason and Das, Dipanjan",
    booktitle = "Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing",
    month = oct # "-" # nov,
    year = "2018",
    address = "Brussels, Belgium",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/D18-1080",
    doi = "10.18653/v1/D18-1080",
    pages = "732--737",
}
```

### Contributions

Thanks to [@thomwolf](https://github.com/thomwolf), [@patrickvonplaten](https://github.com/patrickvonplaten), [@albertvillanova](https://github.com/albertvillanova), [@lewtun](https://github.com/lewtun) for adding this dataset.
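
### Usage Sketch

The card says the data can train models that split or merge sentences, and the sample record shows fields padded with stray spaces. The sketch below shows one possible way to turn a record with the three documented fields (`complex_sentence`, `simple_sentence_1`, `simple_sentence_2`) into source/target pairs for either direction. The helper names, the `SEP` marker, and the sample record are assumptions made for illustration; they are not part of the dataset or of any library. Only the split sizes are taken from the card.

```python
from typing import Dict, Tuple

# Split sizes as listed in the Data Splits table of the card.
SPLIT_SIZES = {"train": 989_944, "validation": 5_000, "test": 5_000}
TOTAL_EXAMPLES = sum(SPLIT_SIZES.values())

# Hypothetical marker placed between the two simple sentences.
# The card does not prescribe one; any string that does not occur
# in the sentences themselves would work.
SEP = " <::::> "


def normalize(text: str) -> str:
    """Collapse runs of whitespace and trim both ends."""
    return " ".join(text.split())


def to_split_pair(record: Dict[str, str], sep: str = SEP) -> Tuple[str, str]:
    """Return (source, target) for splitting: complex -> two simple sentences."""
    source = normalize(record["complex_sentence"])
    target = sep.join(
        normalize(record[key])
        for key in ("simple_sentence_1", "simple_sentence_2")
    )
    return source, target


def to_merge_pair(record: Dict[str, str], sep: str = SEP) -> Tuple[str, str]:
    """Return (source, target) for merging: two simple sentences -> complex."""
    source, target = to_split_pair(record, sep)
    return target, source


# Made-up record in the same shape as the card's example (not real data).
example = {
    "complex_sentence": " The cat sat on the mat and it purred . ",
    "simple_sentence_1": " The cat sat on the mat . ",
    "simple_sentence_2": " It purred . ",
}

if __name__ == "__main__":
    print(TOTAL_EXAMPLES)
    print(to_split_pair(example))
    print(to_merge_pair(example))
```

Keeping normalization in one helper means both directions see identical text. Whether to strip the padding at all is a design choice; a model meant to reproduce the tokenized style of the source may want to keep the spaces around punctuation, which this sketch does.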

Provider:
maas
Created:
2025-07-07