WikiAtomicEdits
Source: arXiv | Updated: 2018-08-29 | Indexed: 2024-06-21
Download link:
http://goo.gl/language/wiki-atomic-edits
Overview:
WikiAtomicEdits is a multilingual dataset created by the Google AI Language team. It contains 43 million Wikipedia edits across 8 languages, each an atomic edit in which a human editor inserted or deleted a single contiguous phrase. The edits were extracted at the sentence level from historical snapshots of Wikipedia using an efficient algorithm. The dataset is intended for research on semantics, discourse, and representation learning: because it records how editors change sentences, it provides semantic and discourse signals that differ from those found in standard corpora.
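To make the notion of an atomic edit concrete, the following is a minimal sketch of how one might check whether two versions of a sentence differ by a single contiguous inserted or deleted phrase. It is not the dataset authors' extraction pipeline: the function name `atomic_edit` is hypothetical, and whitespace tokenization is an assumption made only for illustration.

```python
from typing import Optional, Tuple


def atomic_edit(before: str, after: str) -> Optional[Tuple[str, str]]:
    """Return (kind, phrase) if the sentences differ by exactly one
    contiguous token span, where kind is "insertion" or "deletion".
    Return None otherwise. Tokenization by whitespace is an assumption."""
    a, b = before.split(), after.split()
    if len(a) == len(b):
        return None  # same length: cannot be a pure insertion or deletion

    kind = "insertion" if len(b) > len(a) else "deletion"
    short, long_ = (a, b) if len(a) < len(b) else (b, a)

    # Length of the longest common prefix.
    p = 0
    while p < len(short) and short[p] == long_[p]:
        p += 1

    # Length of the longest common suffix that does not overlap the prefix.
    s = 0
    while s < len(short) - p and short[-1 - s] == long_[-1 - s]:
        s += 1

    # The edit is atomic only if prefix + suffix cover the shorter sentence.
    if p + s != len(short):
        return None

    phrase = long_[p:len(long_) - s]
    return kind, " ".join(phrase)


if __name__ == "__main__":
    print(atomic_edit("She is a doctor", "She is a famous doctor"))
    print(atomic_edit("He quickly ran home", "He ran home"))
    print(atomic_edit("a b c d", "a x c y d"))
```

A pair that changes two separate places, such as the last example, is rejected because the shared prefix and suffix do not account for every token of the shorter sentence.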
Provider:
Google AI Language
Created:
2018-08-29



