conceptofmind/MegaWika
Hugging Face; updated 2024-11-19; indexed 2024-12-14
Download link:
https://hf-mirror.com/datasets/conceptofmind/MegaWika
Description:
MegaWika is a multi- and crosslingual text dataset containing 30 million Wikipedia passages with their scraped and cleaned web citations. The passages span 50 Wikipedias in 50 languages, and the articles in which the passages were originally embedded are included for convenience. Where a Wikipedia passage is in a non-English language, an automated English translation is provided. Furthermore, nearly 130 million English question/answer pairs were extracted from the passages, and FrameNet events occurring in the passages are detected using the LOME FrameNet parser. The dataset is divided by language, and the data for each of the 50 languages is further chunked into discrete JSON lines files. Each instance contains the data extracted from a single Wikipedia article.
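Because each language's data is stored as JSON lines files with one article per instance, a chunk can be read one record at a time. The following is a minimal sketch using only the Python standard library. The field names `title` and `passages` and the sample records are hypothetical placeholders for illustration, not the dataset's documented schema.

```python
import json
from typing import Iterable, Iterator


def iter_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a JSON lines stream."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines between records
            yield json.loads(line)


# Hypothetical two-line chunk. The keys "title" and "passages" are
# assumptions made for this example, not the real MegaWika field names.
sample_chunk = [
    '{"title": "Example article A", "passages": ["p1", "p2"]}',
    '{"title": "Example article B", "passages": ["p3"]}',
]

records = list(iter_jsonl(sample_chunk))  # one dict per article
passage_count = sum(len(r["passages"]) for r in records)
```

The same function accepts an open file handle, for example `open(path, encoding="utf-8")` for a downloaded chunk, since iterating over a text file yields its lines. Reading line by line avoids loading a whole chunk into memory at once.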
Provided by:
conceptofmind



