conceptofmind/MegaWika

Source: Hugging Face. Updated: 2024-11-19. Indexed: 2024-12-14.
Download link:
https://hf-mirror.com/datasets/conceptofmind/MegaWika
Description:
MegaWika is a multi- and crosslingual text dataset containing 30 million Wikipedia passages with their scraped and cleaned web citations. The passages span 50 Wikipedias in 50 languages, and the articles in which the passages were originally embedded are included for convenience. Where a Wikipedia passage is in a non-English language, an automated English translation is provided. Furthermore, nearly 130 million English question/answer pairs were extracted from the passages, and FrameNet events occurring in the passages are detected using the LOME FrameNet parser. The dataset is divided by language, and the data for each of the 50 languages is further chunked into discrete JSON lines files. Each instance contains the data extracted from a single Wikipedia article.
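Because each language's data is chunked into JSON lines files with one instance per Wikipedia article, a downloaded chunk can be streamed one line at a time rather than loaded whole. The following is a minimal sketch only: the `.jsonl` file extension, the directory layout, and the helper names `iter_jsonl` and `count_articles` are assumptions made for illustration, and no field names are assumed because the description above does not list them.

```python
import json
from pathlib import Path
from typing import Any, Dict, Iterator


def iter_jsonl(path: Path) -> Iterator[Dict[str, Any]]:
    """Yield one parsed JSON object per non-empty line of a JSON lines file."""
    with path.open("r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines between records
                yield json.loads(line)


def count_articles(chunk_dir: Path) -> int:
    """Count instances (one per Wikipedia article) across every chunk in a directory.

    Assumes chunks are stored as *.jsonl files directly inside chunk_dir;
    adjust the glob pattern to match the actual file names.
    """
    return sum(
        1
        for chunk in sorted(chunk_dir.glob("*.jsonl"))
        for _ in iter_jsonl(chunk)
    )
```

Streaming with a generator keeps memory use flat regardless of chunk size, which matters for a dataset of this scale. To see which fields an instance actually carries, inspect the keys of the first record a chunk yields.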
Provider:
conceptofmind