megawika-2
ModelScope community listing. Updated 2025-11-27; indexed 2025-11-03.
Download link: https://modelscope.cn/datasets/jhu-clsp/megawika-2
Resource description:
# MegaWika 2
MegaWika 2 is an improved multilingual text dataset containing a structured view of Wikipedia articles, the web sources they cite, source text quality estimates, article text translations, and additional article enrichments.
**Note:** Web citations (sources) in the HuggingFace dataset do not include scraped source text; use [rehydrate-citations.py](https://huggingface.co/datasets/jhu-clsp/megawika-2/blob/main/rehydrate-citations.py) to rehydrate them.
The initial data release is based on Wikipedia dumps from May 1, 2024.
In total, the data contains about 77 million articles and 71 million scraped web citations.
The English collection, the largest, contains about 10 million articles and 24 million scraped web citations.
In the future, we may release *deltas,* collections of articles that have been added or changed since the initial dump (or since the previous delta release).
We expect only a fraction of the articles to change between dumps, so deltas should be significantly smaller than the initial collection.
## Quick Links
- [Dataset on HuggingFace](https://hf.co/datasets/jhu-clsp/megawika-2)
- [Online documentation](https://megawika.ccmaymay.net/) including browsable data schema
- [Whitepaper on ArXiv](https://arxiv.org/abs/2508.03828) including dataset details and analysis
- [MegaWika 1 Preprint on ArXiv](https://arxiv.org/abs/2307.07049)
## Languages Covered
As in MegaWika 1, MegaWika 2 spans 50 languages, including English, designated by their two-character ISO 639-1 language code:
- `af`: Afrikaans
- `ar`: Arabic
- `az`: Azeri (Azerbaijani)
- `bn`: Bengali
- `cs`: Czech
- `de`: German (Deutsch)
- `en`: English
- `es`: Spanish (Español)
- `et`: Estonian
- `fa`: Farsi (Persian)
- `fi`: Finnish
- `fr`: French
- `ga`: Irish (Gaelic)
- `gl`: Galician
- `gu`: Gujarati
- `he`: Hebrew
- `hi`: Hindi
- `hr`: Croatian
- `id`: Indonesian
- `it`: Italian
- `ja`: Japanese
- `ka`: Georgian (Kartvelian/Kartlian)
- `kk`: Kazakh
- `km`: Khmer
- `ko`: Korean
- `lt`: Lithuanian
- `lv`: Latvian
- `mk`: Macedonian (Makedonski)
- `ml`: Malayalam
- `mn`: Mongolian
- `mr`: Marathi
- `my`: Burmese (Myanmar language)
- `ne`: Nepali
- `nl`: Dutch (Nederlands)
- `pl`: Polish
- `ps`: Pashto
- `pt`: Portuguese
- `ro`: Romanian
- `ru`: Russian
- `si`: Sinhalese (Sri Lankan language)
- `sl`: Slovenian
- `sv`: Swedish (Svenska)
- `ta`: Tamil
- `th`: Thai
- `tr`: Turkish
- `uk`: Ukrainian
- `ur`: Urdu
- `vi`: Vietnamese
- `xh`: Xhosa
- `zh`: Chinese (Zhōngwén)
## Dataset Structure
### Directory Structure
The MegaWika 2 dataset consists of a list of directories, one for each language, designated by its language code.
Each language subdirectory contains a list of chunks in JSON-lines format, where each chunk contains up to 1,000 articles, and each line of a chunk file is a distinct JSON-encoded Wikipedia article:
```
─ en/
├─ data/
│ ├─ 000000001.jsonl
│ ├─ 000000002.jsonl
│ └─ [...]
└─ metrics.json
```
Each language subdirectory also contains language-specific summary statistics (`metrics.json`) and a directory containing the data chunks (`data`).
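
The layout above can be read with the Python standard library alone. The following is a minimal sketch, assuming the dataset has been downloaded to a local directory; the function names `iter_articles` and `load_metrics` and the `root` argument are illustrative and not part of the dataset tooling.

```python
import json
from pathlib import Path
from typing import Any, Iterator


def iter_articles(root: str, lang: str) -> Iterator[dict]:
    """Yield one article dict per JSON line from <root>/<lang>/data/*.jsonl."""
    data_dir = Path(root) / lang / "data"
    # Chunk names are zero-padded, so lexicographic order matches numeric order.
    for chunk_path in sorted(data_dir.glob("*.jsonl")):
        with chunk_path.open(encoding="utf-8") as chunk:
            for line in chunk:
                if line.strip():  # tolerate blank lines
                    yield json.loads(line)


def load_metrics(root: str, lang: str) -> Any:
    """Parse the per-language summary statistics file <root>/<lang>/metrics.json."""
    with (Path(root) / lang / "metrics.json").open(encoding="utf-8") as f:
        return json.load(f)
```

Because `iter_articles` is a generator, it holds only one line in memory at a time, which matters for collections with millions of articles.
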
### JSON Schema
The full data schema for MegaWika 2 is described in [`schema.md`](https://huggingface.co/datasets/jhu-clsp/megawika-2/blob/main/schema.md).
Among other things, each article object contains the article title, the article's raw wikicode and parsed text, and a hierarchy of objects representing the article structure.
At a high level, this hierarchy is organized as follows:
* The top level of this hierarchy is a list of headings, paragraphs, tables, infoboxes, and other block-level elements.
* These block-level elements contain various sub-elements; for example, each paragraph contains a list of sentences.
* Each sentence contains the sentence text, translated (English) sentence text, and a list of citations.
* Each citation includes the raw wikicode content, the character index of the citation in the sentence text, an optional citation URL, and optional scraped citation source text.
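
The nesting described above (article, block-level elements, sentences, citations) can be traversed without hard-coding every level. The sketch below uses a generic recursive search over nested dicts and lists. Every field name in the toy article (`elements`, `sentences`, `citations`, `char_index`, `url`) is a hypothetical stand-in chosen for illustration; consult `schema.md` for the actual names.

```python
from typing import Any, Iterator


def iter_values(node: Any, key: str) -> Iterator[Any]:
    """Depth-first search of nested dicts/lists, yielding every value stored under `key`."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            yield from iter_values(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from iter_values(item, key)


# Toy article: ALL field names here are hypothetical, not the real schema.
article = {
    "title": "Example",
    "elements": [
        {"type": "heading", "text": "History"},
        {
            "type": "paragraph",
            "sentences": [
                {"text": "A claim.", "citations": [{"char_index": 8, "url": "https://example.org/a"}]},
                {"text": "No source here.", "citations": []},
            ],
        },
    ],
}

# Collect the URL of every citation that has one (the URL is optional).
urls = [c["url"] for cites in iter_values(article, "citations") for c in cites if c.get("url")]
print(urls)  # ['https://example.org/a']
```
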
## Statistics
The metrics files (for example, `en/metrics.json`) provide statistics describing the data collected for each language.
MegaWika 2 features greater coverage than MegaWika 1, including marked improvements in recall for the citation detection and source scraping/extraction processes:
| Metric | Version 1 | Version 2.0 | Increase |
|---------------------------------------|-----------:|-------------:|--------------:|
| Articles Collected | 2,072,726 | 9,841,417 | 375% |
| Web Citations Detected | 17,368,499 | 57,431,369 | 231% |
| Web Citations Successfully Scraped | 5,623,386 | 23,544,500 | 319% |
| Web Citation Scrape/Extraction Recall | 32% | 41% | 27% (relative) |
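
The "Increase" column can be re-derived from the two version columns. The short check below assumes that the recall row is the ratio of successfully scraped citations to detected citations, which is consistent with the figures in the table; the variable and function names are illustrative.

```python
def percent_increase(old: float, new: float) -> float:
    """Relative change from old to new, expressed in percent."""
    return (new - old) / old * 100.0


v1 = {"articles": 2_072_726, "detected": 17_368_499, "scraped": 5_623_386}
v2 = {"articles": 9_841_417, "detected": 57_431_369, "scraped": 23_544_500}

for name in v1:
    print(f"{name}: +{percent_increase(v1[name], v2[name]):.0f}%")

# Assumed definition: recall = scraped citations / detected citations.
recall_v1 = v1["scraped"] / v1["detected"]
recall_v2 = v2["scraped"] / v2["detected"]
relative_gain = percent_increase(recall_v1, recall_v2)
print(f"recall: {recall_v1:.0%} -> {recall_v2:.0%} (+{relative_gain:.0f}% relative)")
```

Rounded to whole percentages, these values match the table: 375%, 231%, and 319% for the three counts, and a recall of 32% rising to 41%, a 27% relative gain.
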
## Changelog
These entries summarize differences between versions; see the data schema in [`schema.md`](https://huggingface.co/datasets/jhu-clsp/megawika-2/blob/main/schema.md) for details.
### 2.0 (Differences from MegaWika 1)
MegaWika version 2 introduces a comprehensive redesign of the MegaWika data structure.
MegaWika 2 captures not just passage/source pairs but also how the text, and the sources cited in that text, relate to the structure of the surrounding Wikipedia article.
Specifically, each article contains a structured element list parsed from the original Wikitext; the Wikitext is also provided for reference.
Paragraph elements in MegaWika 2 contain sentence-segmented text, further facilitating downstream research.
In parallel, each article contains a list of excerpts (called *passages* in MegaWika 1), each with one or more citations attached. By contrast, MegaWika 1 used passage-citation pairs, which supported only one citation per passage.
MegaWika 2.0 does not include translation probabilities, "repetitious translation" annotations, source language ID, or generated question-answer pairs as in MegaWika 1, but it does add a large amount of other metadata, including article creation and last revision dates, cross-lingual links, short source/citation snippets provided by authors, and source text quality estimates.
Along the way, we have improved the recall of the citation extraction process by (among other changes):
- Adding support for named citation resolution
- Expanding the coverage of citation syntax understood by the citation detector
- Including not just citations with scrapable URLs, but *all* citations, to support researchers who may want to study Wikipedia citation behavior in general, and across languages
- Increasing the scraped source code size limit
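
To illustrate the first item: in wikitext, a citation can be defined once as `<ref name="x">...</ref>` and reused elsewhere as the self-closing `<ref name="x" />`. Resolving named citations means attaching the full definition to each reuse. The sketch below is a simplified, regex-based illustration of that idea and is not the MegaWika 2 implementation. It handles only double-quoted names with no other attributes, whereas real wikitext also allows unquoted names and extra attributes.

```python
import re

# Defining form: <ref name="x">content</ref>; reuse form: <ref name="x" />
_DEFINITION = re.compile(r'<ref\s+name="([^"]+)"\s*>(.*?)</ref>', re.DOTALL)
_REUSE = re.compile(r'<ref\s+name="([^"]+)"\s*/>')


def resolve_named_refs(wikitext: str) -> str:
    """Replace self-closing named refs with the full definition that shares their name."""
    definitions = {name: body for name, body in _DEFINITION.findall(wikitext)}

    def expand(match: re.Match) -> str:
        name = match.group(1)
        if name not in definitions:
            return match.group(0)  # leave unresolved references untouched
        return f'<ref name="{name}">{definitions[name]}</ref>'

    return _REUSE.sub(expand, wikitext)


text = 'First claim.<ref name="smith">Smith 2020</ref> Second claim.<ref name="smith" />'
resolved = resolve_named_refs(text)
print(resolved)
```

Without this step, a detector that only looks at citation bodies would attribute no source to the second sentence, even though it cites the same work as the first.
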
Statistics characterizing the improved recall in citation detection are provided in the [Statistics](#statistics) section.
Additional statistics are provided in the metrics files (for example, `en/metrics.json`) in the dataset.
MegaWika 2 also introduces improvements to error handling, providing higher coverage across the board.
Errors and metadata for source scraping and extraction are included in the data, enabling analysis of sources of missing data and potential biases in the data.
For additional details and analysis of the MegaWika 2.0 dataset and its construction, please see our [whitepaper on ArXiv](https://arxiv.org/abs/2508.03828).
Provider: maas
Created: 2025-09-10



