
open-index/open-wikipedia

Hugging Face | Updated 2026-04-09 | Indexed 2026-04-05
Download link:
https://hf-mirror.com/datasets/open-index/open-wikipedia
Resource description:
---
language:
- la
license: cc-by-sa-4.0
task_categories:
- text-generation
- feature-extraction
- text-classification
- question-answering
- summarization
- translation
pretty_name: Open Wikipedia (Wikitext)
tags:
- wikipedia
- encyclopedia
- knowledge
- wikitext
- mediawiki
- multilingual
- wikimedia
- open-data
size_categories:
- 100K<n<1M
configs:
- config_name: la
  data_files:
  - split: train
    path: data/la/*.parquet
dataset_info:
- config_name: la
  features:
  - name: id
    dtype: int64
  - name: title
    dtype: string
  - name: wikitext
    dtype: string
  - name: url
    dtype: string
  - name: lang
    dtype: string
  - name: length
    dtype: int32
  - name: timestamp
    dtype: string
  splits:
  - name: train
    num_examples: 139421
---

# Open Wikipedia (Wikitext)

> Every Wikipedia article in original MediaWiki markup, 139.4K articles across 1 language

## What is it?

This dataset contains every article from every language edition of [Wikipedia](https://www.wikipedia.org/) in its **original MediaWiki wikitext source markup**. Nothing has been converted, stripped, or simplified. Templates, infoboxes, references, tables, categories, file links, and all other MediaWiki constructs are preserved exactly as the Wikipedia editors wrote them.

The source data comes from the official [Wikimedia database dumps](https://dumps.wikimedia.org/). Each language's full XML export is streamed, parsed, and stored article by article as sharded Apache Parquet files with Zstandard compression, organized by language.

This is the raw source variant of the Open Wikipedia collection. If you want articles converted to clean Markdown, see [open-index/open-wikipedia-markdown](https://huggingface.co/datasets/open-index/open-wikipedia-markdown). If you want plain text with all markup removed, see [open-index/open-wikipedia-text](https://huggingface.co/datasets/open-index/open-wikipedia-text).

**139.4K articles** | **1 language** | **Last updated: 2026-04-03** | **License: [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/)**

## What is being released?

The dataset is organized as one directory per language, with sharded Parquet files inside each:

```
data/
  en/en-00000.parquet     English, shard 0
     en-00001.parquet     English, shard 1
     ...
  de/de-00000.parquet     German
  fr/fr-00000.parquet     French
  es/es-00000.parquet     Spanish
  ja/ja-00000.parquet     Japanese
  ...
  la/la-00000.parquet     Latin
```

Each Parquet file contains up to 500,000 rows. Languages with fewer articles fit in a single shard. All files use Zstandard compression.

## How to download and use this dataset

### Using DuckDB

DuckDB can read Parquet files directly from Hugging Face without downloading anything first.

```sql
-- Count articles per language
SELECT lang, COUNT(*) AS articles
FROM read_parquet('hf://datasets/open-index/open-wikipedia/data/*/*.parquet')
GROUP BY lang
ORDER BY articles DESC;
```

```sql
-- Find articles that use a specific template
SELECT title, lang, length, url
FROM read_parquet('hf://datasets/open-index/open-wikipedia/data/en/*.parquet')
WHERE wikitext LIKE '%{{Infobox country%'
LIMIT 20;
```

```sql
-- Count how many articles use infoboxes, per language.
-- The filter lives inside the aggregate so that pct is the share of
-- all articles in each language, not of the already-filtered rows.
SELECT lang,
       COUNT(*) FILTER (WHERE wikitext LIKE '%{{Infobox%') AS with_infobox,
       ROUND(COUNT(*) FILTER (WHERE wikitext LIKE '%{{Infobox%') * 100.0 / COUNT(*), 1) AS pct
FROM read_parquet('hf://datasets/open-index/open-wikipedia/data/*/*.parquet')
GROUP BY lang
ORDER BY with_infobox DESC
LIMIT 20;
```

```sql
-- Article size distribution for English
SELECT
    percentile_disc(0.25) WITHIN GROUP (ORDER BY length) AS p25,
    percentile_disc(0.50) WITHIN GROUP (ORDER BY length) AS p50,
    percentile_disc(0.75) WITHIN GROUP (ORDER BY length) AS p75,
    percentile_disc(0.90) WITHIN GROUP (ORDER BY length) AS p90,
    percentile_disc(0.99) WITHIN GROUP (ORDER BY length) AS p99,
    AVG(length)::INT AS avg_length,
    MAX(length) AS max_length
FROM read_parquet('hf://datasets/open-index/open-wikipedia/data/en/*.parquet');
```

```sql
-- Find the longest article in each language, then list the 20 largest
SELECT lang, title, length, url
FROM (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY lang ORDER BY length DESC) AS rn
    FROM read_parquet('hf://datasets/open-index/open-wikipedia/data/*/*.parquet')
)
WHERE rn = 1
ORDER BY length DESC
LIMIT 20;
```

### Using `datasets`

```python
from datasets import load_dataset

# Load English Wikipedia
ds = load_dataset("open-index/open-wikipedia", "en")
print(ds["train"][0]["title"])
print(ds["train"][0]["wikitext"][:500])

# Stream the full dataset without downloading everything
ds = load_dataset("open-index/open-wikipedia", "en", split="train", streaming=True)
for item in ds:
    print(item["title"], item["length"])

# Load a specific language
ds = load_dataset("open-index/open-wikipedia", "de")
print(f"German articles: {len(ds['train']):,}")
```

### Using `huggingface_hub`

```python
from huggingface_hub import snapshot_download

# Download only English
snapshot_download(
    "open-index/open-wikipedia",
    repo_type="dataset",
    local_dir="./wiki-wt/",
    allow_patterns="data/en/*",
)
```

For faster downloads, run `pip install huggingface_hub[hf_transfer]` and set `HF_HUB_ENABLE_HF_TRANSFER=1`.

### Using the CLI

```bash
# Download a single language
huggingface-cli download open-index/open-wikipedia \
    --include "data/la/*" \
    --repo-type dataset --local-dir ./wiki-wt/
```

### Template analysis

Since the wikitext is unmodified, you can analyze template usage patterns across Wikipedia:

```python
import re
from collections import Counter
from datasets import load_dataset

ds = load_dataset("open-index/open-wikipedia", "en", split="train")

# Count template usage across all English articles.
# Note: \w+ captures only the first word of a template name,
# so "{{Infobox country" is counted as "Infobox".
template_counts = Counter()
for row in ds:
    for m in re.finditer(r'\{\{(\w+)', row["wikitext"]):
        template_counts[m.group(1)] += 1

# Top 20 most-used templates
for name, count in template_counts.most_common(20):
    print(f"  {name}: {count:,}")
```

## Dataset statistics

### Languages

| Language | Code | Articles | Shards |
|----------|------|----------|--------|
| Latin | `la` | 139.4K | 1 |

## Schema

Every Parquet file shares the same schema:

| Column | Type | Description |
|--------|------|-------------|
| `id` | `int64` | Wikipedia page ID, unique within each language edition |
| `title` | `string` | Article title as it appears on Wikipedia |
| `wikitext` | `string` | Full article body in original MediaWiki wikitext markup |
| `url` | `string` | Direct URL to the Wikipedia article |
| `lang` | `string` | ISO 639 language code (e.g. `en`, `de`, `fr`, `ja`) |
| `length` | `int32` | Wikitext body length in bytes |
| `timestamp` | `string` | Last revision timestamp in ISO 8601 format |

### Example instance

Here is an example row from the English partition, showing the raw wikitext source:

```json
{
  "id": 12,
  "title": "Anarchism",
  "wikitext": "{{Short description|Political philosophy and movement}}\n{{pp-semi-indef}}\n{{Use British English|date=August 2021}}\n'''Anarchism''' is a [[political philosophy]] and [[Political movement|movement]] that is against all forms of [[authority]]...",
  "url": "https://en.wikipedia.org/wiki/Anarchism",
  "lang": "en",
  "length": 128934,
  "timestamp": "2025-12-15T08:22:01Z"
}
```

The `wikitext` field contains the complete source markup as stored in the Wikipedia database. This includes templates, references, categories, file links, and all other MediaWiki syntax.

## What is included in the wikitext

Unlike the Markdown and plain text variants, this dataset preserves every element of the source markup:

| Element | Example |
|---------|---------|
| Templates | `{{Infobox country\|name=...}}` |
| Internal links | `[[United States\|US]]` |
| External links | `[https://example.com text]` |
| Headings | `== Section ==`, `=== Subsection ===` |
| Formatting | `'''bold'''`, `''italic''` |
| Tables | `{\| class="wikitable" ... \|}` |
| References | `<ref name="...">...</ref>` |
| Categories | `[[Category:Countries]]` |
| File and image links | `[[File:Map.svg\|thumb\|Caption]]` |
| HTML tags | `<code>`, `<pre>`, `<syntaxhighlight>` |
| Magic words | `__NOTOC__`, `__FORCETOC__` |
| Comments | `<!-- editor notes -->` |
| Parser functions | `{{#if:...\|...\|...}}` |
| Lua module calls | `{{#invoke:Module\|function}}` |

## How it works

The pipeline processes each Wikipedia language edition (currently 1) through the following steps:

1. **Download.** The latest `{lang}wiki-latest-pages-articles.xml.bz2` dump is streamed from [dumps.wikimedia.org](https://dumps.wikimedia.org/). Downloads support HTTP range resumption, so interrupted transfers pick up where they left off.
2. **Parse.** A streaming XML parser processes the bzip2-compressed dump without extracting it to disk. Only namespace-0 pages (articles) are kept. Redirects, talk pages, user pages, and all other namespaces are skipped.
3. **Filter.** Articles shorter than 100 bytes (based on their plain-text equivalent) are excluded. This removes stubs, disambiguation pages, and other pages with minimal content.
4. **Shard.** Articles are written to Zstandard-compressed Parquet files, approximately 500,000 rows per shard. Multiple languages are processed in parallel using a worker pool.
5. **Publish.** Each language's shards are committed to this Hugging Face repository as they complete.

Note that no conversion or transformation is applied to the wikitext content. The text you see in this dataset is the same text that Wikipedia's MediaWiki engine processes to render the pages you see in your browser.

## Considerations

### Why preserve the raw wikitext?

The Markdown and plain text variants of this dataset apply lossy transformations to the source content. Templates are stripped, tables are removed, and references are discarded. While this produces cleaner output for most NLP tasks, it also throws away structured information.

The raw wikitext is useful when you need to:

- **Build or benchmark wikitext parsers.** If you are developing a MediaWiki parser or converter, this dataset provides millions of real-world test cases across hundreds of languages.
- **Analyze template usage patterns.** Infoboxes, citation templates, and navigation templates encode structured knowledge that is lost in other formats.
- **Extract structured data from infoboxes.** Many Wikipedia articles have infobox templates that contain key-value pairs, which can be parsed to build structured knowledge bases.
- **Study editing conventions.** The raw markup reveals how different language communities organize and format their articles.
- **Train markup-aware models.** Language models trained on wikitext can learn to generate or complete MediaWiki markup, which is useful for editing tools and bots.

### Known limitations

- **This is the raw page source, not rendered HTML.** Templates are not expanded, parser functions are not evaluated, and Lua modules are not executed. What you see is the source code, not the output.
- **One snapshot in time.** This dataset represents a single snapshot of each language's dump. It does not include edit history or article revisions.
- **Dump availability varies.** Not all language editions have their dumps available at all times.
- **Some articles are very large.** A small number of articles (lists, timelines, reference pages) can have wikitext bodies exceeding 1 MB.

## Related datasets

- [open-index/open-wikipedia-markdown](https://huggingface.co/datasets/open-index/open-wikipedia-markdown) - Same articles converted to clean Markdown. Headings, bold, italic, code blocks, and links are preserved while templates and references are stripped.
- [open-index/open-wikipedia-text](https://huggingface.co/datasets/open-index/open-wikipedia-text) - Same articles as pure plain text with all formatting and markup removed. Smallest files, best for embeddings and classification.

## Thanks

The content in this dataset was written by millions of Wikipedia editors worldwide and is hosted by the [Wikimedia Foundation](https://www.wikimedia.org/). The raw data comes from the [Wikimedia database dumps](https://dumps.wikimedia.org/), which the Foundation makes freely available for download.

Wikipedia is one of humanity's greatest collaborative achievements. All credit for the content goes to the volunteer editors who write, review, and maintain it. This dataset is an independent mirror and is not affiliated with or endorsed by the Wikimedia Foundation.

## Licensing

Wikipedia content is released under the [Creative Commons Attribution-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-sa/4.0/) (CC BY-SA 4.0). This dataset inherits that license. If you redistribute or build upon this data, you must give appropriate credit and share your contributions under the same license.

## Citation

```bibtex
@dataset{open_wikipedia,
  title     = {Open Wikipedia (Wikitext)},
  author    = {Open Index},
  year      = {2026},
  url       = {https://huggingface.co/datasets/open-index/open-wikipedia},
  license   = {CC BY-SA 4.0},
  publisher = {Hugging Face}
}
```

_Last updated: 2026-04-03_
Provider:
open-index
Collected summary
Dataset introduction
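Construction
In digital humanities and natural language processing, building large, high-quality corpora is key to research progress. The Open Wikipedia (Wikitext) dataset is built through a systematic engineering pipeline. Its source data comes directly from the official database dumps published by the Wikimedia Foundation, and a streaming XML parser extracts articles one at a time from the compressed dump files. Processing filters strictly by namespace, keeping only main-namespace articles, and drops stub pages whose content is too short. Articles are stored in their original MediaWiki wikitext markup with no conversion or simplification, so templates, infoboxes, references, and all other structured elements are preserved intact. The data is stored as sharded Apache Parquet files with Zstandard compression, which keeps access and organization efficient.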
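The parse and filter stages described above can be sketched with the Python standard library. This is a minimal illustration under stated assumptions, not the pipeline's actual code: the function names (`local`, `iter_articles`) and the in-memory sample dump are hypothetical, and the 100-byte threshold is applied to the raw wikitext here, whereas the dataset card applies it to a plain-text equivalent.

```python
import bz2
import io
import xml.etree.ElementTree as ET

MIN_BYTES = 100  # threshold from the dataset card; applied to raw wikitext here (assumption)


def local(tag):
    """Strip the XML namespace prefix: '{uri}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_articles(stream, min_bytes=MIN_BYTES):
    """Yield (page_id, title, wikitext) for namespace-0, non-redirect pages."""
    for _, elem in ET.iterparse(stream, events=("end",)):
        if local(elem.tag) != "page":
            continue
        fields = {"redirect": False, "text": ""}
        for child in elem:
            name = local(child.tag)
            if name in ("title", "ns", "id"):
                fields[name] = child.text
            elif name == "redirect":
                fields["redirect"] = True
            elif name == "revision":
                for sub in child:
                    if local(sub.tag) == "text":
                        fields["text"] = sub.text or ""
        elem.clear()  # drop the finished page's subtree before moving on
        if fields.get("ns") != "0" or fields["redirect"]:
            continue  # skip talk pages, other namespaces, and redirects
        if len(fields["text"].encode("utf-8")) < min_bytes:
            continue  # skip very short pages
        yield int(fields["id"]), fields["title"], fields["text"]


# Hypothetical four-page dump: one article, one talk page, one redirect, one stub.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.11/">
  <page><title>Roma</title><ns>0</ns><id>1</id>
    <revision><id>10</id><text>{body}</text></revision></page>
  <page><title>Disputatio:Roma</title><ns>1</ns><id>2</id>
    <revision><id>11</id><text>{body}</text></revision></page>
  <page><title>Urbs</title><ns>0</ns><id>3</id><redirect title="Roma" />
    <revision><id>12</id><text>#REDIRECT [[Roma]]</text></revision></page>
  <page><title>Brevis</title><ns>0</ns><id>4</id>
    <revision><id>13</id><text>Brevis.</text></revision></page>
</mediawiki>""".format(body="'''Roma''' est urbs [[Italia]]e. " * 5)

# Read the bzip2 stream directly, without extracting it to disk first.
compressed = bz2.compress(SAMPLE.encode("utf-8"))
with bz2.open(io.BytesIO(compressed), "rb") as stream:
    articles = list(iter_articles(stream))

print([(page_id, title) for page_id, title, _ in articles])
```

`ET.iterparse` handles one `<page>` at a time, which is what lets a large dump be processed without loading it whole. A production version would also need to clear references held by the root element and write rows out to Parquet, both of which are omitted here.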
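Features
The dataset's defining characteristic is that its content is raw and complete. Unlike the common plain-text or Markdown conversions, it keeps each Wikipedia article's native wikitext markup in full, including every template call, internal link, formatting construct, and piece of metadata. This raw format has distinct research value: it makes it possible to analyze template usage patterns, extract structured knowledge, or train language models that understand wiki markup. The dataset is organized by language under a standardized schema (the release described here contains the Latin edition, `la`), and every article carries its title, URL, timestamp, and text length as metadata, which supports cross-language comparison and large-scale analysis.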
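As one illustration of the template-usage analysis mentioned above, the sketch below counts template names in raw wikitext, including multi-word names such as `Infobox country`. It is a rough approximation under stated assumptions: the names `TEMPLATE_NAME`, `template_names`, and `count_templates` are hypothetical, parser functions starting with `#` are skipped, and magic variables such as `{{PAGENAME}}` are counted as if they were templates.

```python
import re
from collections import Counter

# Match '{{', then a name that runs up to the first '|' or '}}'.
# Excluding '#' skips parser functions such as {{#if:...}}.
TEMPLATE_NAME = re.compile(r"\{\{\s*([^{}|#<>\[\]\n]+?)\s*(?=\||\}\})")


def template_names(wikitext):
    """Yield normalized template names found in one article."""
    for match in TEMPLATE_NAME.finditer(wikitext):
        name = match.group(1).replace("_", " ").strip()
        # On Wikipedia the first letter of a template name is case-insensitive.
        yield name[:1].upper() + name[1:]


def count_templates(texts):
    """Aggregate template counts over an iterable of wikitext strings."""
    counts = Counter()
    for text in texts:
        counts.update(template_names(text))
    return counts


# Hypothetical sample rows standing in for the `wikitext` column.
rows = [
    "{{Short description|Urbs Italiae}}\n{{Infobox country\n| name = Roma\n}}",
    "{{pp-semi-indef}}\n{{#if:x|y|z}} {{infobox country|name=Latium}}",
]
counts = count_templates(rows)
for name, n in counts.most_common():
    print(f"{name}: {n}")
```

The same `count_templates` function could be fed the `wikitext` field of a streamed dataset split. A regex cannot fully model nested or commented-out markup, so treat the counts as estimates.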
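Usage
The dataset can be used in several ways. The Hugging Face `datasets` library loads a specific language partition directly and supports streaming for data too large to download up front. For complex queries or aggregations, DuckDB is recommended: it queries the Parquet files remotely without downloading the whole dataset first. The `huggingface_hub` library and the command-line interface can download the files for a specific language only. Typical applications include analyzing template distribution across all languages, training dedicated parsers on the raw markup, and extracting structured knowledge from infoboxes to build knowledge graphs.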
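The infobox extraction use case mentioned above can be sketched as follows. This is a simplified illustration, not a full wikitext parser: the helper names (`find_template`, `split_top_level`, `parse_infobox`) are hypothetical, matching on the `Infobox` prefix is case-sensitive, and constructs such as `<nowiki>`, HTML comments, and `{{{parameter}}}` triple braces are not handled.

```python
def find_template(wikitext, prefix="Infobox"):
    """Return the body of the first {{prefix ...}} call without its outer braces."""
    start = wikitext.find("{{" + prefix)
    if start < 0:
        return None
    depth, i = 0, start
    while i < len(wikitext) - 1:
        pair = wikitext[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}":
            depth -= 1
            i += 2
            if depth == 0:
                return wikitext[start + 2:i - 2]
        else:
            i += 1
    return None  # unbalanced braces


def split_top_level(body):
    """Split on '|' characters that are not nested inside {{...}} or [[...]]."""
    parts, buf, depth, i = [], [], 0, 0
    while i < len(body):
        pair = body[i:i + 2]
        if pair in ("{{", "[["):
            depth += 1
            buf.append(pair)
            i += 2
        elif pair in ("}}", "]]"):
            depth -= 1
            buf.append(pair)
            i += 2
        elif body[i] == "|" and depth == 0:
            parts.append("".join(buf))
            buf = []
            i += 1
        else:
            buf.append(body[i])
            i += 1
    parts.append("".join(buf))
    return parts


def parse_infobox(wikitext):
    """Return {'template': name, 'fields': {key: raw value}} or None."""
    body = find_template(wikitext)
    if body is None:
        return None
    name, *params = split_top_level(body)
    fields = {}
    for param in params:
        key, sep, value = param.partition("=")
        if sep:  # keep named parameters only
            fields[key.strip()] = value.strip()
    return {"template": name.strip(), "fields": fields}


# Hypothetical article text standing in for one `wikitext` value.
text = (
    "{{Short description|Urbs}}\n"
    "{{Infobox country\n"
    "| name = Roma\n"
    "| capital = [[Roma|Rome]]\n"
    "| area = {{convert|1285|km2|sqmi}}\n"
    "}}\n"
    "'''Roma''' est urbs."
)
info = parse_infobox(text)
print(info)
```

Tracking brace depth, rather than using a single regex, is what keeps nested calls such as `{{convert|...}}` and piped links such as `[[Roma|Rome]]` inside their parent field. Values are returned as raw wikitext, so further cleaning is left to the caller.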
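Background and challenges
Background overview
The Open Wikipedia (Wikitext) dataset was built by the Open Index team in 2026 to give natural language processing and knowledge engineering work access to raw, unconverted MediaWiki markup for Wikipedia articles. It is derived from the Wikimedia Foundation's official database dumps, is organized by language edition, and strictly preserves the templates, infoboxes, citations, internal links, and other structured elements written by editors. Its central research question is how raw wikitext can support markup parsing, structured knowledge extraction, and multilingual model training. It provides a data foundation for more precise semantic understanding and content generation systems, and supports work at the intersection of open knowledge representation and computational linguistics.
Current challenges
The dataset targets the basic challenge of extracting precise semantics and knowledge from unstructured or semi-structured text. Handling multilingual, multimodal Wikipedia content in particular means dealing with deeply nested markup and dynamic template resolution. The main construction challenges are twofold. First, an efficient streaming XML parsing pipeline is needed to process TB-scale raw dump data while keeping data complete and consistent under distributed processing. Second, the raw wikitext contains many unexpanded template functions, Lua modules, and cross-language reference structures, which force a trade-off between semantic fidelity and computational efficiency during cleaning and standardization.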
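Common scenarios
Classic use cases
In natural language processing, building and using large corpora is the foundation of model development. The Open Wikipedia (Wikitext) dataset provides unconverted Wikipedia source text in its original MediaWiki markup. This makes it a suitable resource for training and evaluating language models that must understand and generate complex structured text. The classic use case is autoregressive language-model pretraining on raw wikitext, so that a model learns both the knowledge in the articles and the formatting syntax, as a basis for downstream tasks.
Derived and related work
A body of well-known work exists around this kind of dataset and its raw text format. Early on, part of the training data for the GPT series of models included Wikipedia text, which helped establish the knowledge base of large language models. Work aimed specifically at wikitext parsing, such as the WikiExtractor tool, focuses on separating templates from content more precisely. More recent research uses the raw markup to train models for specific tasks, for example entity linking based on template and link prediction, relation extraction by parsing infoboxes, and AI assistants that can edit wiki markup directly. All of this work depends on high-fidelity datasets of this kind.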
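Recent research on the dataset
Latest research directions
In natural language processing, datasets based on raw Wikipedia markup are becoming a focus for structured knowledge extraction and multilingual model training. This dataset preserves the original MediaWiki markup in full, including templates, infoboxes, citations, and other structured elements, giving the research community a rich semantic and syntactic resource. Current research concentrates on using such raw text to train large language models that can understand and generate wiki markup, in support of automated content editing and knowledge base construction. Researchers are also exploring the extraction of entity relations from infobox templates to improve the coverage and accuracy of knowledge graphs. This direction advances semantic parsing in multilingual settings and provides a key data foundation for cultural heritage digitization and low-resource language processing.
The content above was collected and summarized by 遇见数据集.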