finewiki

ModelScope community · updated 2026-05-22 · indexed 2025-11-03
Download link:
https://modelscope.cn/datasets/HuggingFaceFW/finewiki
Resource description:
![FineWiki](https://cdn-uploads.huggingface.co/production/uploads/62596f9e1c0a084224b93e00/nmYXgDCjHwhQpq5NaB4Uv.png)

This is an **updated and better extracted** version of the `wikimedia/Wikipedia` dataset originally released in 2023. We carefully parsed [Wikipedia HTML dumps](https://dumps.wikimedia.org/other/enterprise_html/) from *August 2025* covering 325 languages.

***This dataset:***
- [**fully renders templates**](https://huggingface.co/datasets/wikimedia/wikipedia/discussions/51), as it was extracted from HTML and not markdown dumps
- **removes** redirects, disambiguation pages, and other non-main-article pages
- includes **detailed metadata** such as page ID, title, last modified date, Wikidata ID, version and markdown version of the text
- preserves elements and formatting such as **headings, lists, code/pre blocks, tables and math content**
  - notably, `wikimedia/Wikipedia` removes all **tables and math content**
- **excludes** most of the "References", "See also", "Notes", "External links", and similar **citations/notes sections** across all languages
- besides keeping all math content, flags pages containing math with a **`has_math`** metadata attribute
- **extracts infoboxes** (the high-level summary boxes on the right of some Wikipedia pages) in a **structured format** into the metadata, for RAG and other uses
- only keeps pages whose **script (writing alphabet) matches** the expected list for that language
- for non-English wikis, **removes** any page that is fully or mostly in **English** (a common issue when training language identifiers/classifiers)

## Visualize and Compare

You can explore the dataset, compare it to `wikimedia/Wikipedia` and preview the live Wikipedia pages on our [space](https://huggingface.co/spaces/HuggingFaceFW/finewiki-viewer).
## Available subsets

| Subset | Name | Size | Pages |
|--------|------|------:|-------:|
| `en` | [English](https://en.wikipedia.org) | 35.1 GB | 6,614,655 |
| `de` | [German](https://de.wikipedia.org) | 13.1 GB | 2,713,646 |
| `fr` | [French](https://fr.wikipedia.org) | 12.1 GB | 2,566,183 |
| `ru` | [Russian](https://ru.wikipedia.org) | 10.7 GB | 1,817,813 |
| `ja` | [Japanese](https://ja.wikipedia.org) | 9.9 GB | 1,354,269 |
| `es` | [Spanish](https://es.wikipedia.org) | 8.5 GB | 1,948,965 |
| `it` | [Italian](https://it.wikipedia.org) | 7.4 GB | 1,799,759 |
| `uk` | [Ukrainian](https://uk.wikipedia.org) | 5.4 GB | 1,239,253 |
| `zh` | [Chinese (written vernacular Chinese)](https://zh.wikipedia.org) | 5.1 GB | 1,295,955 |
| `pl` | [Polish](https://pl.wikipedia.org) | 4.4 GB | 1,543,918 |
| `ceb` | [Cebuano](https://ceb.wikipedia.org) | 4.4 GB | 5,647,436 |
| `pt` | [Portuguese](https://pt.wikipedia.org) | 4.3 GB | 1,135,383 |
| `nl` | [Dutch](https://nl.wikipedia.org) | 3.5 GB | 2,072,865 |
| `ca` | [Catalan](https://ca.wikipedia.org) | 3.5 GB | 962,290 |
| `ar` | [Arabic](https://ar.wikipedia.org) | 3.4 GB | 1,230,456 |
| `sv` | [Swedish](https://sv.wikipedia.org) | 2.9 GB | 2,470,063 |
| `cs` | [Czech](https://cs.wikipedia.org) | 2.2 GB | 534,563 |
| `fa` | [Persian](https://fa.wikipedia.org) | 2.2 GB | 1,021,336 |
| `vi` | [Vietnamese](https://vi.wikipedia.org) | 2.1 GB | 1,279,087 |
| `hu` | [Hungarian](https://hu.wikipedia.org) | 2.1 GB | 515,004 |
| `ko` | [Korean](https://ko.wikipedia.org) | 2.0 GB | 582,035 |
| `he` | [Hebrew](https://he.wikipedia.org) | 2.0 GB | 372,053 |
| `sr` | [Serbian](https://sr.wikipedia.org) | 2.0 GB | 664,345 |
| `id` | [Indonesian](https://id.wikipedia.org) | 1.8 GB | 723,099 |
| `tr` | [Turkish](https://tr.wikipedia.org) | 1.6 GB | 629,762 |
| `fi` | [Finnish](https://fi.wikipedia.org) | 1.5 GB | 572,900 |
| `no` | [Norwegian (Bokmål)](https://no.wikipedia.org) | 1.3 GB | 620,802 |
| `el` | [Greek](https://el.wikipedia.org) | 1.2 GB | 242,517 |
| `hy` | [Armenian](https://hy.wikipedia.org) | 1.2 GB | 309,820 |
| `ro` | [Romanian](https://ro.wikipedia.org) | 1.2 GB | 493,462 |
| ... | | | |
| **Total** | | **184.7 GB** | **61,550,610** |

A detailed list is available [here](https://huggingface.co/datasets/HuggingFaceFW/finewiki/blob/main/language_subsets.csv).

## How to download and use 🌐 FineWiki

See the table above for the `subset` of the language you want to download.

We currently do not provide smaller `sample` versions, but by setting `limit` or using `streaming=True` you can easily fetch a sample of the data. If there is interest from the community we might upload smaller sampled versions later on.

### Using 🏭 [`datatrove`](https://github.com/huggingface/datatrove/)

```python
from datatrove.pipeline.readers import ParquetReader

# limit determines how many documents will be streamed (remove for all)
# this will fetch the Portuguese data
data_reader = ParquetReader("hf://datasets/HuggingFaceFW/finewiki/data/ptwiki", limit=1000)
for document in data_reader():
    # do something with document
    print(document)

###############################
# OR for a processing pipeline:
###############################

from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.readers import ParquetReader
from datatrove.pipeline.filters import LambdaFilter
from datatrove.pipeline.writers import JsonlWriter

pipeline_exec = LocalPipelineExecutor(
    pipeline=[
        ParquetReader("hf://datasets/HuggingFaceFW/finewiki/data/ptwiki", limit=1000),
        LambdaFilter(lambda doc: "hugging" in doc.text),
        JsonlWriter("some-output-path")
    ],
    tasks=10
)
pipeline_exec.run()
```

### Using `huggingface_hub`

```python
from huggingface_hub import snapshot_download

folder = snapshot_download(
    "HuggingFaceFW/finewiki",
    repo_type="dataset",
    local_dir="./finewiki/",
    # download the English subset
    allow_patterns=["data/enwiki/*"])
```

### Using `datasets`

```python
from datasets import load_dataset

# get Spanish data
fw = load_dataset("HuggingFaceFW/finewiki", name="eswiki", split="train", streaming=True)
```

## Dataset Structure

### Data Instances

Example from the English subset (values truncated for readability):

```json
{
  "text": "# 10th Tank Corps\nThe 10th Tank Corps was a tank corps of the Red Army, formed twice.\n\n## First Formation\nIn May–June 1938, ...",
  "id": "enwiki/32552979",
  "wikiname": "enwiki",
  "page_id": 32552979,
  "title": "10th Tank Corps",
  "url": "https://en.wikipedia.org/wiki/10th_Tank_Corps",
  "date_modified": "2023-07-26T12:32:03Z",
  "in_language": "en",
  "wikidata_id": "Q12061605",
  "bytes_html": 115017,
  "wikitext": "{{short description|Tank corps of the Soviet military}}\n\n{{Infobox military unit...",
  "version": 1167219203,
  "infoboxes": "[{\"title\": \"10th Tank Corps\", \"data\": {\"Active\": \"...\"}}]",
  "has_math": false
}
```

### Data Fields

- `text` (string): cleaned, structured article text preserving headings, lists, code/pre blocks, tables and math. Has some markdown formatting (headings, tables, lists)
- `id` (string): dataset-unique identifier; typically `<wikiname>/<page_id>`
- `wikiname` (string): wiki project name, e.g., `enwiki`, `ptwiki`
- `page_id` (int): MediaWiki page identifier
- `title` (string): article title
- `url` (string): canonical article URL
- `date_modified` (string): ISO-8601 timestamp of the last page revision
- `in_language` (string): article language code (e.g., `en`, `pt`)
- `wikidata_id` (string|null): Wikidata QID associated with the page
- `bytes_html` (int): size in bytes of the original HTML body
- `wikitext` (string): original wikitext when available
- `version` (int|string): revision/version identifier of the page
- `infoboxes` (string): JSON-encoded array of extracted infobox objects with title and key-value data
- `has_math` (bool): whether math content was detected on the page

## Data Processing

The full pipeline processing code is available [here](https://huggingface.co/datasets/HuggingFaceFW/finewiki/tree/main/src).
It runs on [datatrove](https://github.com/huggingface/datatrove/). While we tried to offer robust support for most language variants of Wikipedia, the lack of standardization at the HTML level means that for some subsets the extraction might be sub-optimal. If this is the case for the languages you are interested in, we recommend adapting our code to address your specific concerns.

### Downloading

We used the Wikimedia Enterprise HTML dump API (`https://api.enterprise.wikimedia.com/v2/snapshots`) and downloaded main-namespace (NS0) snapshots for the different language versions of Wikipedia. We intentionally relied on pre-rendered HTML over the more commonly used wikitext/markdown dumps: wikitext often encodes templates and formatting as parser functions/macros, which makes large sections of wiki pages harder to reconstruct faithfully, whereas the Enterprise HTML already expands those structures. Snapshots from August 2025 were used. We record rich per-page attributes (IDs, titles, URLs, language, version, timestamps, Wikidata IDs) as part of the metadata.

### Extraction

We heavily adapted [mwparserfromhtml](https://pypi.org/project/mwparserfromhtml/) to parse the HTML content into a clean, structured text representation that preserves meaningful formatting. Redirect and disambiguation pages are removed reliably (via redirect markers in wikitext/HTML and disambiguation signals, including Wikidata IDs and page-props). Reference-like sections filled with unnatural, non-article content (e.g., "References", "Notes", "External links", localized per language) are excluded using a curated heading list and structural cues (reference list containers), so citations/notes are dropped without harming the main body. Visual/navigation boilerplate (ToC, navboxes, messageboxes, authority control, categories) is filtered out, while infoboxes are carefully extracted into the metadata as structured key-value data that can be useful for knowledge search applications.
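
Since the `infoboxes` field is stored as a JSON-encoded string, it has to be decoded before the key-value data can be used. The following is a minimal sketch of one way to do that, not part of the FineWiki pipeline: the helper names `parse_infoboxes` and `infobox_facts` are hypothetical, the sample record is modelled on the truncated English example from the Data Instances section, and the handling of a missing or empty field is an assumption.

```python
import json


def parse_infoboxes(record):
    """Decode the JSON-encoded `infoboxes` field into a list of dicts.

    Assumption: a missing, null or empty field is treated as "no infoboxes".
    """
    raw = record.get("infoboxes") or "[]"
    return json.loads(raw)


def infobox_facts(record):
    """Flatten all infoboxes of one record into (infobox_title, key, value) triples."""
    facts = []
    for box in parse_infoboxes(record):
        title = box.get("title", "")
        for key, value in (box.get("data") or {}).items():
            facts.append((title, key, value))
    return facts


# Sample record modelled on the (truncated) English example in this card.
record = {
    "title": "10th Tank Corps",
    "infoboxes": '[{"title": "10th Tank Corps", "data": {"Active": "..."}}]',
    "has_math": False,
}

for fact in infobox_facts(record):
    print(fact)
```

Triples of this shape are easy to index for retrieval or knowledge search; real records may carry richer values than the truncated example shows, so downstream code should not assume every value is a short string.
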
We additionally strive to keep math content (and mark pages containing it with a `has_math` flag) as well as tables, where much of the Wikipedia knowledge is contained.

### Filtering

One common issue with low-resource language Wikipedias is the large prevalence of content from other languages, particularly English (often from articles or boilerplate pages copied over from the English Wikipedia). To ensure language quality and consistency, we apply language- and script-aware checks tailored to each wiki. Pages are kept only if their predicted writing system matches the expected scripts for that language. For non-English wikis, pages that are predominantly English above a confidence threshold are removed to reduce cross-language leakage. We also drop ultra-short pages without infoboxes to avoid low-signal content.

## Licensing Information

This dataset contains text from Wikipedia, licensed under Creative Commons Attribution-ShareAlike 4.0 (CC BY-SA 4.0) and also available under GFDL. See Wikipedia's licensing and Terms of Use: https://dumps.wikimedia.org/legal.html

Our processed release is an adaptation of that text and is licensed under CC BY-SA 4.0.

## Citation Information

```bibtex
@dataset{penedo2025finewiki,
  author = {Guilherme Penedo},
  title = {FineWiki},
  year = {2025},
  publisher = {Hugging Face Datasets},
  url = {https://huggingface.co/datasets/HuggingFaceFW/finewiki},
  urldate = {2025-10-20},
  note = {Source: Wikimedia Enterprise Snapshot API (https://api.enterprise.wikimedia.com/v2/snapshots). Text licensed under CC BY-SA 4.0 with attribution to Wikipedia contributors.}
}
```

Provider: maas
Created: 2025-10-22