
adar1_data

ModelScope community; updated 2025-11-08; indexed 2025-11-08
Download link:
https://modelscope.cn/datasets/wannanfeng/adar1_data
Resource description:
![FineWiki](https://cdn-uploads.huggingface.co/production/uploads/62596f9e1c0a084224b93e00/nmYXgDCjHwhQpq5NaB4Uv.png)

This is an **updated and better extracted** version of the `wikimedia/Wikipedia` dataset originally released in 2023. We carefully parsed [Wikipedia HTML dumps](https://dumps.wikimedia.org/other/enterprise_html/) from *August 2025* covering 325 languages.

***This dataset:***

- [**fully renders templates**](https://huggingface.co/datasets/wikimedia/wikipedia/discussions/51), as it was extracted from HTML and not markdown dumps
- **removes** redirects, disambiguation pages, and other non-main-article pages
- includes **detailed metadata** such as page ID, title, last modified date, Wikidata ID, version and markdown version of the text
- preserves elements and formatting such as **headings, lists, code/pre blocks, tables and math content**
  - notably, `wikimedia/Wikipedia` removes all **tables and math content**
- **excludes** most of the "References", "See also", "Notes", "External links", and similar **citations/notes sections** across all languages
- besides keeping all math content, flags pages containing math with a **`has_math`** metadata attribute
- **extracts infoboxes** (the high-level summary boxes on the right of some Wikipedia pages) in a **structured format** into the metadata, for RAG and other uses
- only keeps pages whose **script (writing system) matches** the expected list for that language
- for non-English wikis, **removes** any page fully or mostly in **English** (a common issue when training language identifiers/classifiers)

## Visualize and Compare

You can explore the dataset, compare it to `wikimedia/Wikipedia` and preview the live Wikipedia pages on our [space](https://huggingface.co/spaces/HuggingFaceFW/finewiki-viewer).

## Available subsets

| Subset | Name | Size | Pages |
|--------|------|------:|-------:|
| `en` | [English](https://en.wikipedia.org) | 35.1 GB | 6,614,655 |
| `de` | [German](https://de.wikipedia.org) | 13.1 GB | 2,713,646 |
| `fr` | [French](https://fr.wikipedia.org) | 12.1 GB | 2,566,183 |
| `ru` | [Russian](https://ru.wikipedia.org) | 10.7 GB | 1,817,813 |
| `ja` | [Japanese](https://ja.wikipedia.org) | 9.9 GB | 1,354,269 |
| `es` | [Spanish](https://es.wikipedia.org) | 8.5 GB | 1,948,965 |
| `it` | [Italian](https://it.wikipedia.org) | 7.4 GB | 1,799,759 |
| `uk` | [Ukrainian](https://uk.wikipedia.org) | 5.4 GB | 1,239,253 |
| `zh` | [Chinese (written vernacular Chinese)](https://zh.wikipedia.org) | 5.1 GB | 1,295,955 |
| `pl` | [Polish](https://pl.wikipedia.org) | 4.4 GB | 1,543,918 |
| `ceb` | [Cebuano](https://ceb.wikipedia.org) | 4.4 GB | 5,647,436 |
| `pt` | [Portuguese](https://pt.wikipedia.org) | 4.3 GB | 1,135,383 |
| `nl` | [Dutch](https://nl.wikipedia.org) | 3.5 GB | 2,072,865 |
| `ca` | [Catalan](https://ca.wikipedia.org) | 3.5 GB | 962,290 |
| `ar` | [Arabic](https://ar.wikipedia.org) | 3.4 GB | 1,230,456 |
| `sv` | [Swedish](https://sv.wikipedia.org) | 2.9 GB | 2,470,063 |
| `cs` | [Czech](https://cs.wikipedia.org) | 2.2 GB | 534,563 |
| `fa` | [Persian](https://fa.wikipedia.org) | 2.2 GB | 1,021,336 |
| `vi` | [Vietnamese](https://vi.wikipedia.org) | 2.1 GB | 1,279,087 |
| `hu` | [Hungarian](https://hu.wikipedia.org) | 2.1 GB | 515,004 |
| `ko` | [Korean](https://ko.wikipedia.org) | 2.0 GB | 582,035 |
| `he` | [Hebrew](https://he.wikipedia.org) | 2.0 GB | 372,053 |
| `sr` | [Serbian](https://sr.wikipedia.org) | 2.0 GB | 664,345 |
| `id` | [Indonesian](https://id.wikipedia.org) | 1.8 GB | 723,099 |
| `tr` | [Turkish](https://tr.wikipedia.org) | 1.6 GB | 629,762 |
| `fi` | [Finnish](https://fi.wikipedia.org) | 1.5 GB | 572,900 |
| `no` | [Norwegian (Bokmål)](https://no.wikipedia.org) | 1.3 GB | 620,802 |
| `el` | [Greek](https://el.wikipedia.org) | 1.2 GB | 242,517 |
| `hy` | [Armenian](https://hy.wikipedia.org) | 1.2 GB | 309,820 |
| `ro` | [Romanian](https://ro.wikipedia.org) | 1.2 GB | 493,462 |
| ... | | | |
| **Total** | | **184.7 GB** | 61,550,610 |

A detailed list is available [here](https://huggingface.co/datasets/HuggingFaceFW/finewiki/blob/main/language_subsets.csv).

## How to download and use 🌐 FineWiki

See the table above for the `subset` of the language you want to download.

We currently do not provide smaller `sample` versions, but by setting `limit` or using `streaming=True` you can easily fetch a sample of the data. If there is interest from the community, we might upload smaller sampled versions later on.

### Using 🏭 [`datatrove`](https://github.com/huggingface/datatrove/)

```python
from datatrove.pipeline.readers import ParquetReader

# limit determines how many documents will be streamed (remove for all)
# this will fetch the Portuguese data
data_reader = ParquetReader("hf://datasets/HuggingFaceFW/finewiki/data/ptwiki", limit=1000)
for document in data_reader():
    # do something with document
    print(document)

###############################
# OR for a processing pipeline:
###############################

from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.readers import ParquetReader
from datatrove.pipeline.filters import LambdaFilter
from datatrove.pipeline.writers import JsonlWriter

pipeline_exec = LocalPipelineExecutor(
    pipeline=[
        ParquetReader("hf://datasets/HuggingFaceFW/finewiki/data/ptwiki", limit=1000),
        LambdaFilter(lambda doc: "hugging" in doc.text),
        JsonlWriter("some-output-path")
    ],
    tasks=10
)
pipeline_exec.run()
```

### Using `huggingface_hub`

```python
from huggingface_hub import snapshot_download

folder = snapshot_download(
    "HuggingFaceFW/finewiki",
    repo_type="dataset",
    local_dir="./finewiki/",
    # download the English subset
    allow_patterns=["data/enwiki/*"])
```

### Using `datasets`

```python
from datasets import load_dataset

# get Spanish data
fw = load_dataset("HuggingFaceFW/finewiki", name="eswiki", split="train", streaming=True)
```

## Dataset Structure

### Data Instances

Example from the English subset (values truncated for readability):

```json
{
  "text": "# 10th Tank Corps\nThe 10th Tank Corps was a tank corps of the Red Army, formed twice.\n\n## First Formation\nIn May–June 1938, ...",
  "id": "enwiki/32552979",
  "wikiname": "enwiki",
  "page_id": 32552979,
  "title": "10th Tank Corps",
  "url": "https://en.wikipedia.org/wiki/10th_Tank_Corps",
  "date_modified": "2023-07-26T12:32:03Z",
  "in_language": "en",
  "wikidata_id": "Q12061605",
  "bytes_html": 115017,
  "wikitext": "{{short description|Tank corps of the Soviet military}}\n\n{{Infobox military unit...",
  "version": 1167219203,
  "infoboxes": "[{\"title\": \"10th Tank Corps\", \"data\": {\"Active\": \"...\"}}]",
  "has_math": false
}
```

### Data Fields

- `text` (string): cleaned, structured article text preserving headings, lists, code/pre blocks, tables and math. Has some markdown formatting (headings, tables, lists)
- `id` (string): dataset-unique identifier; typically `<wikiname>/<page_id>`
- `wikiname` (string): wiki project name, e.g., `enwiki`, `ptwiki`
- `page_id` (int): MediaWiki page identifier
- `title` (string): article title
- `url` (string): canonical article URL
- `date_modified` (string): ISO-8601 timestamp of the last page revision
- `in_language` (string): article language code (e.g., `en`, `pt`)
- `wikidata_id` (string|null): Wikidata QID associated with the page
- `bytes_html` (int): size in bytes of the original HTML body
- `wikitext` (string): original wikitext when available
- `version` (int|string): revision/version identifier of the page
- `infoboxes` (string): JSON-encoded array of extracted infobox objects with title and key-value data
- `has_math` (bool): whether math content was detected on the page

## Data Processing

The full pipeline processing code is available [here](https://huggingface.co/datasets/HuggingFaceFW/finewiki/tree/main/src).
It runs on [datatrove](https://github.com/huggingface/datatrove/). While we tried to offer robust support for most language variants of Wikipedia, the lack of standardization at the HTML level means that extraction might be sub-optimal for some subsets. If this is the case for the languages you are interested in, we recommend adapting our code to address your specific concerns.

### Downloading

We used the Wikimedia Enterprise HTML dump API (`https://api.enterprise.wikimedia.com/v2/snapshots`) and downloaded main-namespace (NS0) snapshots for the different language versions of Wikipedia. We intentionally relied on pre-rendered HTML over the more commonly used wikitext/markdown dumps: wikitext often encodes templates and formatting as parser functions/macros, which makes large sections of wiki pages harder to reconstruct faithfully, whereas the Enterprise HTML already expands those structures. Snapshots from August 2025 were used. We record rich per-page attributes (IDs, titles, URLs, language, version, timestamps, Wikidata IDs) as part of the metadata.

### Extraction

We heavily adapted [mwparserfromhtml](https://pypi.org/project/mwparserfromhtml/) to parse the HTML content into a clean, structured text representation that preserves meaningful formatting. Redirect and disambiguation pages are removed reliably (via redirect markers in wikitext/HTML and disambiguation signals, including Wikidata IDs and page-props). Reference-like sections, which are filled with unnatural non-article content (e.g., "References", "Notes", "External links", localized per language), are excluded using a curated heading list and structural cues (reference list containers), so citations/notes are dropped without harming the main body. Visual/navigation boilerplate (ToC, navboxes, messageboxes, authority control, categories) is filtered out, while infoboxes are carefully extracted into the metadata as key-value structured data that can be useful for knowledge search applications.
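
As an illustration of how the extracted infoboxes might be consumed, here is a minimal sketch that decodes the JSON-encoded `infoboxes` field of one record and flattens it into `(infobox title, key, value)` triples. The helper names `parse_infoboxes` and `infobox_facts` are hypothetical, the sample record is hand-written to mirror the truncated example under "Data Instances", and the code assumes each infobox's `data` is a flat key-value mapping, which may not hold for every page.

```python
import json


def parse_infoboxes(record: dict) -> list:
    """Decode the JSON-encoded `infoboxes` string of one record into a list of dicts.

    A missing or empty field is treated as "no infoboxes" (an assumption of this sketch).
    """
    raw = record.get("infoboxes") or "[]"
    return json.loads(raw)


def infobox_facts(record: dict) -> list:
    """Flatten all infoboxes of a record into (infobox title, key, value) triples."""
    facts = []
    for box in parse_infoboxes(record):
        data = box.get("data") or {}  # assumed to be a flat mapping
        for key, value in data.items():
            facts.append((box.get("title"), key, value))
    return facts


# Hand-written record shaped like the example in "Data Instances" (values truncated there)
sample = {
    "title": "10th Tank Corps",
    "infoboxes": '[{"title": "10th Tank Corps", "data": {"Active": "..."}}]',
}

for fact in infobox_facts(sample):
    print(fact)
```

The same helpers could be applied to records streamed with `datasets` or `datatrove`; triples like these are one possible input for a key-value index in a retrieval setup.
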
We additionally strive to keep math content (and mark pages containing it with a `has_math` flag) as well as tables, where much of the Wikipedia knowledge is contained.

### Filtering

One common issue with low-resource language Wikipedias is the high prevalence of content from other languages, particularly English (often from articles or boilerplate pages copied over from the English Wikipedia). To ensure language quality and consistency, we apply language- and script-aware checks tailored to each wiki. Pages are kept only if their predicted writing system matches the expected scripts for that language. For non-English wikis, pages that are predominantly English above a confidence threshold are removed to reduce cross-language leakage. We also drop ultra-short pages without infoboxes to avoid low-signal content.

## Licensing Information

This dataset contains text from Wikipedia, licensed under Creative Commons Attribution-ShareAlike 4.0 (CC BY-SA 4.0) and also available under GFDL. See Wikipedia's licensing and Terms of Use: https://dumps.wikimedia.org/legal.html

Our processed release is an adaptation of that text and is licensed under CC BY-SA 4.0.

## Citation Information

```bibtex
@dataset{penedo2025finewiki,
  author    = {Guilherme Penedo},
  title     = {FineWiki},
  year      = {2025},
  publisher = {Hugging Face Datasets},
  url       = {https://huggingface.co/datasets/HuggingFaceFW/finewiki},
  urldate   = {2025-10-20},
  note      = {Source: Wikimedia Enterprise Snapshot API (https://api.enterprise.wikimedia.com/v2/snapshots). Text licensed under CC BY-SA 4.0 with attribution to Wikipedia contributors.}
}
```

Provider:
maas
Created:
2025-11-07