wikisource

Name: wikisource
Creator: maas
Published: 2025-12-05 16:21:10
License: No description available

ModelScope community; updated 2025-12-05; indexed 2025-01-25

Download link:

https://modelscope.cn/datasets/wikimedia/wikisource


Resource description:

# Dataset Card for Wikimedia Wikisource

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://dumps.wikimedia.org
- **Repository:**
- **Paper:**
- **Leaderboard:**
- **Point of Contact:**

### Dataset Summary

Wikisource dataset containing cleaned articles in all languages.

The dataset is built from the Wikisource dumps (https://dumps.wikimedia.org/), with one subset per language, each containing a single train split.

Each example contains the content of one full Wikisource text, cleaned to strip markup and unwanted sections (references, etc.).

All language subsets have already been processed for a recent dump, and you can load them by date and language like this:

```python
from datasets import load_dataset

ds = load_dataset("wikimedia/wikisource", "20231201.en")
```

### Supported Tasks and Leaderboards

The dataset is generally used for Language Modeling.
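The config name passed to `load_dataset` in the summary above combines a dump date and a wiki code (`20231201.en`). A minimal sketch of a helper that builds such a name follows. The function name `wikisource_config` and its validation rules (an 8-digit date, a lowercase wiki code with optional hyphens) are assumptions made for illustration; the card does not state which dates or codes are actually available.

```python
import re


def wikisource_config(dump_date: str, wiki_code: str) -> str:
    """Build a config name such as "20231201.en".

    Hypothetical helper: the "<date>.<code>" pattern is inferred from the
    single example on the card, and the checks below are assumptions.
    """
    # Assumed date format: YYYYMMDD, as in the card's "20231201".
    if not re.fullmatch(r"\d{8}", dump_date):
        raise ValueError(f"expected a YYYYMMDD dump date, got {dump_date!r}")
    # Assumed code format: lowercase letters, optionally hyphen-separated.
    if not re.fullmatch(r"[a-z]+(-[a-z]+)*", wiki_code):
        raise ValueError(f"unexpected wiki code: {wiki_code!r}")
    return f"{dump_date}.{wiki_code}"
```

The result can then be passed as the second argument, for example `load_dataset("wikimedia/wikisource", wikisource_config("20231201", "en"))`.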
### Languages

You can find the list of all languages here: https://meta.wikimedia.org/wiki/Wikisource#List_of_Wikisources

Note that the wiki code "www" contains multilingual texts. You can find the list of languages at the "www" Multilingual Wikisource here: https://wikisource.org/wiki/Wikisource:Languages

## Dataset Structure

### Data Instances

An example looks as follows:

```
{'id': '36',
 'url': 'https://ca.wikisource.org/wiki/Comunicat%20de%20Berl%C3%ADn',
 'title': 'Comunicat de Berlín',
 'text': "\n\nPreàmbul \nEl 19 de juny de 1999, un any després de la Declaració de la Sorbona,..."
}
```

### Data Fields

The data fields are the same among all language configurations:

- `id` (`str`): ID of the text.
- `url` (`str`): URL of the text.
- `title` (`str`): Title of the text.
- `text` (`str`): Content of the text.

### Data Splits

All language configurations contain a single `train` split.

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

The dataset is built from the Wikisource dumps: https://dumps.wikimedia.org

You can find the full list of languages and dates here: https://dumps.wikimedia.org/backup-index.html

The articles have been parsed using the [`mwparserfromhell`](https://mwparserfromhell.readthedocs.io) tool.

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

Copyright licensing information: https://dumps.wikimedia.org/legal.html

All original textual content is licensed under the [GNU Free Documentation License](https://www.gnu.org/licenses/fdl-1.3.html) (GFDL) and the [Creative Commons Attribution-Share-Alike 3.0 License](https://creativecommons.org/licenses/by-sa/3.0/). Some text may be available only under the Creative Commons license; see the Wikimedia Foundation's [Terms of Use](https://foundation.wikimedia.org/wiki/Policy:Terms_of_Use) for details. Text written by some authors may be released under additional licenses or into the public domain.

### Citation Information

```
@ONLINE{wikidump,
    author = "Wikimedia Foundation",
    title  = "Wikimedia Downloads",
    url    = "https://dumps.wikimedia.org"
}
```

### Contributions

Thanks to [@albertvillanova](https://huggingface.co/albertvillanova) for adding this dataset.
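As an illustration of the record layout described under Data Instances and Data Fields, the sketch below takes the example record from the card and recovers the wiki code and page title from its `url` field. The helper name `describe_record` is hypothetical, and reading the wiki code from the first host label is an assumption that fits `ca.wikisource.org` but may not fit the multilingual `wikisource.org` site.

```python
from urllib.parse import unquote, urlsplit

# The example record from the card; the text is truncated there as well.
record = {
    "id": "36",
    "url": "https://ca.wikisource.org/wiki/Comunicat%20de%20Berl%C3%ADn",
    "title": "Comunicat de Berlín",
    "text": "\n\nPreàmbul \nEl 19 de juny de 1999, un any després de la Declaració de la Sorbona,...",
}


def describe_record(rec: dict) -> dict:
    """Hypothetical helper: derive the wiki code and page title from `url`."""
    parts = urlsplit(rec["url"])
    # Assumption: the first host label is the wiki code ("ca" here).
    wiki_code = parts.netloc.split(".")[0]
    # Drop the "/wiki/" prefix and decode percent-escapes (UTF-8 by default).
    page_title = unquote(parts.path.split("/wiki/", 1)[-1])
    return {"id": rec["id"], "wiki_code": wiki_code, "page_title": page_title}


info = describe_record(record)
```

For this record, the decoded page title agrees with the `title` field, which gives a cheap consistency check when iterating over a subset.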


Provider: maas

Created: 2025-01-20
