wikiteam
Source: ModelScope community (updated 2025-12-05, indexed 2025-06-14)
Download link: https://modelscope.cn/datasets/common-pile/wikiteam
# Wikiteam
## Description
Many wikis on the internet are not managed by the Wikimedia Foundation but still run on its MediaWiki software.
Many of these wikis have been archived by wikiteam, a group of volunteers who create unofficial database dumps of wikis and upload them to the Internet Archive.
We downloaded every wikiteam dump available on the Internet Archive as of September 2024 whose metadata indicated that the wiki was licensed under CC BY or CC BY-SA, or had been released into the public domain.
This yielded approximately 330,000 wikis.
When multiple dumps of the same wiki existed, we used the most recent one.
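The selection rule above (keep only openly licensed dumps, then keep the newest dump per wiki) can be sketched as follows. This is a minimal illustration, not the project's actual code: the record fields `wiki`, `date`, and `license`, and the license label strings, are hypothetical stand-ins for whatever the Internet Archive metadata actually contains.

```python
from datetime import date

# Hypothetical license labels; the real metadata values may be spelled differently.
ALLOWED_LICENSES = {"CC BY", "CC BY-SA", "Public Domain"}


def select_dumps(dumps):
    """Keep openly licensed dumps, then only the most recent dump per wiki."""
    latest = {}
    for dump in dumps:
        if dump["license"] not in ALLOWED_LICENSES:
            continue  # skip dumps whose metadata does not show an open license
        current = latest.get(dump["wiki"])
        if current is None or dump["date"] > current["date"]:
            latest[dump["wiki"]] = dump
    return latest


# Made-up records for illustration only.
dumps = [
    {"wiki": "example.wiki", "date": date(2020, 1, 5), "license": "CC BY-SA"},
    {"wiki": "example.wiki", "date": date(2023, 7, 1), "license": "CC BY-SA"},
    {"wiki": "closed.wiki", "date": date(2022, 3, 9), "license": "All rights reserved"},
]
selected = select_dumps(dumps)
```

Here `selected` holds a single entry for `example.wiki`, pointing at the 2023 dump; the unlicensed wiki is dropped entirely.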
We converted wikitext to plain text using [wtf_wikipedia](https://github.com/spencermountain/wtf_wikipedia), after lightly adjusting the formatting to work around a bug that caused errors in section ordering.
Before parsing, we converted wikitext math into LaTeX math using our custom code.
Finally, any remaining HTML tags were removed via regexes.
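As a rough illustration of the two regex-friendly steps above, the sketch below rewrites `<math>` tags as inline LaTeX and strips leftover HTML tags. It is an assumption about what such code could look like, not the custom code the authors used; the `$...$` delimiter choice and both patterns are hypothetical, and the wtf_wikipedia parsing step that sits between them in the real pipeline is omitted.

```python
import re

# Matches <math ...>...</math>, including multi-line formulas.
MATH_TAG = re.compile(r"<math[^>]*>(.*?)</math>", re.DOTALL | re.IGNORECASE)
# Matches opening, closing, and self-closing tags such as <b>, </b>, <br/>.
HTML_TAG = re.compile(r"</?[A-Za-z][^<>]*>")


def math_to_latex(wikitext):
    """Rewrite wikitext <math> tags as inline LaTeX delimited by $...$."""
    # A function replacement avoids backslash-escape handling in the formula.
    return MATH_TAG.sub(lambda m: "$" + m.group(1).strip() + "$", wikitext)


def strip_html_tags(text):
    """Remove any remaining HTML tags, keeping the text between them."""
    return HTML_TAG.sub("", text)


converted = math_to_latex("Energy: <math>E = mc^2</math>")
cleaned = strip_html_tags("a <b>bold</b> word<br/>")
```

Because the tag pattern requires a letter right after `<`, plain comparisons such as `x < y` are left alone. A formula like `$a<b>c$` would still be mangled, which is one reason a regex pass like this is only a last-resort cleanup.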
After preprocessing, we removed documents from wikis that appeared to contain large amounts of license laundering, e.g. those that were collections of song lyrics or transcripts.
Per-document license information is available in the `license` entry of the `metadata` field of each example.
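A short sketch of reading that field: the function below tallies licenses across examples. The toy examples, including the `text` key and the license strings, are made up for illustration; only the `metadata` field and its `license` entry come from the description above.

```python
from collections import Counter


def count_licenses(examples):
    """Tally the per-document license recorded in each example's metadata."""
    return Counter(example["metadata"]["license"] for example in examples)


# Toy examples with made-up text and license strings.
examples = [
    {"text": "First page.", "metadata": {"license": "CC BY-SA"}},
    {"text": "Second page.", "metadata": {"license": "CC BY"}},
    {"text": "Third page.", "metadata": {"license": "CC BY-SA"}},
]
counts = count_licenses(examples)
```

With the Hugging Face `datasets` library, the same function could presumably be applied to a stream obtained from `load_dataset("common-pile/wikiteam", split="train", streaming=True)`. Both the repository id and the split name are assumptions here, inferred from the link to the filtered version rather than stated in this card.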
Code for collecting, processing, and preparing this dataset is available in the [common-pile GitHub repo](https://github.com/r-three/common-pile).
## Dataset Statistics
| Documents | UTF-8 GB |
|-----------|----------|
| 219,139,368 | 437.5 |
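For a sense of scale, the two figures in the table imply an average document size of roughly 2 KB. The calculation below assumes decimal gigabytes (1 GB = 10^9 bytes), which the table does not specify.

```python
documents = 219_139_368
size_gb = 437.5

# Assumption: 1 GB = 10**9 bytes; the table does not say which convention it uses.
avg_bytes = size_gb * 10**9 / documents
print(f"average document size: about {avg_bytes:.0f} bytes")
```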
## License Issues
While we aim to produce datasets with completely accurate licensing information, license laundering and inaccurate metadata can cause us to assign an incorrect license to some documents (for further discussion of this limitation, please see [our paper](https://huggingface.co/papers/2506.05209)). If you believe you have found an instance of incorrect licensing in this dataset, please [start a discussion](https://github.com/r-three/common-pile/discussions/new) on this repository.
## Other Versions
This is the "raw" version of the Wikiteam dataset.
If you are looking for the filtered version used to train [Comma v0.1](https://huggingface.co/common-pile/comma-v0.1), you can find it [here](https://huggingface.co/datasets/common-pile/wikiteam_filtered).
## Citation
If you use this dataset, please cite:
```bibtex
@article{kandpal2025common,
title={{The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text}},
author={Nikhil Kandpal and Brian Lester and Colin Raffel and Sebastian Majstorovic and Stella Biderman and Baber Abbasi and Luca Soldaini and Enrico Shippole and A. Feder Cooper and Aviya Skowron and Shayne Longpre and Lintang Sutawika and Alon Albalak and Zhenlin Xu and Guilherme Penedo and Loubna Ben and Elie Bakouch and John David and Honglu Fan and Dashiell Stander and Guangyu Song and Aaron Gokaslan and John Kirchenbauer and Tom Goldstein and Brian R and Bhavya Kailkhura and Tyler Murray},
journal={arXiv preprint},
year={2025}
}
```
Provider: maas
Created: 2025-06-11



