common-pile/wikiteam_filtered
Source: Hugging Face. Updated: 2025-06-06. Indexed: 2025-07-05.
Download link:
https://hf-mirror.com/datasets/common-pile/wikiteam_filtered
Description:
The Wikiteam dataset is a collection of text from wikis archived by volunteers. These wikis run the MediaWiki software but are not managed by the Wikimedia Foundation. The dataset covers approximately 330,000 wikis. Text was converted from wikitext to plain text, with formatting and mathematical expressions processed during conversion, and documents containing significant amounts of license laundering were removed. The result is over 26 million documents, stored in UTF-8 encoded files. This is the filtered version of the dataset; it is suitable for research and projects, and citation information is provided.
Provider:
common-pile



