common-pile/wikiteam
Source: Hugging Face. Updated 2025-06-06. Indexed 2025-07-05.
Download link:
https://hf-mirror.com/datasets/common-pile/wikiteam
Description:
The Wikiteam dataset consists of archives of wikis on the internet that run the MediaWiki software but are not managed by the Wikimedia Foundation. The archives are created by wikiteam, a group of volunteers, and uploaded to the Internet Archive under CC BY, CC BY-SA, or public domain licenses. The dataset includes the latest dumps from approximately 330,000 wikis, converted from wikitext to plain text. During preprocessing, math formulas in the wikitext are converted to LaTeX and HTML tags are removed. Documents containing large amounts of license laundering, such as collections of song lyrics or transcripts, are excluded. Licensing information for each document is available in its metadata field.
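Since each document carries its licensing information in a metadata field, a downstream user may want to keep only documents under a particular license. The following is a minimal sketch of that idea, not the dataset's documented schema: the record layout (a "text" string plus a "metadata" dict with a "license" key), the license strings, and the helper name filter_by_license are all assumptions made for illustration. Check the actual field names on the dataset page before relying on them.

```python
from typing import Dict, Iterable, Iterator, Set

# Assumed record layout (hypothetical, not confirmed by the dataset card):
#   {"text": "...", "metadata": {"license": "CC BY-SA"}}


def filter_by_license(records: Iterable[Dict], allowed: Set[str]) -> Iterator[Dict]:
    """Yield only the records whose metadata license is in `allowed`.

    Comparison is case-insensitive; records without a license are skipped.
    """
    allowed_normalized = {name.casefold() for name in allowed}
    for record in records:
        metadata = record.get("metadata") or {}
        license_name = metadata.get("license") or ""
        if license_name.casefold() in allowed_normalized:
            yield record


# Tiny in-memory sample standing in for real documents.
sample_records = [
    {"text": "Page about castles.", "metadata": {"license": "CC BY-SA"}},
    {"text": "Page about rivers.", "metadata": {"license": "CC BY"}},
    {"text": "Page with no license recorded.", "metadata": {}},
]

kept = list(filter_by_license(sample_records, {"cc by-sa"}))
```

Here kept contains only the first sample record. The same generator could be applied to documents streamed from the real dataset once the true field names are known.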
Provider:
common-pile



