nkandpa2/wiki-dolma

Name: nkandpa2/wiki-dolma
Creator: nkandpa2
Published: 2024-10-31 18:47:45
License: No description available

Source: Hugging Face. Updated: 2024-10-31. Indexed: 2025-10-25.

Download link:

https://hf-mirror.com/datasets/nkandpa2/wiki-dolma


Description:

---
configs:
- config_name: default
  data_files:
  - split: train
    path:
    - "wiki/archive/v3/documents/*.jsonl.gz"
- config_name: wikiteam
  data_files:
  - split: train
    path:
    - "wiki/archive/v3/documents/*.jsonl.gz"
- config_name: wikimedia
  data_files:
  - split: train
    path:
    - "wiki/dump/v1/documents/*.jsonl.gz"
---

# Wiki Datasets

Preprocessed versions of openly licensed wiki dumps collected by wikiteam and hosted on the Internet Archive.

## Version Descriptions

* `raw`: The original wikitext.
* `v0`: Wikitext parsed to plain text with `wtf_wikipedia`, with math templates converted to LaTeX.
* `v1`: Removal of some HTML snippets left behind during parsing.
* `v2`: Removal of documents that are basically just transcripts of non-openly licensed things.
* `v3`: Removal of documents that are basically lyrics for non-openly licensed things.

Note: The `wikiteam3` scraping tool, used for most of the dumps, doesn't format edits to pages as `revisions` in the XML output; instead, it creates new `pages`. Thus some documents in this dataset are earlier versions of various pages. For large edits this duplication can be beneficial, but for small edits it results in near-duplicate documents. Some sort of fuzzy deduplication filter should be applied before using this dataset.
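The note above recommends fuzzy deduplication but does not prescribe a method. The following is a minimal sketch of one possible filter, not the dataset authors' pipeline: it compares word-shingle sets with Jaccard similarity and keeps the first document seen in each near-duplicate group. The function names, the `text` field name, the shingle size `k=5`, and the `0.8` threshold are all assumptions chosen for illustration.

```python
import re
from typing import Iterable, Iterator


def shingles(text: str, k: int = 5) -> frozenset:
    """Return the set of k-word shingles of `text` (lowercased, punctuation ignored)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Short documents are represented by a single shingle of all their words.
        return frozenset([tuple(words)])
    return frozenset(tuple(words[i:i + k]) for i in range(len(words) - k + 1))


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity |a & b| / |a | b| of two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def drop_near_duplicates(
    docs: Iterable[dict],
    threshold: float = 0.8,  # assumed cutoff; tune on a sample of the data
    k: int = 5,              # assumed shingle size
    text_key: str = "text",  # assumed field name for the document body
) -> Iterator[dict]:
    """Yield each document unless it is a near-duplicate of one already kept."""
    kept_signatures = []
    for doc in docs:
        signature = shingles(doc[text_key], k)
        # Pairwise comparison against every kept document: quadratic cost.
        if all(jaccard(signature, seen) < threshold for seen in kept_signatures):
            kept_signatures.append(signature)
            yield doc
```

Because every document is compared against all previously kept ones, this sketch only suits small samples; at the scale of the full dump, an approximate scheme such as MinHash with locality-sensitive hashing would be the usual substitute. Note also that keeping the first document seen may retain an earlier revision of a page rather than the latest one.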

Provided by:

nkandpa2
