nkandpa2/wiki-dolma
Source: Hugging Face. Updated 2024-10-31; indexed 2025-10-25.
Download link:
https://hf-mirror.com/datasets/nkandpa2/wiki-dolma
Description:
---
configs:
- config_name: default
  data_files:
  - split: train
    path:
    - "wiki/archive/v3/documents/*.jsonl.gz"
- config_name: wikiteam
  data_files:
  - split: train
    path:
    - "wiki/archive/v3/documents/*.jsonl.gz"
- config_name: wikimedia
  data_files:
  - split: train
    path:
    - "wiki/dump/v1/documents/*.jsonl.gz"
---
# Wiki Datasets
Preprocessed versions of openly licensed wiki dumps collected by wikiteam and hosted on the Internet Archive.
## Version Descriptions
* `raw`: The original wikitext
* `v0`: Wikitext parsed to plain text with `wtf_wikipedia`, with math templates converted to LaTeX.
* `v1`: Removal of some HTML snippets left behind during parsing.
* `v2`: Removal of documents that are essentially just transcripts of non-openly licensed works.
* `v3`: Removal of documents that are essentially just lyrics of non-openly licensed works.
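The configs above point at gzip-compressed JSON Lines shards (`*.jsonl.gz`) under a versioned directory. As a minimal sketch of reading such shards from a local copy, the helper below uses only the Python standard library. The function name `iter_documents` is hypothetical, and the `text` field used in the usage example is an assumption about the record schema rather than something this card states.

```python
import glob
import gzip
import json
from typing import Iterator


def iter_documents(pattern: str) -> Iterator[dict]:
    """Yield one dict per JSON line from every gzip shard matching `pattern`."""
    for shard in sorted(glob.glob(pattern)):
        with gzip.open(shard, "rt", encoding="utf-8") as handle:
            for line in handle:
                if line.strip():  # skip blank lines
                    yield json.loads(line)


if __name__ == "__main__":
    # Path taken from the `default` config; assumes the repository is cloned locally.
    for doc in iter_documents("wiki/archive/v3/documents/*.jsonl.gz"):
        # `text` is an assumed field name; inspect a record to confirm the schema.
        print(str(doc.get("text", ""))[:80])
        break
```

Streaming records one at a time avoids holding a whole shard in memory, which may matter for the larger dumps.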
Note: The `wikiteam3` scraping tool, used for most of the dumps, doesn't format edits to pages as `revisions` in the XML output; instead, it creates new `pages`. Thus some documents in this dataset are earlier versions of various pages. For large edits this duplication can be beneficial, but for small edits it results in near-duplicate documents. A fuzzy deduplication filter should be applied before using this dataset.
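One possible way to filter near-duplicate page versions is Jaccard similarity over word shingles. The sketch below is illustrative only, not the method used to build this dataset: the function names, the shingle size of 5, and the 0.8 threshold are all assumptions. It keeps the first document seen in each near-duplicate group and compares every document against all kept ones, so it is quadratic and suited to small samples; a MinHash/LSH index would be the usual choice at full scale.

```python
import re
from typing import Iterable, List, Set


def shingles(text: str, size: int = 5) -> Set[str]:
    """Return the set of overlapping `size`-word shingles in `text`."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < size:
        return {" ".join(words)}
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def drop_near_duplicates(texts: Iterable[str], threshold: float = 0.8) -> List[str]:
    """Keep a text only if it is below `threshold` similarity to every kept text."""
    kept: List[str] = []
    kept_shingles: List[Set[str]] = []
    for text in texts:
        current = shingles(text)
        if all(jaccard(current, seen) < threshold for seen in kept_shingles):
            kept.append(text)
            kept_shingles.append(current)
    return kept
```

Because the first version encountered is the one kept, sorting documents so that the preferred revision comes first (for example, the longest) would change which copy survives.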
Provider:
nkandpa2



