
nkandpa2/wiki-dolma

Source: Hugging Face. Updated: 2024-10-31. Indexed: 2025-10-25.
Download link:
https://hf-mirror.com/datasets/nkandpa2/wiki-dolma
Description:
---
configs:
- config_name: default
  data_files:
  - split: train
    path:
    - "wiki/archive/v3/documents/*.jsonl.gz"
- config_name: wikiteam
  data_files:
  - split: train
    path:
    - "wiki/archive/v3/documents/*.jsonl.gz"
- config_name: wikimedia
  data_files:
  - split: train
    path:
    - "wiki/dump/v1/documents/*.jsonl.gz"
---

# Wiki Datasets

## Preprocessed versions of openly licensed wiki dumps collected by wikiteam and hosted on the Internet Archive.

## Version Descriptions

* `raw`: The original wikitext.
* `v0`: Wikitext parsed to plain text with `wtf_wikipedia`, with math templates converted to LaTeX.
* `v1`: Removal of some HTML snippets left behind during parsing.
* `v2`: Removal of documents that are basically just transcripts of non-openly licensed works.
* `v3`: Removal of documents that are basically just lyrics for non-openly licensed works.

Note: The `wikiteam3` scraping tool, used for most of the dumps, doesn't format edits to pages as `revisions` in the XML output; instead it creates new `pages`. Thus some documents in this dataset are earlier versions of various pages. For large edits this duplication can be beneficial, but for small edits it results in near-duplicate documents. A fuzzy deduplication filter should be applied before using this dataset.
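
The note above recommends fuzzy deduplication but does not prescribe a method. As one possible illustration, here is a minimal sketch that compares documents by the Jaccard similarity of their word shingles and keeps only the first document of each near-duplicate group. The function names, the shingle size `k=5`, and the `0.8` threshold are all assumptions chosen for this example, not values taken from the dataset authors.

```python
import re
from typing import Iterable, List, Set


def shingles(text: str, k: int = 5) -> Set[str]:
    """Return the set of k-word shingles of a document (lowercased)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Short documents become a single shingle (or none if empty).
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def drop_near_duplicates(
    docs: Iterable[str], threshold: float = 0.8, k: int = 5
) -> List[str]:
    """Keep a document only if it is not too similar to any document kept so far.

    The threshold and k are illustrative assumptions, not tuned values.
    """
    kept: List[str] = []
    kept_shingles: List[Set[str]] = []
    for doc in docs:
        s = shingles(doc, k)
        if any(jaccard(s, t) >= threshold for t in kept_shingles):
            continue  # near-duplicate of an earlier document
        kept.append(doc)
        kept_shingles.append(s)
    return kept
```

This pairwise comparison is quadratic in the number of documents, so it is only practical for small samples; at the scale of a full dump, an approximate scheme such as MinHash with locality-sensitive hashing would be the usual substitute. Note also that the sketch keeps whichever version of a page appears first, which may be an older revision.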
Provider:
nkandpa2