hiimivantang/chunked-enwiki-ns0-20250301-enterprise-html
Source: Hugging Face. Updated: 2025-03-13. Indexed: 2025-04-12.
Download link:
https://hf-mirror.com/datasets/hiimivantang/chunked-enwiki-ns0-20250301-enterprise-html
Summary:
The dataset consists of three files: 20250301.jsonl contains preprocessed text chunks from the 20250301 Wikipedia HTML dump; dead_letter_queue.jsonl contains articles that could not be processed because of formatting issues or missing tags; en_redirection_map.pkl is a serialized redirection map, kept so that the preprocessing script does not have to rebuild it on every run.
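The files described above could be read along the following lines. This is only a sketch under stated assumptions: the listing does not document the record schema, so the reader yields each line as a plain dictionary without naming any fields, and it assumes from the .pkl extension that the redirection map is a Python pickle. The helper names `iter_jsonl` and `load_redirection_map` are hypothetical and not part of the dataset.

```python
import json
import pickle
from typing import Any, Dict, Iterator


def iter_jsonl(path: str) -> Iterator[Dict[str, Any]]:
    """Yield one parsed JSON object per non-empty line of a .jsonl file."""
    with open(path, "r", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # skip blank lines rather than failing on them
                yield json.loads(line)


def load_redirection_map(path: str) -> Any:
    """Load a pickled object (assumed format of en_redirection_map.pkl).

    Unpickling can execute arbitrary code, so only load files you trust.
    """
    with open(path, "rb") as fh:
        return pickle.load(fh)
```

Streaming line by line keeps memory use flat, which matters if 20250301.jsonl is large. For example, `next(iter_jsonl("20250301.jsonl"))` would return the first record so its keys can be inspected before writing any field-specific code; the same reader applies to dead_letter_queue.jsonl.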
Provided by:
hiimivantang



