RadiCat/wiki_pretrain
Source: Hugging Face. Updated: 2025-08-19. Indexed: 2025-10-25.
Download link:
https://hf-mirror.com/datasets/RadiCat/wiki_pretrain
Description:
This dataset contains Chinese, English, and Japanese encyclopedia and Wikipedia data for pre-training the LLM Quirel model. The Chinese portion consists of Baidu Baike (Baidu Encyclopedia) data and Wikipedia data, both converted to Markdown format. Both sources have undergone some cleaning, such as removing paragraphs that contain too many web links and filtering out texts that are too short.
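The two cleaning steps named in the description can be illustrated with a minimal sketch. This is not the dataset authors' pipeline: the function name `clean_article`, the URL pattern, and both thresholds (`max_links_per_paragraph`, `min_chars`) are assumptions chosen only for illustration, since the listing does not state the actual rules or cutoffs.

```python
import re

# Assumed definition of a "web link": any http(s) URL. The real pipeline may differ.
URL_PATTERN = re.compile(r"https?://\S+")


def clean_article(text, max_links_per_paragraph=2, min_chars=50):
    """Drop link-heavy paragraphs, then reject the article if it is too short.

    Both thresholds are hypothetical; the listing gives no concrete values.
    Returns the cleaned text, or None if the article should be filtered out.
    """
    # Treat blank-line-separated blocks as paragraphs (Markdown convention).
    paragraphs = [p.strip() for p in text.split("\n\n")]

    # Keep non-empty paragraphs whose link count does not exceed the limit.
    kept = [
        p for p in paragraphs
        if p and len(URL_PATTERN.findall(p)) <= max_links_per_paragraph
    ]

    cleaned = "\n\n".join(kept)

    # Filter out articles that are too short after paragraph removal.
    return cleaned if len(cleaned) >= min_chars else None
```

Counting characters rather than whitespace-separated words is a deliberate choice here, because Chinese and Japanese text is not space-delimited.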
Provider:
RadiCat



