OpenTextVault/OpenTextVault_FromTheNetwork
Source: Hugging Face. Updated 2025-11-09. Indexed 2025-11-30.
Download link:
https://hf-mirror.com/datasets/OpenTextVault/OpenTextVault_FromTheNetwork
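For a corpus of this size, streaming a few records is usually more practical than a full download. The following is a minimal sketch, not an official loader. It assumes the `datasets` package is installed, that the mirror honours the `HF_ENDPOINT` environment variable used by `huggingface_hub`, and that the repository exposes a split named "train"; the listing confirms neither the split nor the column names. The helper names `mirror_dataset_url` and `stream_records` are hypothetical.

```python
import os

REPO_ID = "OpenTextVault/OpenTextVault_FromTheNetwork"
MIRROR_ENDPOINT = "https://hf-mirror.com"


def mirror_dataset_url(repo_id: str, endpoint: str = MIRROR_ENDPOINT) -> str:
    """Build the dataset page URL on the mirror from a repository id."""
    return f"{endpoint.rstrip('/')}/datasets/{repo_id}"


def stream_records(repo_id: str = REPO_ID, split: str = "train", limit: int = 3):
    """Yield the first `limit` records without downloading the whole corpus.

    Assumptions: `datasets` is installed and the repository has a split
    named "train" (not confirmed by the listing).
    """
    # HF_ENDPOINT only takes effect if it is set before huggingface_hub
    # is imported anywhere in the process.
    os.environ.setdefault("HF_ENDPOINT", MIRROR_ENDPOINT)
    from datasets import load_dataset

    dataset = load_dataset(repo_id, split=split, streaming=True)
    for index, record in enumerate(dataset):
        if index >= limit:
            break
        yield record


if __name__ == "__main__":
    print(mirror_dataset_url(REPO_ID))
    for record in stream_records():
        # Column names are unknown here, so print the keys rather than guess.
        print(sorted(record.keys()))
```

If the repository defines several configurations, `load_dataset` will ask for a configuration name; in that case pass it as the second positional argument.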
Description:
OpenTextVault_FromTheNetwork is an open, legally compliant raw-text dataset described as high-quality, large-scale, and diverse. The listing gives two sizes for its unlabeled plain-text tokens: tens of billions in the Chinese summary (数百亿) and hundreds of billions in the English summary. The dataset is suited to natural language processing work such as training and fine-tuning large language models, and to linguistic research. In particular, it includes a large volume of high-quality unlabeled Chinese text drawn from news articles, literary works, online forums, and social media discussions, which makes it especially valuable for training and fine-tuning Chinese or multilingual large language models.
Provider:
OpenTextVault



