
OpenTextVault/OpenTextVault_FromTheNetwork

Source: Hugging Face. Updated: 2025-11-09. Indexed: 2025-11-30.
Download link:
https://hf-mirror.com/datasets/OpenTextVault/OpenTextVault_FromTheNetwork
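One possible way to read a few records from the link above is sketched below. It is a minimal sketch under stated assumptions, not a documented procedure for this dataset: it assumes the Hugging Face `datasets` library is installed, that the mirror host is reachable through the `HF_ENDPOINT` environment variable, and that the dataset exposes a split named `train`. The helper names `repo_id_from_url`, `endpoint_from_url`, and `stream_examples` are hypothetical and exist only for illustration.

```python
import os
from itertools import islice
from urllib.parse import urlparse

# URL taken from the listing above.
MIRROR_URL = "https://hf-mirror.com/datasets/OpenTextVault/OpenTextVault_FromTheNetwork"


def repo_id_from_url(url: str) -> str:
    """Extract 'owner/name' from a '/datasets/owner/name' style URL."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) < 3 or parts[0] != "datasets":
        raise ValueError(f"not a dataset URL: {url}")
    return f"{parts[1]}/{parts[2]}"


def endpoint_from_url(url: str) -> str:
    """Return the scheme and host of the URL, e.g. 'https://hf-mirror.com'."""
    parsed = urlparse(url)
    return f"{parsed.scheme}://{parsed.netloc}"


def stream_examples(url: str, split: str = "train", limit: int = 3):
    """Stream the first `limit` records without downloading the whole dataset.

    Assumptions: the `datasets` package is installed and the split name is
    'train'; neither is stated on the listing page.
    """
    # HF_ENDPOINT is read when the Hugging Face libraries are first imported,
    # so it is set before the import. setdefault keeps any value already set.
    os.environ.setdefault("HF_ENDPOINT", endpoint_from_url(url))
    from datasets import load_dataset

    ds = load_dataset(repo_id_from_url(url), split=split, streaming=True)
    return list(islice(ds, limit))


if __name__ == "__main__":
    print(repo_id_from_url(MIRROR_URL))
    print(endpoint_from_url(MIRROR_URL))
```

Streaming is used here because the description reports a very large corpus, so fetching everything up front may be impractical. Calling `stream_examples(MIRROR_URL)` requires network access and may fail if the split name or file layout differs from these assumptions.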
Description:
OpenTextVault_FromTheNetwork is an open, legally compliant raw text dataset that is high-quality, large-scale, and diverse. The Chinese description gives its size as tens of billions of unlabeled plain-text tokens, while the English description gives hundreds of billions. It is suited to natural language processing tasks such as training and fine-tuning large language models, as well as to linguistic research. In particular, the dataset includes a large volume of high-quality unlabeled Chinese text drawn from news articles, literary works, online forums, and social media discussions, which makes it especially valuable for training and fine-tuning Chinese or multilingual large language models.
Provider:
OpenTextVault