qingy2024/qwark-corpus
Source: Hugging Face | Updated: 2025-01-09 | Indexed: 2025-02-15
Download link:
https://hf-mirror.com/datasets/qingy2024/qwark-corpus
Description:
The Qwark Corpus contains over 1.3 billion high-quality tokens of web text, built from the fineweb-edu-dedup subset of HuggingFaceTB/smollm-corpus and from FineMath-4+. The samples were rigorously filtered for quality and diversity, and the dataset is intended for natural language processing research and applications.
Provider:
qingy2024



