hotchpotch/fineweb-2-edu-japanese
Source: Hugging Face. Updated: 2025-05-09. Indexed: 2025-08-30.
Download link:
https://hf-mirror.com/datasets/hotchpotch/fineweb-2-edu-japanese
Description:
The FineWeb2 Edu Japanese dataset contains approximately 120 million educational texts filtered from FineWeb2, totaling about 89.3 billion tokens. It provides four subsets: default (approximately 120 million texts), sample_10BT (a random sample of about 10 billion tokens drawn from the default subset), small_tokens (only texts of 512 tokens or fewer), and small_tokens_cleaned (a cleaned version of small_tokens with web-specific noise removed). Noise removal and text classification were performed with dedicated models, and the data is split into training and test sets.
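The subsets described above can be read with the Hugging Face `datasets` library. The following is a minimal sketch, not the dataset author's documented usage: it assumes the four subset names map directly to `datasets` configuration names and that the training split is called `train`. The helper names `check_subset`, `stream_subset`, and `preview_records` are hypothetical and introduced here for illustration only. Streaming mode is used so that nothing is downloaded in full before iteration starts.

```python
from itertools import islice

# Repository ID and subset names as given in the listing above.
REPO_ID = "hotchpotch/fineweb-2-edu-japanese"
SUBSETS = ("default", "sample_10BT", "small_tokens", "small_tokens_cleaned")


def check_subset(subset):
    """Return the subset name unchanged, or raise if it is not one of the four listed."""
    if subset not in SUBSETS:
        raise ValueError(f"unknown subset {subset!r}; expected one of {SUBSETS}")
    return subset


def stream_subset(subset="sample_10BT", split="train"):
    """Open one subset in streaming mode.

    Assumption: subset names are valid `datasets` config names and the
    split is named "train"; the listing only says "training and test sets".
    """
    # Imported lazily so the helpers above work without the package installed.
    from datasets import load_dataset  # pip install datasets

    return load_dataset(REPO_ID, name=check_subset(subset), split=split, streaming=True)


def preview_records(subset="sample_10BT", n=3):
    """Print the first n records of a subset (requires network access)."""
    for record in islice(stream_subset(subset), n):
        # Column names are not stated in the listing, so print the whole record.
        print(record)
```

Calling `preview_records("small_tokens_cleaned", 3)` would print three records from the cleaned short-text subset. For quick experiments, `sample_10BT` or one of the `small_tokens` subsets is likely a more practical starting point than `default`, given the stated size of about 89.3 billion tokens.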
Provider:
hotchpotch



