r-three/tokenizer_training
Source: Hugging Face. Updated 2025-08-07. Indexed 2025-08-30.
Download link:
https://hf-mirror.com/datasets/r-three/tokenizer_training
Resource description:
The dataset contains text data in several languages, split into training and validation sets. Its subsets are stack_edu (possibly educational text data), fas_Arab (Persian in Arabic script, going by the language code), ita_Latn (Italian in Latin script), tur_Latn (Turkish in Latin script), cmn_Hani (Mandarin Chinese in Han script, which covers both Simplified and Traditional characters), and fw_edu (possibly educational text data).
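The language subset names appear to follow the common convention of an ISO 639-3 language code joined to an ISO 15924 script code with an underscore. The sketch below decodes the names listed above under that assumption. The lookup tables and the `describe_subset` helper are hypothetical and cover only the codes mentioned here; they are not part of the dataset or of any library.

```python
# Illustrative lookup tables; only the codes mentioned in the description are covered.
LANGUAGES = {
    "fas": "Persian",
    "ita": "Italian",
    "tur": "Turkish",
    "cmn": "Mandarin Chinese",
}
SCRIPTS = {"Arab": "Arabic", "Latn": "Latin", "Hani": "Han"}


def describe_subset(name: str) -> str:
    """Return a readable label for a subset name, or flag it as not language-coded."""
    lang, sep, script = name.partition("_")
    if sep and lang in LANGUAGES and script in SCRIPTS:
        return f"{LANGUAGES[lang]} ({SCRIPTS[script]} script)"
    return "not a language_script code"


for subset in ["stack_edu", "fas_Arab", "ita_Latn", "tur_Latn", "cmn_Hani", "fw_edu"]:
    print(subset, "->", describe_subset(subset))
```

Names such as stack_edu and fw_edu do not match the pattern, so the helper reports them as not language-coded rather than guessing at their contents.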
Provider:
r-three



