kp7742/YALM-pretrain6-62M
Hugging Face | Updated 2025-07-13 | Indexed 2025-10-25
Download link:
https://hf-mirror.com/datasets/kp7742/YALM-pretrain6-62M
Description:
YALM Pretraining Data - 6 is a mixture of English, Hindi, math, and Python code collected from various sources for language modeling and for the development of YALM (Yet Another Language Model). It contains 62M samples in total (~42B tokens, with sample packing at a context length of 2048). The test split contains 10k samples.
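The listing does not say how the samples were packed. As a rough illustration only, the sketch below shows one common way to do sample packing: concatenate tokenized documents, separate them with an end-of-sequence token, and cut the stream into fixed-length blocks. The function name `pack_samples`, the `eos_id` value, and the choice to drop the trailing partial block are all assumptions made for this example, not details taken from the dataset.

```python
from typing import Iterable, Iterator, List


def pack_samples(
    token_seqs: Iterable[List[int]],
    context_len: int = 2048,  # context length stated in the description
    eos_id: int = 0,          # hypothetical end-of-sequence token id
) -> Iterator[List[int]]:
    """Concatenate token sequences and yield fixed-length blocks.

    Assumption: documents are joined with an EOS token and any trailing
    tokens that do not fill a whole block are dropped.
    """
    buffer: List[int] = []
    for seq in token_seqs:
        buffer.extend(seq)
        buffer.append(eos_id)  # mark the document boundary
        # Emit as many full blocks as the buffer currently holds.
        while len(buffer) >= context_len:
            yield buffer[:context_len]
            buffer = buffer[context_len:]


if __name__ == "__main__":
    # Tiny demo with a context length of 4 instead of 2048.
    docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
    for block in pack_samples(docs, context_len=4):
        print(block)
```

With packing of this kind, every emitted sample has exactly `context_len` tokens, so a single sample may contain pieces of several source documents.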
Provider:
kp7742



