kp7742/YALM-pretrain6-62M

Name: kp7742/YALM-pretrain6-62M
Creator: kp7742
Published: 2025-07-13 20:49:03
License: No description available

Source: Hugging Face | Updated: 2025-07-13 | Indexed: 2025-10-25

Download link:

https://hf-mirror.com/datasets/kp7742/YALM-pretrain6-62M

Description:

YALM Pretraining Data - 6 is a mixture of English, Hindi, math, and Python code collected from various sources for language modeling and for the development of YALM (Yet Another Language Model). It contains 62M samples in total (~42B tokens, with sample packing at a 2048-token context). The test split includes 10k samples.
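
The description mentions sample packing at a 2048-token context but does not say how the packing was done. As a rough illustration only, the sketch below shows one common approach: concatenate tokenized samples with a separator token and cut the stream into fixed-length blocks. The function name `pack_samples`, the `eos_id` separator value, and the choice to drop the trailing partial block are all assumptions made for this example, not details taken from the dataset.

```python
from typing import Iterable, Iterator, List


def pack_samples(
    token_seqs: Iterable[List[int]],
    context_len: int = 2048,
    eos_id: int = 0,  # hypothetical separator id; the real tokenizer's id is not stated
) -> Iterator[List[int]]:
    """Greedy concatenate-and-chunk packing (an assumed scheme, not the dataset's confirmed one)."""
    buffer: List[int] = []
    for seq in token_seqs:
        # Append the sample followed by a separator so sample boundaries stay visible.
        buffer.extend(seq)
        buffer.append(eos_id)
        # Emit as many full blocks as the buffer currently holds.
        while len(buffer) >= context_len:
            yield buffer[:context_len]
            buffer = buffer[context_len:]
    # Any leftover tokens shorter than context_len are dropped in this sketch.


# Tiny demonstration with a context length of 4 instead of 2048.
demo_blocks = list(pack_samples([[1, 2, 3], [4, 5], [6, 7, 8, 9]], context_len=4))
print(demo_blocks)
```

With this scheme every emitted block has exactly `context_len` tokens, and a single block may contain pieces of several source samples. Other packing strategies (for example, ones that never split a sample across blocks) would behave differently.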

Provided by:

kp7742
