kp7742/YALM-pretrain4-128M

Name: kp7742/YALM-pretrain4-128M
Creator: kp7742
Published: 2025-06-07 11:10:39
License: No description provided

Source: Hugging Face. Updated: 2025-06-07. Indexed: 2025-08-30.

Download link:

https://hf-mirror.com/datasets/kp7742/YALM-pretrain4-128M

Description:

The YALM Pretraining Data - 4 is a mix of English, Hindi, Math, and Python Code intended for language modeling tasks and the development of YALM (Yet Another Language Model). It contains a total of 128M samples (~256B tokens at 2048 context). The test split includes 2k samples. The dataset is composed of 70% English (~89.60M), 20% Hindi (~25.60M), 5% Math (~6.40M), and 5% Python Code (~6.40M).
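The per-subset sample counts quoted above follow directly from the mix percentages, and can be checked with a few lines of arithmetic. The sketch below is illustrative only: the constant and function names are hypothetical, and the token figure assumes every sample fills the full 2048-token context, which the description does not state explicitly.

```python
# Figures quoted in the dataset description.
TOTAL_SAMPLES = 128_000_000  # "128M samples", read here as decimal millions (assumption)
CONTEXT_LENGTH = 2048        # tokens per sample, assuming every sample is full length

# Mix percentages from the description, kept as integers so the arithmetic is exact.
MIX_PERCENT = {"English": 70, "Hindi": 20, "Math": 5, "Python Code": 5}


def samples_per_subset(total, mix_percent):
    """Return the number of samples each subset contributes to the total."""
    return {name: total * pct // 100 for name, pct in mix_percent.items()}


def token_upper_bound(total, context_length):
    """Return the token count if every sample uses the whole context window."""
    return total * context_length


counts = samples_per_subset(TOTAL_SAMPLES, MIX_PERCENT)
tokens = token_upper_bound(TOTAL_SAMPLES, CONTEXT_LENGTH)

for name, n in counts.items():
    print(f"{name}: {n / 1e6:.2f}M samples")
print(f"Token upper bound: {tokens / 1e9:.1f}B")
```

With decimal units the product 128M x 2048 comes to about 262B tokens. The quoted ~256B is consistent with reading "M" and "B" as binary multiples (2^20 and 2^30), though which convention the dataset author used is not stated.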

Provided by:

kp7742
