Fredtt3/LLaDA-Sample-10BT

Name: Fredtt3/LLaDA-Sample-10BT
Creator: Fredtt3
Published: 2025-07-16 02:55:12
License: No description available

Hugging Face | Updated: 2025-07-16 | Indexed: 2025-08-30

Download link:

https://hf-mirror.com/datasets/Fredtt3/LLaDA-Sample-10BT

Description:

LLaDA-Sample-10BT is built from the sample-10BT subset of HuggingFaceFW/fineweb and is intended for training LLaDA, a large language diffusion model. The text is tokenized with the GSAI-ML/LLaDA-8B-Instruct tokenizer and split into chunks of at most 4,096 tokens; 1% of the chunks are assigned a random length between 1 and 4,096 tokens. Noisy masking with a noise factor ε = 1×10⁻³ is then applied. Each chunk stores the following PyTorch tensor fields: input_ids, noisy_input_ids, mask, and t (a time scalar). The dataset contains approximately 2,520,000 chunks in 252 .pt files of about 10,000 chunks each; individual files are about 702-708 MB, for a total of approximately 166 GB. The dataset is used to train models in the LLaDA-from-scratch GitHub repository, which contains the complete data pipeline and training scripts.
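The listing does not spell out how the noisy masking is computed, so the following is only a minimal sketch of one plausible reading. It assumes the masking rule from the published LLaDA training recipe: draw t uniformly from [0, 1), mask each token independently with probability (1 − ε)·t + ε, and replace masked positions with a mask token. The function name `make_noisy_chunk`, the mask token id, and the use of plain Python lists in place of PyTorch tensors are illustrative assumptions, not details confirmed by the dataset card.

```python
import random

EPS = 1e-3              # noise factor ε stated in the dataset description
MASK_TOKEN_ID = 126336  # ASSUMPTION: mask token id, not stated in the listing


def make_noisy_chunk(input_ids, rng, eps=EPS, mask_token_id=MASK_TOKEN_ID):
    """Hypothetical helper: build one chunk with the four described fields.

    Plain lists stand in for the PyTorch tensors stored in the .pt files.
    """
    # Sample the time scalar t uniformly from [0, 1).
    t = rng.random()
    # ASSUMPTION: per-token masking probability, floored at eps so that
    # even t close to 0 still masks some tokens in expectation.
    p_mask = (1.0 - eps) * t + eps
    # Decide independently for each token whether it is masked.
    mask = [rng.random() < p_mask for _ in input_ids]
    # Masked positions are replaced by the mask token; others are kept.
    noisy_input_ids = [
        mask_token_id if masked else token
        for token, masked in zip(input_ids, mask)
    ]
    return {
        "input_ids": list(input_ids),
        "noisy_input_ids": noisy_input_ids,
        "mask": mask,
        "t": t,
    }


rng = random.Random(0)  # fixed seed so the sketch is reproducible
chunk = make_noisy_chunk(list(range(100, 116)), rng)
print("t =", chunk["t"])
print("masked positions:", sum(chunk["mask"]), "of", len(chunk["input_ids"]))
```

Whether the stored t is the raw sample or the resulting masking probability is not stated in the listing; the sketch stores the raw sample.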

Provider:

Fredtt3
