
Fredtt3/LLaDA-Sample-10BT

Source: Hugging Face. Updated: 2025-07-16. Indexed: 2025-08-30.
Download link:
https://hf-mirror.com/datasets/Fredtt3/LLaDA-Sample-10BT
Description:
LLaDA-Sample-10BT is built from the sample-10BT subset of HuggingFaceFW/fineweb and is intended for training the large language diffusion model LLaDA. During preprocessing, the text is tokenized with the GSAI-ML/LLaDA-8B-Instruct tokenizer and split into chunks of at most 4,096 tokens; 1% of chunks are instead given a random length between 1 and 4,096 tokens. Noise masking is then applied with a noise factor ε = 1×10⁻³. Each chunk stores four PyTorch tensor fields: input_ids, noisy_input_ids, mask, and t (a time scalar). The dataset contains about 2.52 million chunks in 252 .pt files of roughly 10,000 chunks each; files average about 702-708 MB, for a total of approximately 166 GB. It is used for model training in the LLaDA-from-scratch GitHub repository, which includes the complete data pipeline and training scripts.
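
The preprocessing described above (chunking, occasional random-length chunks, and noise masking) can be sketched roughly as follows. This is a minimal illustration in plain Python lists rather than PyTorch tensors, not the dataset author's actual pipeline. Several details are assumptions not stated in the description: the function names `chunk_tokens` and `add_noise`, the masking probability `(1 - eps) * t + eps`, storing the raw sampled `t`, and the mask-token id `MASK_ID`, which should be checked against the GSAI-ML/LLaDA-8B-Instruct tokenizer.

```python
import random

MAX_LEN = 4096      # maximum chunk length, from the description
SHORT_PROB = 0.01   # 1% of chunks get a random length, from the description
EPS = 1e-3          # noise factor epsilon, from the description
MASK_ID = 126336    # ASSUMPTION: mask-token id; verify against the tokenizer


def chunk_tokens(token_ids, rng, max_len=MAX_LEN, short_prob=SHORT_PROB):
    """Split a flat token list into chunks of at most max_len tokens.

    With probability short_prob, a chunk's target length is drawn
    uniformly from [1, max_len] instead of being max_len.
    """
    chunks, start = [], 0
    while start < len(token_ids):
        size = rng.randint(1, max_len) if rng.random() < short_prob else max_len
        chunks.append(token_ids[start:start + size])
        start += size
    return chunks


def add_noise(input_ids, rng, eps=EPS, mask_id=MASK_ID):
    """Mask tokens independently and return the four fields named in the description.

    ASSUMPTION: t is uniform on [0, 1) and each token is masked with
    probability (1 - eps) * t + eps, so the rate never falls below eps.
    """
    t = rng.random()
    p_mask = (1 - eps) * t + eps
    mask = [rng.random() < p_mask for _ in input_ids]
    noisy = [mask_id if m else tok for tok, m in zip(input_ids, mask)]
    return {"input_ids": input_ids, "noisy_input_ids": noisy, "mask": mask, "t": t}


if __name__ == "__main__":
    rng = random.Random(0)
    fake_tokens = list(range(10_000))  # stand-in for real tokenizer output
    chunks = chunk_tokens(fake_tokens, rng)
    record = add_noise(chunks[0], rng)
    print(len(chunks), len(record["input_ids"]), sum(record["mask"]))
```

In the real dataset these fields are PyTorch tensors stored in .pt files; how records are grouped inside each file is not specified here, so inspect one file before writing a loader.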
Provider:
Fredtt3