cion-ai/slm-packed-exp-01

Name: cion-ai/slm-packed-exp-01
Creator: cion-ai
Published: 2026-04-16 17:37:01
License: No description provided

Source: Hugging Face | Updated: 2026-04-16 | Indexed: 2026-04-26

Download link:

https://hf-mirror.com/datasets/cion-ai/slm-packed-exp-01


Description:

---
dataset_info:
  features:
  - name: input_ids
    dtype: int32
    list: true
  - name: labels
    dtype: int32
    list: true
  splits:
  - name: train
    num_bytes: 792858299
    num_examples: 48359
  - name: validation
    num_bytes: 7592492
    num_examples: 463
  download_size: 800450791
  dataset_size: 800450791
---

# slm-packed-exp-01

## Dataset Summary

Pre-tokenized, packed, and shuffled dataset for Small Language Model (SLM) training. Designed for zero-overhead training: no runtime tokenization or masking is required.

## Dataset Details

- **Tokenizer**: `gpt2`
- **Vocabulary Size**: 50257
- **Sequence Length**: 2048 tokens
- **Total Tokens**: ~99,987,456
- **Train Sequences**: 48,359
- **Validation Sequences**: 463
- **Train/Val Split**: 99.1% / 0.9%
- **Shuffle Seed**: 42
- **Packing Strategy**: EOS-aware concatenation (multiple documents per sequence)
- **Masking**: None required (standard causal LM with labels=input_ids)

## Source Datasets

| Dataset | Subset | Weight | Text Column |
|---------|--------|--------|-------------|
| HuggingFaceFW/fineweb | sample-10BT | 60.0% | text |
| roneneldan/TinyStories | default | 40.0% | text |

## Creation Date

2026-04-16 17:36:52 UTC

## Training Usage

This dataset is designed for direct use with the Hugging Face `Trainer`, without any preprocessing:

```python
from datasets import load_dataset
from transformers import Trainer, TrainingArguments

dataset = load_dataset("cion-ai/slm-packed-exp-01")

training_args = TrainingArguments(
    output_dir="./output",
    per_device_train_batch_size=32,
    remove_unused_columns=False,  # Keep input_ids and labels
)

# `model` is not defined in this snippet: supply your own causal LM
# whose vocabulary covers the 50257 GPT-2 token ids.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
)
```

## Notes

- **No attention_mask needed**: All sequences are fully packed (no padding)
- **No loss_mask needed**: Standard causal LM loss works directly
- **EOS tokens**: Present at document boundaries for context separation
- **Shuffled**: Both splits shuffled with seed=42 for reproducibility

## License

Inherited from the source datasets. Please verify the individual source licenses.

## Citation

If you use this dataset, please cite the source datasets appropriately.
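The card describes the packing strategy only as "EOS-aware concatenation", so the exact procedure used to build this dataset is not documented. As a rough illustration of what such packing commonly looks like, the sketch below appends an EOS token after each tokenized document, concatenates everything into one stream, and cuts the stream into fixed-length blocks with `labels` equal to `input_ids`. The function name `pack_documents` and the choice to drop the trailing partial block are assumptions made for this example, not details taken from the dataset's build script; the EOS id 50256 is the GPT-2 `<|endoftext|>` token.

```python
from typing import Dict, Iterable, List

EOS_TOKEN_ID = 50256  # GPT-2 <|endoftext|> id
SEQ_LEN = 2048        # sequence length stated in the card


def pack_documents(
    docs: Iterable[List[int]],
    seq_len: int = SEQ_LEN,
    eos_id: int = EOS_TOKEN_ID,
) -> List[Dict[str, List[int]]]:
    """Hypothetical helper: pack tokenized documents into full-length blocks."""
    buffer: List[int] = []
    packed: List[Dict[str, List[int]]] = []
    for doc in docs:
        buffer.extend(doc)
        buffer.append(eos_id)  # mark the document boundary
        # Emit every complete block currently available in the buffer.
        while len(buffer) >= seq_len:
            block, buffer = buffer[:seq_len], buffer[seq_len:]
            # labels mirror input_ids; causal LM heads shift them internally
            packed.append({"input_ids": block, "labels": list(block)})
    # Assumption: the trailing partial block is dropped, so no padding
    # (and therefore no attention_mask) is ever needed.
    return packed


if __name__ == "__main__":
    toy_docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
    for row in pack_documents(toy_docs, seq_len=4):
        print(row["input_ids"])
```

With a toy `seq_len` of 4, the second block spans the end of one document and the start of the next, which is the "multiple documents per sequence" behavior the card mentions. The card's figures are consistent with full-length blocks only: (48,359 + 463) × 2,048 = 99,987,456 tokens.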

Provider:

cion-ai
