
cion-ai/slm-packed-exp-01

Source: Hugging Face | Updated: 2026-04-16 | Indexed: 2026-04-26
Download link:
https://hf-mirror.com/datasets/cion-ai/slm-packed-exp-01
Resource description:

---
dataset_info:
  features:
  - name: input_ids
    dtype: int32
    list: true
  - name: labels
    dtype: int32
    list: true
  splits:
  - name: train
    num_bytes: 792858299
    num_examples: 48359
  - name: validation
    num_bytes: 7592492
    num_examples: 463
  download_size: 800450791
  dataset_size: 800450791
---

# slm-packed-exp-01

## Dataset Summary

Pre-tokenized, packed, and shuffled dataset for Small Language Model (SLM) training. Designed for zero-overhead training: no runtime tokenization or masking is required.

## Dataset Details

- **Tokenizer**: `gpt2`
- **Vocabulary Size**: 50257
- **Sequence Length**: 2048 tokens
- **Total Tokens**: ~99,987,456
- **Train Sequences**: 48,359
- **Validation Sequences**: 463
- **Train/Val Split**: 99.1% / 0.9%
- **Shuffle Seed**: 42
- **Packing Strategy**: EOS-aware concatenation (multiple documents per sequence)
- **Masking**: None required (standard causal LM with labels=input_ids)

## Source Datasets

| Dataset | Subset | Weight | Text Column |
|---------|--------|--------|-------------|
| HuggingFaceFW/fineweb | sample-10BT | 60.0% | text |
| roneneldan/TinyStories | default | 40.0% | text |

## Creation Date

2026-04-16 17:36:52 UTC

## Training Usage

This dataset is designed for direct use with Hugging Face `Trainer` without any preprocessing:

```python
from datasets import load_dataset
from transformers import Trainer, TrainingArguments

dataset = load_dataset("cion-ai/slm-packed-exp-01")

training_args = TrainingArguments(
    output_dir="./output",
    per_device_train_batch_size=32,
    remove_unused_columns=False,  # Keep input_ids and labels
)

# `model` is not defined in this snippet: it must be a causal LM that you have
# already instantiated, with a vocabulary covering the gpt2 tokenizer (50257 ids).
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
)
```

## Notes

- **No attention_mask needed**: All sequences are fully packed (no padding)
- **No loss_mask needed**: Standard causal LM loss works directly
- **EOS tokens**: Present at document boundaries for context separation
- **Shuffled**: Both splits shuffled with seed=42 for reproducibility

## License

Inherited from source datasets. Please verify individual source licenses.

## Citation

If you use this dataset, please cite the source datasets appropriately.
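
## Illustrative Packing Sketch

The card describes the packing strategy only as "EOS-aware concatenation (multiple documents per sequence)" and does not publish the build script. The following is a minimal sketch of one common way to do this, not the authors' actual pipeline. The function name `pack_documents`, the choice to append EOS after every document, and the choice to drop the trailing partial block are all assumptions made for illustration. The EOS id 50256 is the `<|endoftext|>` token of the `gpt2` tokenizer.

```python
from typing import Dict, Iterable, Iterator, List

EOS_ID = 50256   # <|endoftext|> in the gpt2 vocabulary
SEQ_LEN = 2048   # sequence length stated on the card


def pack_documents(
    docs: Iterable[List[int]],
    seq_len: int = SEQ_LEN,
    eos_id: int = EOS_ID,
) -> Iterator[Dict[str, List[int]]]:
    """Hypothetical packer: concatenate tokenized documents, separated by EOS,
    and emit fixed-length blocks with labels equal to input_ids."""
    buffer: List[int] = []
    for doc in docs:
        buffer.extend(doc)
        buffer.append(eos_id)  # mark the document boundary
        # Emit every full block currently available in the buffer.
        while len(buffer) >= seq_len:
            block, buffer = buffer[:seq_len], buffer[seq_len:]
            yield {"input_ids": block, "labels": list(block)}
    # Assumption: the trailing partial block is discarded, so no padding
    # (and therefore no attention_mask) is ever needed.


if __name__ == "__main__":
    # Toy example with a tiny sequence length and EOS id 0.
    toy_docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
    for example in pack_documents(toy_docs, seq_len=4, eos_id=0):
        print(example["input_ids"])
```

Because every emitted block is exactly `seq_len` tokens long, a batch can be stacked into a tensor directly, which is consistent with the card's note that no padding or attention mask is required.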
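
## Checking the Card's Arithmetic

The figures on the card (token total, split percentages, byte sizes) can be cross-checked from the sequence counts and the sequence length alone. The short script below is a sketch that uses only numbers quoted on the card; the variable and function names are illustrative, not part of any library.

```python
SEQ_LEN = 2048
TRAIN_SEQS = 48_359
VAL_SEQS = 463
TRAIN_BYTES = 792_858_299
VAL_BYTES = 7_592_492
DATASET_SIZE = 800_450_791


def card_stats() -> dict:
    """Recompute the derived figures from the raw counts on the card."""
    total_seqs = TRAIN_SEQS + VAL_SEQS
    return {
        "total_tokens": total_seqs * SEQ_LEN,
        "train_pct": round(100 * TRAIN_SEQS / total_seqs, 1),
        "val_pct": round(100 * VAL_SEQS / total_seqs, 1),
        "total_bytes": TRAIN_BYTES + VAL_BYTES,
        # Two int32 columns (input_ids, labels) of SEQ_LEN values each.
        "payload_bytes_per_example": 2 * SEQ_LEN * 4,
    }


if __name__ == "__main__":
    for name, value in card_stats().items():
        print(f"{name}: {value}")
```

The recomputed token total and byte total match the card exactly, and the raw int32 payload per example (16,384 bytes) accounts for nearly all of the reported per-example size, with the small remainder presumably being storage overhead.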
Provider:
cion-ai