cion-ai/slm-packed-exp-01
Hugging Face · Updated 2026-04-16 · Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/cion-ai/slm-packed-exp-01
Description:
---
dataset_info:
  features:
  - name: input_ids
    dtype: int32
    list: true
  - name: labels
    dtype: int32
    list: true
  splits:
  - name: train
    num_bytes: 792858299
    num_examples: 48359
  - name: validation
    num_bytes: 7592492
    num_examples: 463
  download_size: 800450791
  dataset_size: 800450791
---
# slm-packed-exp-01
## Dataset Summary
A pre-tokenized, packed, and shuffled dataset for Small Language Model (SLM) training.
Sequences can be fed to the training loop as stored: no runtime tokenization or masking is required.
## Dataset Details
- **Tokenizer**: `gpt2`
- **Vocabulary Size**: 50257
- **Sequence Length**: 2048 tokens
- **Total Tokens**: 99,987,456 (48,822 sequences × 2,048 tokens)
- **Train Sequences**: 48,359
- **Validation Sequences**: 463
- **Train/Val Split**: 99.1% / 0.9%
- **Shuffle Seed**: 42
- **Packing Strategy**: EOS-aware concatenation (multiple documents per sequence)
- **Masking**: None required (standard causal LM with labels=input_ids)
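
The packing strategy listed above can be illustrated with a minimal sketch. The actual packing script is not published on this card, so the helper name `pack_documents`, the decision to drop the trailing partial sequence, and the toy token IDs are all assumptions. Only the idea of EOS-separated concatenation and `labels` equal to `input_ids` come from the card; 50256 is the `<|endoftext|>` ID of the `gpt2` tokenizer.

```python
from typing import Iterable, List

GPT2_EOS_ID = 50256  # <|endoftext|> in the gpt2 tokenizer


def pack_documents(docs: Iterable[List[int]], seq_len: int,
                   eos_id: int = GPT2_EOS_ID) -> List[List[int]]:
    """Concatenate tokenized documents, appending EOS after each one,
    then cut the stream into fixed-length sequences (hypothetical helper)."""
    stream: List[int] = []
    for doc in docs:
        stream.extend(doc)
        stream.append(eos_id)  # mark the document boundary
    n_full = len(stream) // seq_len
    # Assumption: the trailing partial sequence is dropped, so no padding is needed.
    return [stream[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]


# Toy example with seq_len=4 instead of 2048.
docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
packed = pack_documents(docs, seq_len=4)

# labels are a plain copy of input_ids, matching "labels=input_ids" above.
examples = [{"input_ids": seq, "labels": list(seq)} for seq in packed]
```

In the toy run, the second sequence starts in one document and ends in the next, which is what "multiple documents per sequence" means in practice.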
## Source Datasets
| Dataset | Subset | Weight | Text Column |
|---------|--------|--------|-------------|
| HuggingFaceFW/fineweb | sample-10BT | 60.0% | text |
| roneneldan/TinyStories | default | 40.0% | text |
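
The card does not say whether the weights apply per document or per token, nor how the mix was drawn. As a rough illustration only, the sketch below assumes per-document sampling with the stated weights; `SOURCES` and `sample_sources` are hypothetical names, and the seed is borrowed from the shuffle seed purely for convenience.

```python
import random
from collections import Counter

# Weights from the table above; per-document sampling is an assumption.
SOURCES = {
    "HuggingFaceFW/fineweb": 0.6,
    "roneneldan/TinyStories": 0.4,
}


def sample_sources(n_docs: int, seed: int = 42) -> Counter:
    """Draw a source for each of n_docs documents according to SOURCES."""
    rng = random.Random(seed)
    names = list(SOURCES)
    weights = [SOURCES[name] for name in names]
    return Counter(rng.choices(names, weights=weights, k=n_docs))


counts = sample_sources(10_000)
fineweb_share = counts["HuggingFaceFW/fineweb"] / 10_000  # close to 0.6
```

With the `datasets` library, `interleave_datasets` with `probabilities=[0.6, 0.4]` is one way to obtain a comparable mix from streamed sources, though nothing here confirms that it was used.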
## Creation Date
2026-04-16 17:36:52 UTC
## Training Usage
This dataset is designed for direct use with Hugging Face `Trainer` without any preprocessing:
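```python
from datasets import load_dataset
from transformers import GPT2Config, GPT2LMHeadModel, Trainer, TrainingArguments

dataset = load_dataset("cion-ai/slm-packed-exp-01")

# Placeholder model (an assumption, not part of this dataset): any causal LM
# with a 50257-token vocabulary and a context of at least 2048 tokens works.
model = GPT2LMHeadModel(GPT2Config(n_positions=2048))

training_args = TrainingArguments(
    output_dir="./output",
    per_device_train_batch_size=32,
    remove_unused_columns=False,  # Keep input_ids and labels
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
)
trainer.train()
```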
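
Before starting a long run, it can be worth checking that records match the description on this card. The following is a minimal sketch, not part of the dataset tooling: `validate_example` is a hypothetical helper, and it is exercised on a synthetic record so that it runs without downloading anything.

```python
SEQ_LEN = 2048       # sequence length stated on this card
VOCAB_SIZE = 50257   # gpt2 vocabulary size stated on this card


def validate_example(example: dict) -> None:
    """Raise ValueError if a record does not match the card's description."""
    ids, labels = example["input_ids"], example["labels"]
    if len(ids) != SEQ_LEN:
        raise ValueError(f"expected {SEQ_LEN} tokens, got {len(ids)}")
    if list(labels) != list(ids):
        raise ValueError("labels differ from input_ids")
    if not all(0 <= t < VOCAB_SIZE for t in ids):
        raise ValueError("token ID outside the gpt2 vocabulary")


# Synthetic record standing in for dataset["train"][0].
fake = {"input_ids": [50256] * SEQ_LEN, "labels": [50256] * SEQ_LEN}
validate_example(fake)  # passes silently
```

On the real data, the same function could be applied to a handful of rows, for example `validate_example(dataset["train"][0])`.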
## Notes
- **No attention_mask needed**: All sequences are fully packed (no padding)
- **No loss_mask needed**: Standard causal LM loss works directly
- **EOS tokens**: Present at document boundaries for context separation
- **Shuffled**: Both splits shuffled with seed=42 for reproducibility
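
The reason `labels` can simply equal `input_ids` is that causal LM heads in `transformers` shift the labels by one position internally, so position t is scored against token t+1. The pure-Python sketch below only illustrates that pairing; `next_token_pairs` is a hypothetical helper and the token IDs are made up, apart from 50256, the `gpt2` EOS ID.

```python
from typing import List, Tuple


def next_token_pairs(input_ids: List[int]) -> List[Tuple[int, int]]:
    """Return (context token, target token) pairs as a causal LM loss sees them."""
    labels = list(input_ids)  # labels are a plain copy of input_ids
    # Position t is scored against labels[t + 1]; the last position has no target.
    return list(zip(input_ids[:-1], labels[1:]))


EOS = 50256  # gpt2 <|endoftext|>
pairs = next_token_pairs([11, 12, EOS, 21, 22])
# pairs == [(11, 12), (12, 50256), (50256, 21), (21, 22)]
```

The pair `(50256, 21)` shows a consequence of training without a loss mask: the model is also asked to predict the first token of the next document right after an EOS token.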
## License
Inherited from source datasets. Please verify individual source licenses.
## Citation
If you use this dataset, please cite the source datasets appropriately.
Provider:
cion-ai



