Ba2han/mixed_curated_pre-sft
Source: Hugging Face. Updated 2025-12-31. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/Ba2han/mixed_curated_pre-sft
Description:
---
license: mit
datasets:
- Ba2han/mixed_curated
- Kyle1668/mcqa-midtraining-mix
- pthinc/turkish_english_general_dataset
language:
- en
- tr
---
# Mixed Curated Pre-SFT Dataset
This dataset is a compilation of multiple sources processed for pre-SFT/mid-training.
## Dataset Statistics
- **Total Rows:** 2,222,004
- **Total Tokens:** 1,664,903,224
- **Tokenizer:** `Ba2han/model-phase3` (BOS and EOS tokens explicitly added).
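The totals above work out to roughly 749 tokens per row. The sketch below shows one way such a count could be produced when BOS and EOS are added explicitly to every row. It is a minimal illustration, not the script used for this dataset: `count_tokens`, `toy_encode`, and the ID values `1` and `2` are hypothetical, and the whitespace encoder stands in for the real `Ba2han/model-phase3` tokenizer.

```python
from typing import Callable, Iterable, List

TOTAL_ROWS = 2_222_004
TOTAL_TOKENS = 1_664_903_224

# Average row length implied by the published statistics.
avg_tokens_per_row = TOTAL_TOKENS / TOTAL_ROWS


def count_tokens(
    texts: Iterable[str],
    encode: Callable[[str], List[int]],
    bos_id: int,
    eos_id: int,
) -> int:
    """Count tokens over all rows, wrapping each row in BOS ... EOS."""
    total = 0
    for text in texts:
        # `encode` is assumed to return IDs WITHOUT special tokens,
        # so BOS and EOS are added here exactly once per row.
        ids = [bos_id] + encode(text) + [eos_id]
        total += len(ids)
    return total


def toy_encode(text: str) -> List[int]:
    """Hypothetical stand-in encoder: one ID per whitespace-separated word."""
    return [len(word) for word in text.split()]


demo_total = count_tokens(["a b c", "d"], toy_encode, bos_id=1, eos_id=2)
```

Because every row carries two extra tokens, the BOS/EOS overhead across the whole dataset is 2 × 2,222,004 = 4,444,008 tokens.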
## Composition & Processing Methods
1. **Local MCQA**:
   - Exploded into two main variations:
     1. **Combined**: `text` + `\n` + `output` (rows shorter than 50 characters removed).
     2. **Separated**:
        - `text` column isolated (rows shorter than 15 characters removed).
        - `output` column isolated, with `<çeviri>`/`</çeviri>` tags removed (rows shorter than 15 characters removed).
2. **Ba2han/mixed_curated**:
   - Dropped 50% of the examples where `source == "Cosmos-2"`.
   - Kept all examples from other sources.
3. **Kyle1668/mcqa-midtraining-mix**:
   - Streamed, shuffled, and sampled 80,000 examples.
4. **pthinc/turkish_english_general_dataset**:
   - Augmented to create two variations per example: `text + prompt-answer` and `prompt-answer + text`.
   - Removed the shortest 20% of the resulting dataset.
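
The four processing steps above can be sketched in plain Python as follows. This is an illustrative reconstruction, not the original pipeline. Several details are assumptions: all function names, the `prompt` and `answer` column names, the `\n` separator in step 4, an exact (rather than probabilistic) 50% drop, applying the length filter after tag removal, and measuring "shortest" in characters. Reservoir sampling is used here as a stand-in for "streamed, shuffled, and sampled".

```python
import random
import re

# Matches the opening and closing translation tags named in the card.
TAG_RE = re.compile(r"</?çeviri>")


def explode_mcqa(row):
    """Step 1: build the combined and separated variants of one MCQA row."""
    text, output = row["text"], row["output"]
    variants = []
    combined = text + "\n" + output
    if len(combined) >= 50:          # drop combined rows under 50 chars
        variants.append(combined)
    if len(text) >= 15:              # drop isolated text under 15 chars
        variants.append(text)
    cleaned = TAG_RE.sub("", output)
    if len(cleaned) >= 15:           # assumed: filter applied after tag removal
        variants.append(cleaned)
    return variants


def downsample_source(rows, source="Cosmos-2", drop_fraction=0.5, seed=0):
    """Step 2: drop a fraction of rows from one source, keep everything else."""
    rng = random.Random(seed)
    hits = [i for i, r in enumerate(rows) if r["source"] == source]
    dropped = set(rng.sample(hits, int(len(hits) * drop_fraction)))
    return [r for i, r in enumerate(rows) if i not in dropped]


def reservoir_sample(stream, k, seed=0):
    """Step 3 (stand-in): uniform sample of k items from a one-pass stream."""
    rng = random.Random(seed)
    sample = []
    for n, item in enumerate(stream):
        if n < k:
            sample.append(item)
        else:
            j = rng.randint(0, n)    # inclusive on both ends
            if j < k:
                sample[j] = item
    rng.shuffle(sample)              # remove any ordering left from the stream
    return sample


def augment_pair(row, sep="\n"):
    """Step 4a: emit both orderings of text and prompt-answer."""
    prompt_answer = row["prompt"] + sep + row["answer"]
    return [row["text"] + sep + prompt_answer,
            prompt_answer + sep + row["text"]]


def drop_shortest(texts, fraction=0.2):
    """Step 4b: remove the shortest `fraction` of rows, preserving order."""
    cut = int(len(texts) * fraction)
    by_length = sorted(range(len(texts)), key=lambda i: len(texts[i]))
    dropped = set(by_length[:cut])
    return [t for i, t in enumerate(texts) if i not in dropped]
```

For step 3, a call such as `reservoir_sample(stream, 80_000)` mirrors the sample size stated above. With the Hugging Face `datasets` library, the same effect would presumably come from streaming mode with a buffered shuffle followed by taking the first 80,000 rows, though the card does not say which calls were used.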
Provider:
Ba2han



