
Ba2han/mixed_curated_pre-sft

Source: Hugging Face. Updated 2025-12-31; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/Ba2han/mixed_curated_pre-sft
Resource description:
---
license: mit
datasets:
- Ba2han/mixed_curated
- Kyle1668/mcqa-midtraining-mix
- pthinc/turkish_english_general_dataset
language:
- en
- tr
---

# Mixed Curated Pre-SFT Dataset

This dataset is a compilation of multiple sources processed for pre-SFT/mid-training.

## Dataset Statistics

- **Total Rows:** 2,222,004
- **Total Tokens:** 1,664,903,224
- **Tokenizer:** `Ba2han/model-phase3` (BOS and EOS tokens explicitly added).

## Composition & Processing Methods

1. **Local MCQA**:
   - Exploded into two main variations:
     1. **Combined**: `text` + `\n` + `output` (Filtered < 50 chars).
     2. **Separated**:
        - `text` column isolated (Filtered < 15 chars).
        - `output` column isolated, `<çeviri>`/`</çeviri>` tags removed (Filtered < 15 chars).
2. **Ba2han/mixed_curated**:
   - Filtered out 50% of examples where `source == "Cosmos-2"`.
   - Kept all other sources.
3. **Kyle1668/mcqa-midtraining-mix**:
   - Streamed, shuffled, and sampled 80,000 examples.
4. **pthinc/turkish_english_general_dataset**:
   - Augmented to create two variations per example: `text + prompt-answer` and `prompt-answer + text`.
   - Removed the shortest 20% of the resulting dataset.
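
The processing steps listed above can be approximated with a few plain-Python helpers. This is a minimal sketch under stated assumptions, not the author's actual pipeline: the function names are hypothetical, "Filtered < N chars" is read as "rows shorter than N characters were dropped", the `prompt` and `answer` column names and the `\n` joiner used in the augmentation step are guesses, and the 50% filter is assumed to be a seeded random sample.

```python
import random


def explode_mcqa(row, min_combined=50, min_separated=15):
    """Return the text variants for one Local MCQA row (hypothetical helper)."""
    text, output = row["text"], row["output"]
    variants = []
    # Combined variant: text + newline + output, dropped if under 50 chars.
    combined = text + "\n" + output
    if len(combined) >= min_combined:
        variants.append(combined)
    # Separated variants: each column alone, dropped if under 15 chars.
    if len(text) >= min_separated:
        variants.append(text)
    # Assumption: tags are stripped before the length check is applied.
    cleaned = output.replace("<çeviri>", "").replace("</çeviri>", "")
    if len(cleaned) >= min_separated:
        variants.append(cleaned)
    return variants


def drop_half_of_source(rows, source="Cosmos-2", seed=0):
    """Drop a random half of the rows whose `source` matches; keep the rest."""
    matching = [i for i, r in enumerate(rows) if r["source"] == source]
    dropped = set(random.Random(seed).sample(matching, len(matching) // 2))
    return [r for i, r in enumerate(rows) if i not in dropped]


def augment_both_orders(row, sep="\n"):
    """Build `text + prompt-answer` and `prompt-answer + text` (assumed columns)."""
    qa = row["prompt"] + sep + row["answer"]
    return [row["text"] + sep + qa, qa + sep + row["text"]]


def drop_shortest_fraction(texts, fraction=0.2):
    """Remove the shortest `fraction` of texts, preserving the original order."""
    n_drop = int(len(texts) * fraction)
    by_length = sorted(range(len(texts)), key=lambda i: len(texts[i]))
    dropped = set(by_length[:n_drop])
    return [t for i, t in enumerate(texts) if i not in dropped]
```

The length filters here count Python characters rather than tokens, which matches the "chars" wording in the card. The streamed 80,000-example sample from `Kyle1668/mcqa-midtraining-mix` is left out because it needs network access.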
Provider:
Ba2han