
TinyCorpus-v2

Source: ModelScope community. Updated 2025-11-27; indexed 2025-09-27.
Download link:
https://modelscope.cn/datasets/xTimeCrystal/TinyCorpus-v2
# Dataset Card for MiniModel Pretraining Corpus

This dataset is a curated, tokenized pretraining mixture designed specifically for training **MiniModel**-series small language models. It was tokenized using the **Mistral-7B-Instruct-v0.3 tokenizer** (vocab size: 32,768), which is included in the [MiniModel-200M-Base repository](https://huggingface.co/xTimeCrystal/MiniModel-200M-Base).

For **training code**, **data loading utilities**, and full reproducibility (including the training script), see the official GitHub repository: 🔗 [https://github.com/xTimeCrystal/MiniModel/tree/main](https://github.com/xTimeCrystal/MiniModel/tree/main)

## Dataset Details

### Dataset Description

- **Curated by:** xTimeCrystal
- **Languages:** English, Chinese, Python (code)
- **License:** Apache 2.0
- **Intended use:** Pretraining efficient small language models (e.g., MiniModel-200M-Base)
- **Token count:** ~10 billion tokens

This corpus combines high-quality educational and general-purpose text sources, filtered and balanced to maximize learning efficiency in low-compute training regimes.

### Source Data Composition

The dataset is a weighted mixture of the following sources (by token count):

- **70%** [`openbmb/Ultra-FineWeb`](https://huggingface.co/datasets/openbmb/Ultra-FineWeb) (English subset)
- **20%** [`openbmb/Ultra-FineWeb`](https://huggingface.co/datasets/openbmb/Ultra-FineWeb) (Chinese subset)
- **5%** [`Avelina/python-edu-cleaned`](https://huggingface.co/datasets/Avelina/python-edu-cleaned)
- **5%** [`HuggingFaceTB/finemath`](https://huggingface.co/datasets/HuggingFaceTB/finemath)

All source datasets are publicly available and compatible with the Apache 2.0 license.
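
As a rough illustration of what the mixture weights imply, the sketch below converts the stated percentages into approximate per-source token budgets. It assumes the total is exactly 10 billion tokens, whereas the card only says "~10 billion", so the resulting figures are estimates. The names `TOTAL_TOKENS`, `MIXTURE_PERCENT`, and `token_budget` are illustrative and do not come from the MiniModel repository.

```python
# Approximate per-source token budgets implied by the mixture weights.
# Assumption: total is exactly 10B tokens (the card states "~10 billion").
TOTAL_TOKENS = 10_000_000_000

# Weights as integer percentages, copied from the source composition list.
MIXTURE_PERCENT = {
    "openbmb/Ultra-FineWeb (English subset)": 70,
    "openbmb/Ultra-FineWeb (Chinese subset)": 20,
    "Avelina/python-edu-cleaned": 5,
    "HuggingFaceTB/finemath": 5,
}


def token_budget(total: int, percents: dict[str, int]) -> dict[str, int]:
    """Split `total` tokens across sources according to integer percentages."""
    if sum(percents.values()) != 100:
        raise ValueError("mixture percentages must sum to 100")
    # Integer arithmetic avoids floating-point rounding in the budgets.
    return {name: total * pct // 100 for name, pct in percents.items()}


budget = token_budget(TOTAL_TOKENS, MIXTURE_PERCENT)

if __name__ == "__main__":
    for name, tokens in budget.items():
        print(f"{name}: ~{tokens / 1e9:.1f}B tokens")
```

Under that assumption, the English web text accounts for roughly 7 billion tokens, while the code and math sources contribute about half a billion tokens each.
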
### Preprocessing

- Tokenized with the **Mistral-7B-Instruct-v0.3 tokenizer**
- Sequences were packed using a bin-packing algorithm to minimize padding (final padding < 5%)
- Maximum sequence length: 2048 tokens
- No deduplication beyond source-level filtering

> 💡 **Note**: The tokenizer, training configuration, and data-loading pipeline are provided in the [GitHub repo](https://github.com/xTimeCrystal/MiniModel/tree/main) for full reproducibility.
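
The card does not say which bin-packing algorithm was used. As a minimal sketch of the general idea, the example below uses first-fit decreasing, a common heuristic, to pack tokenized sequences into bins of at most 2048 tokens and then measures the padding that remains. The function names and the toy sequence lengths are hypothetical; the actual pipeline lives in the GitHub repository and may differ.

```python
# Sketch of sequence packing via first-fit decreasing (FFD).
# Assumption: FFD is a stand-in; the card only says "a bin-packing algorithm".
MAX_LEN = 2048  # maximum sequence length stated in the card


def pack_first_fit_decreasing(lengths: list[int], capacity: int = MAX_LEN) -> list[list[int]]:
    """Return bins as lists of indices into `lengths`, each summing to <= capacity."""
    bins: list[list[int]] = []  # indices of the sequences placed in each bin
    free: list[int] = []        # remaining room in each bin
    # Visit sequences from longest to shortest.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    for i in order:
        n = lengths[i]
        if n > capacity:
            raise ValueError(f"sequence {i} has {n} tokens, exceeding {capacity}")
        # Place the sequence in the first bin with enough room.
        for b, room in enumerate(free):
            if n <= room:
                bins[b].append(i)
                free[b] -= n
                break
        else:
            # No existing bin fits, so open a new one.
            bins.append([i])
            free.append(capacity - n)
    return bins


def padding_fraction(lengths: list[int], bins: list[list[int]], capacity: int = MAX_LEN) -> float:
    """Fraction of all bin slots that would be filled with padding tokens."""
    return 1.0 - sum(lengths) / (len(bins) * capacity)


# Toy sequence lengths, invented for illustration only.
example_lengths = [1500, 548, 1024, 1024, 2048, 300]
example_bins = pack_first_fit_decreasing(example_lengths)

if __name__ == "__main__":
    print(example_bins)
    print(f"padding: {padding_fraction(example_lengths, example_bins):.1%}")
```

On a toy input this small the padding fraction is much higher than the sub-5% figure reported for the real corpus; with many sequences of varied lengths, a packer has far more opportunities to fill each bin.
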

Provider: maas
Created: 2025-09-25