TinyCorpus-v2
Source: ModelScope community. Updated: 2025-11-27. Indexed: 2025-09-27.
Download link: https://modelscope.cn/datasets/xTimeCrystal/TinyCorpus-v2
# Dataset Card for MiniModel Pretraining Corpus
This dataset is a curated, tokenized pretraining mixture designed specifically for training **MiniModel**-series small language models. It was tokenized using the **Mistral-7B-Instruct-v0.3 tokenizer** (vocab size: 32,768), which is included in the [MiniModel-200M-Base repository](https://huggingface.co/xTimeCrystal/MiniModel-200M-Base).
For **training code**, **data-loading utilities**, and the training script needed for full reproducibility, see the official GitHub repository:
🔗 [https://github.com/xTimeCrystal/MiniModel/tree/main](https://github.com/xTimeCrystal/MiniModel/tree/main)
## Dataset Details
### Dataset Description
- **Curated by:** xTimeCrystal
- **Languages:** English, Chinese, Python (code)
- **License:** Apache 2.0
- **Intended use:** Pretraining efficient small language models (e.g., MiniModel-200M-Base)
- **Token count:** ~10 billion tokens
This corpus combines high-quality educational and general-purpose text sources, filtered and balanced to maximize learning efficiency in low-compute training regimes.
### Source Data Composition
The dataset is a weighted mixture of the following sources (by token count):
- **70%** [`openbmb/Ultra-FineWeb`](https://huggingface.co/datasets/openbmb/Ultra-FineWeb) (English subset)
- **20%** [`openbmb/Ultra-FineWeb`](https://huggingface.co/datasets/openbmb/Ultra-FineWeb) (Chinese subset)
- **5%** [`Avelina/python-edu-cleaned`](https://huggingface.co/datasets/Avelina/python-edu-cleaned)
- **5%** [`HuggingFaceTB/finemath`](https://huggingface.co/datasets/HuggingFaceTB/finemath)
All source datasets are publicly available and compatible with the Apache 2.0 license.
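As a rough illustration of what these weights mean in absolute terms, the sketch below converts the percentages into approximate per-source token counts. It assumes the total is exactly 10 billion tokens, whereas the card only gives "~10 billion", so the results are estimates. The names `TOTAL_TOKENS`, `MIXTURE_PERCENT`, and `token_budget` are hypothetical and do not come from the MiniModel repository.

```python
# Assumption: the card says "~10 billion tokens"; 10e9 is used as a round figure.
TOTAL_TOKENS = 10_000_000_000

# Mixture weights in percent, as listed in the card.
MIXTURE_PERCENT = {
    "openbmb/Ultra-FineWeb (English)": 70,
    "openbmb/Ultra-FineWeb (Chinese)": 20,
    "Avelina/python-edu-cleaned": 5,
    "HuggingFaceTB/finemath": 5,
}


def token_budget(total_tokens, mixture_percent):
    """Return the approximate number of tokens contributed by each source."""
    if sum(mixture_percent.values()) != 100:
        raise ValueError("mixture percentages must sum to 100")
    # Integer arithmetic avoids floating-point rounding in the budget.
    return {
        name: total_tokens * pct // 100
        for name, pct in mixture_percent.items()
    }


budget = token_budget(TOTAL_TOKENS, MIXTURE_PERCENT)
for name, tokens in budget.items():
    print(f"{name}: ~{tokens / 1e9:.1f}B tokens")
```

Under that assumption, the two Ultra-FineWeb subsets account for roughly 9 billion tokens, and the code and math sources for about 0.5 billion each.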
### Preprocessing
- Tokenized with the **Mistral-7B-Instruct-v0.3 tokenizer**
- Sequences were packed using a bin-packing algorithm to minimize padding (final padding < 5%)
- Maximum sequence length: 2048 tokens
- No deduplication beyond source-level filtering
> 💡 **Note**: The tokenizer, training configuration, and data-loading pipeline are provided in the [GitHub repo](https://github.com/xTimeCrystal/MiniModel/tree/main) for full reproducibility.
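The card does not say which bin-packing algorithm was used for the packing step above, so the following is only a minimal sketch of one common choice, first-fit decreasing, applied to tokenized document lengths with the stated 2048-token limit. The function names and sample lengths are hypothetical. How documents longer than 2048 tokens were handled is not stated either; this sketch simply rejects them.

```python
MAX_SEQ_LEN = 2048  # maximum sequence length stated in the card


def pack_first_fit_decreasing(lengths, capacity=MAX_SEQ_LEN):
    """Group document lengths into bins holding at most `capacity` tokens.

    Returns a list of bins; each bin is a list of indices into `lengths`.
    First-fit decreasing is an assumed heuristic, not the confirmed method.
    """
    bins = []  # document indices assigned to each bin
    free = []  # remaining capacity of each bin
    # Visit documents from longest to shortest.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    for i in order:
        n = lengths[i]
        if n > capacity:
            raise ValueError(f"document {i} has {n} tokens, limit is {capacity}")
        # Place the document in the first bin with enough room.
        for b, room in enumerate(free):
            if n <= room:
                bins[b].append(i)
                free[b] -= n
                break
        else:
            # No existing bin fits, so open a new one.
            bins.append([i])
            free.append(capacity - n)
    return bins


def padding_fraction(lengths, bins, capacity=MAX_SEQ_LEN):
    """Fraction of all packed positions that would be padding."""
    total = len(bins) * capacity
    return 1 - sum(lengths) / total if total else 0.0


# Hypothetical document lengths, in tokens.
doc_lengths = [1500, 1200, 848, 548, 2048, 1000, 1000, 48]
packed = pack_first_fit_decreasing(doc_lengths)
print(f"{len(packed)} packed sequences, "
      f"padding fraction {padding_fraction(doc_lengths, packed):.2%}")
```

The inner loop scans every open bin, which is quadratic in the worst case. That is acceptable for an illustration, but a corpus-scale implementation would likely need an indexed structure for finding a bin with enough free space.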
Provider: maas
Created: 2025-09-25



