TinyCorpus-v2

Name: TinyCorpus-v2
Creator: maas
Published: 2025-11-27 16:49:41
License: Not specified (no description available)

Source: ModelScope Community. Updated 2025-11-27; indexed 2025-09-27.

Download link:

https://modelscope.cn/datasets/xTimeCrystal/TinyCorpus-v2

Resource description:

# Dataset Card for MiniModel Pretraining Corpus

This dataset is a curated, tokenized pretraining mixture designed specifically for training **MiniModel**-series small language models. It was tokenized using the **Mistral-7B-Instruct-v0.3 tokenizer** (vocab size: 32,768), which is included in the [MiniModel-200M-Base repository](https://huggingface.co/xTimeCrystal/MiniModel-200M-Base).

For **training code**, **data loading utilities**, and full reproducibility (including the training script), see the official GitHub repository:
🔗 [https://github.com/xTimeCrystal/MiniModel/tree/main](https://github.com/xTimeCrystal/MiniModel/tree/main)

## Dataset Details

### Dataset Description

- **Curated by:** xTimeCrystal
- **Languages:** English, Chinese, Python (code)
- **License:** Apache 2.0
- **Intended use:** Pretraining efficient small language models (e.g., MiniModel-200M-Base)
- **Token count:** ~10 billion tokens

This corpus combines high-quality educational and general-purpose text sources, filtered and balanced to maximize learning efficiency in low-compute training regimes.

### Source Data Composition

The dataset is a weighted mixture of the following sources (by token count):

- **70%** [`openbmb/Ultra-FineWeb`](https://huggingface.co/datasets/openbmb/Ultra-FineWeb) (English subset)
- **20%** [`openbmb/Ultra-FineWeb`](https://huggingface.co/datasets/openbmb/Ultra-FineWeb) (Chinese subset)
- **5%** [`Avelina/python-edu-cleaned`](https://huggingface.co/datasets/Avelina/python-edu-cleaned)
- **5%** [`HuggingFaceTB/finemath`](https://huggingface.co/datasets/HuggingFaceTB/finemath)

All source datasets are publicly available and compatible with the Apache 2.0 license.
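
As a rough illustration of what the stated percentages imply, the sketch below converts the mixture weights into per-source token budgets. It assumes a nominal total of exactly 10 billion tokens (the card only says "~10 billion"), and the names `MIXTURE_PERCENT` and `token_budget` are hypothetical helpers, not part of the MiniModel repository.

```python
# Mixture weights as stated in the dataset card, in whole percent of tokens.
MIXTURE_PERCENT = {
    "openbmb/Ultra-FineWeb (English)": 70,
    "openbmb/Ultra-FineWeb (Chinese)": 20,
    "Avelina/python-edu-cleaned": 5,
    "HuggingFaceTB/finemath": 5,
}

# Assumption: the card gives "~10 billion" tokens; 10e9 is used as a nominal figure.
NOMINAL_TOTAL_TOKENS = 10_000_000_000


def token_budget(total_tokens: int, mixture_percent: dict[str, int]) -> dict[str, int]:
    """Split a total token count across sources according to integer percentages."""
    if sum(mixture_percent.values()) != 100:
        raise ValueError("mixture percentages must sum to 100")
    # Integer arithmetic avoids floating-point rounding in the budgets.
    return {name: total_tokens * pct // 100 for name, pct in mixture_percent.items()}


budgets = token_budget(NOMINAL_TOTAL_TOKENS, MIXTURE_PERCENT)
for name, tokens in budgets.items():
    print(f"{name}: {tokens:,} tokens")
```

Under that nominal total, the English web text accounts for about 7 billion tokens, the Chinese web text for about 2 billion, and the code and math sources for about 0.5 billion each.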
### Preprocessing

- Tokenized with the **Mistral-7B-Instruct-v0.3 tokenizer**
- Sequences were packed using a bin-packing algorithm to minimize padding (final padding < 5%)
- Maximum sequence length: 2048 tokens
- No deduplication beyond source-level filtering

> 💡 **Note**: The tokenizer, training configuration, and data-loading pipeline are provided in the [GitHub repo](https://github.com/xTimeCrystal/MiniModel/tree/main) for full reproducibility.
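
The card does not say which bin-packing algorithm was used, so the following is only a minimal sketch of one common choice, first-fit decreasing, applied to tokenized sequence lengths with the stated 2048-token limit. The function names and the example lengths are hypothetical, and the sketch assumes every sequence already fits within the limit (longer documents would need to be split or truncated first).

```python
from typing import List

MAX_LEN = 2048  # maximum sequence length stated in the dataset card


def pack_first_fit_decreasing(lengths: List[int], max_len: int = MAX_LEN) -> List[List[int]]:
    """Group sequences into bins of at most `max_len` tokens.

    Returns a list of bins; each bin is a list of indices into `lengths`.
    This is first-fit decreasing, an assumed stand-in for the card's
    unspecified bin-packing algorithm.
    """
    for n in lengths:
        if not 0 < n <= max_len:
            raise ValueError(f"sequence length {n} must be in 1..{max_len}")

    # Visit sequences from longest to shortest.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    bins: List[List[int]] = []
    free: List[int] = []  # remaining capacity of each bin

    for i in order:
        for b, room in enumerate(free):
            if lengths[i] <= room:  # first bin with enough room wins
                bins[b].append(i)
                free[b] -= lengths[i]
                break
        else:  # no existing bin fits: open a new one
            bins.append([i])
            free.append(max_len - lengths[i])
    return bins


def padding_fraction(lengths: List[int], bins: List[List[int]], max_len: int = MAX_LEN) -> float:
    """Fraction of token slots across all bins that would be padding."""
    total_slots = len(bins) * max_len
    return 1.0 - sum(lengths) / total_slots


# Hypothetical token counts for five documents.
example_lengths = [1500, 1200, 800, 548, 48]
example_bins = pack_first_fit_decreasing(example_lengths)
print(example_bins, padding_fraction(example_lengths, example_bins))
```

Sorting by length first tends to leave small sequences available to fill the gaps in nearly full bins, which is why this kind of heuristic keeps padding low; the padding rate on real data depends on the length distribution.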

Provider: maas

Created: 2025-09-25
