harithoppil/minimind_dataset
Source: Hugging Face · Updated: 2026-01-30 · Indexed: 2026-03-29
Download link: https://hf-mirror.com/datasets/harithoppil/minimind_dataset
Resource description:
---
configs:
- config_name: sft
  default: true
  description: "Supervised Fine-Tuning data (Conversational Schema)"
  data_files:
  - split: english
    path:
    - "sft_mini_512_en.jsonl"
    - "sft_512_en.jsonl"
    - "sft_1024_en.jsonl"
    - "sft_2048_en.jsonl"
    - "sft_data_en.jsonl"
  - split: chinese
    path:
    - "sft_mini_512.jsonl"
    - "sft_512.jsonl"
    - "sft_1024.jsonl"
    - "sft_2048.jsonl"
- config_name: pretrain
  description: "Pretraining data (Text Schema)"
  data_files:
  - split: english
    path: "pretrain_hq_en.jsonl"
  - split: chinese
    path: "pretrain_hq.jsonl"
- config_name: dpo
  description: "Preference Optimization data (Chosen/Rejected Schema)"
  data_files:
  - split: english
    path: "dpo_en.jsonl"
  - split: chinese
    path: "dpo.jsonl"
- config_name: reasoning
  description: "Reasoning / Chain-of-Thought data (Conversational Schema)"
  data_files:
  - split: english
    path: "r1_mix_1024_en.jsonl"
  - split: chinese
    path: "r1_mix_1024.jsonl"
- config_name: rlaif
  description: "RL from AI Feedback data"
  data_files:
  - split: chinese
    path: "rlaif-mini.jsonl"
- config_name: lora
  description: "Domain specific / Identity data for LoRA"
  data_files:
  - split: chinese
    path:
    - "lora_identity.jsonl"
    - "lora_medical.jsonl"
---
# 📌 Data Overview
## Ⅰ Tokenizer
A tokenizer maps words from natural language to numbers like `0, 1, 36` through a “vocabulary”. You can think of each number as the page index of a word in a “dictionary”.
You may choose to build your own vocabulary and train a tokenizer. The code can be found in `./scripts/train_tokenizer.py` (for learning reference only; unless necessary, there is no need to retrain one yourself, as MiniMind already comes with a tokenizer).
Alternatively, you can choose well-known open-source large-model tokenizers.
Just like using a Xinhua or Oxford dictionary: the advantage is excellent token compression efficiency, but the downside is that the vocabulary size is huge, often reaching hundreds of thousands of words and phrases.
For a self-trained tokenizer, the advantage is that the vocabulary size and content can be freely controlled, but the disadvantage is very poor compression efficiency (for example, `"hello"` might be split into five separate tokens `"h e l l o"`), and rare words are difficult to cover.
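The compression trade-off above can be illustrated with a toy greedy longest-match tokenizer. This is a minimal sketch for intuition only: it is not the algorithm used by `./scripts/train_tokenizer.py`, and the function name `greedy_tokenize` and both vocabularies are hypothetical.

```python
def greedy_tokenize(text, vocab):
    """Split text into the longest matching vocabulary entries, left to right."""
    max_len = max(len(token) for token in vocab)
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first, then shrink until one is in the vocabulary.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab:
                tokens.append(piece)
                i += size
                break
        else:
            # No entry of any length matches: the vocabulary cannot cover this text.
            raise ValueError(f"no vocabulary entry covers {text[i]!r}")
    return tokens


# Hypothetical vocabularies: one contains whole words, the other only single characters.
large_vocab = {"hello", "world", " ", "h", "e", "l", "o", "w", "r", "d"}
tiny_vocab = {" ", "h", "e", "l", "o", "w", "r", "d"}

print(greedy_tokenize("hello world", large_vocab))      # ['hello', ' ', 'world']
print(len(greedy_tokenize("hello world", tiny_vocab)))  # 11
```

The same string costs 3 tokens with the larger vocabulary and 11 with the character-only one, which is the compression gap described above.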
The choice of “dictionary” matters. An LLM's output step is essentially multi-class classification: a softmax over the N vocabulary tokens selects a token ID, which is then decoded back into natural language through the “dictionary”.
Because MiniMind’s size must be strictly controlled, in order to avoid a top-heavy model (where embedding parameters take up too large a proportion of the LLM), the vocabulary size should be as small as possible.
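To make the "top-heavy" concern concrete, the sketch below counts the parameters of a single token-embedding table for several vocabulary sizes. The hidden size of 512 is an assumption chosen for illustration (this card does not state MiniMind's hidden size), and the count ignores any separate output projection.

```python
def embedding_params(vocab_size, hidden_size):
    """Parameters in one token-embedding table: one vector per vocabulary entry."""
    return vocab_size * hidden_size


HIDDEN_SIZE = 512  # assumption for illustration, not a value taken from this card

# Vocabulary sizes for a few well-known tokenizers and for minimind.
vocab_sizes = {
    "minimind": 6_400,
    "mistral": 32_000,
    "llama3": 128_000,
    "qwen2": 151_643,
}

for name, vocab_size in vocab_sizes.items():
    params = embedding_params(vocab_size, HIDDEN_SIZE)
    print(f"{name:<10} {vocab_size:>8,} tokens -> {params / 1e6:6.2f}M embedding parameters")
```

Under this assumption the 6,400-entry vocabulary needs about 3.3M embedding parameters, while a qwen2-sized vocabulary would need about 77.6M for the embedding table alone.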
<details style="color:rgb(128,128,128)">
<summary>Tokenizer Introduction</summary>
The vocabulary sizes of some powerful third-party open-source model tokenizers such as Yi, Qwen, ChatGLM, Mistral, and Llama3 are as follows:
<table>
<tr><th>Tokenizer Model</th><th>Vocabulary Size</th><th>Source</th></tr>
<tr><td>yi tokenizer</td><td>64,000</td><td>01.ai (China)</td></tr>
<tr><td>qwen2 tokenizer</td><td>151,643</td><td>Alibaba Cloud (China)</td></tr>
<tr><td>glm tokenizer</td><td>151,329</td><td>Zhipu AI (China)</td></tr>
<tr><td>mistral tokenizer</td><td>32,000</td><td>Mistral AI (France)</td></tr>
<tr><td>llama3 tokenizer</td><td>128,000</td><td>Meta (USA)</td></tr>
<tr><td>minimind tokenizer</td><td>6,400</td><td>Custom</td></tr>
</table>
> 👉 Update 2024-09-17: To avoid ambiguity in previous versions and control model size, all MiniMind models now use the minimind_tokenizer. All mistral_tokenizer versions have been deprecated.
```text
# Some self-talk
> minimind_tokenizer has a very small vocabulary, so its encoding/decoding efficiency is weaker than that of Chinese-friendly tokenizers such as qwen2 and glm.
> MiniMind still uses its self-trained minimind_tokenizer, with a vocabulary of only 6,400, to keep the overall parameter count lightweight and avoid an imbalance between the embedding layer and the computation layers (a top-heavy model).
> In actual testing, MiniMind has never encountered decoding failures on rare words, and the results are good.
> With the custom vocabulary compressed to 6,400 entries, the total parameter count of the LLM can be as low as 25.8M.
> The training data `tokenizer_train.jsonl` comes entirely from the `Jiangshu Large Model Dataset`. This data is of secondary importance; if you do train a tokenizer, feel free to choose your own dataset.
```
</details>
## Ⅱ Pretrain Data
MiniMind-V1 showed that low-quality pretraining data makes the model produce nonsense, so since `2025-02-05` large-scale unsupervised datasets are no longer used for pretraining.
Instead, the Chinese portion of the [Jiangshu Large Model Dataset](https://www.modelscope.cn/datasets/deepctrl/deepctrl-sft-data) was extracted, cleaned, and filtered to samples with character length `<512`. About 1.6GB of this corpus was concatenated directly into the pretraining dataset `pretrain_hq.jsonl`, where `hq` stands for high quality (it is still not truly “high”; there is always room to improve data quality).
The file `pretrain_hq.jsonl` has the following data format:
```json
{"text": "How can one get rid of procrastination? Curing procrastination is not easy, but the following suggestions may help..."}
```
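The length filter and the one-object-per-line format described above might look roughly like the sketch below. The author's actual cleaning script is not shown in this card, so the function name `to_pretrain_lines`, the whitespace stripping, and the sample inputs are assumptions.

```python
import json

MAX_CHARS = 512  # the character-length threshold stated above


def to_pretrain_lines(texts, max_chars=MAX_CHARS):
    """Yield one JSON line per non-empty text shorter than max_chars characters."""
    for text in texts:
        text = text.strip()
        if text and len(text) < max_chars:
            # ensure_ascii=False keeps Chinese characters readable in the output file.
            yield json.dumps({"text": text}, ensure_ascii=False)


# Hypothetical inputs: one short sample, one that is too long, one that is blank.
samples = ["How can one get rid of procrastination? ...", "x" * 600, "   "]
lines = list(to_pretrain_lines(samples))
print(len(lines))  # 1
print(lines[0])
```

Writing the yielded lines to a file, one per line, produces a `.jsonl` file in the same shape as `pretrain_hq.jsonl`.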
## Ⅲ SFT Data
[Jiangshu Large Model SFT Dataset](https://www.modelscope.cn/datasets/deepctrl/deepctrl-sft-data)
“This is a complete, well-formatted, and safe resource for large-model training and research.
It collects and organizes a large number of open-source datasets from public online sources, unifies their formats, and performs data cleaning.
It contains a 10M-sample Chinese dataset and a 2M-sample English dataset.”
The above is the official description. After downloading, the total data volume is about 4B tokens, which is certainly suitable as SFT data for a Chinese LLM. However, the official data format is very messy, and using all of it for SFT would be too costly.
I performed a second round of cleaning on the official dataset, removing entries with symbol pollution and noise. In addition, only content with total length `<512` was retained. At this stage, the goal is to supplement the knowledge lacking in the pretraining stage through a large number of conversations. The exported file is `sft_512.jsonl` (~7.5GB).
[Magpie-SFT Dataset](https://www.modelscope.cn/organization/Magpie-Align)
This dataset collects ~1M high-quality conversations from Qwen2/2.5. I further cleaned this portion and exported samples with total length `<2048` to `sft_2048.jsonl` (~9GB). Samples with length `<1024` were exported to `sft_1024.jsonl` (~5.5GB). Using large-model dialogue data directly for SFT falls into the category of “black-box distillation”.
The two SFT datasets above were then cleaned further (keeping only content with a high proportion of Chinese characters), and only conversations with length `<512` were kept, which yields `sft_mini_512.jsonl` (~1.2GB).
All SFT files `sft_X.jsonl` share the following data format:
```text
{
    "conversations": [
        {"role": "user", "content": "Hello"},
        {"role": "assistant", "content": "Hello!"},
        {"role": "user", "content": "Goodbye"},
        {"role": "assistant", "content": "Goodbye!"}
    ]
}
```
```
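A record in this format can be checked before training with a small validator such as the sketch below. It assumes strict user/assistant alternation, as in the example above, and measures "total length" as the number of characters summed over all `content` fields; both are assumptions, and the helper names are hypothetical.

```python
import json


def is_valid_conversation(sample):
    """Return True if turns alternate user, assistant, ... and end with the assistant."""
    turns = sample.get("conversations", [])
    if not turns or len(turns) % 2 != 0:
        return False
    expected_roles = ("user", "assistant")
    return all(
        turn.get("role") == expected_roles[i % 2] and isinstance(turn.get("content"), str)
        for i, turn in enumerate(turns)
    )


def total_length(sample):
    """Total number of characters across all turns (an assumed definition of length)."""
    return sum(len(turn["content"]) for turn in sample["conversations"])


line = (
    '{"conversations": [{"role": "user", "content": "Hello"}, '
    '{"role": "assistant", "content": "Hello!"}]}'
)
sample = json.loads(line)
print(is_valid_conversation(sample), total_length(sample))  # True 11
```

A filter such as `is_valid_conversation(s) and total_length(s) < 512` would then reproduce the kind of length cut described for the `sft_X.jsonl` files, under the stated assumptions.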
## Ⅳ RLHF Data
The RLHF data comes from the [Magpie-DPO Dataset](https://www.modelscope.cn/datasets/Magpie-Align/MagpieLM-DPO-Data-v0.1).
It contains approximately 200k preference samples (all in English) generated by Llama3.1-70B/8B, which can be used to train a reward model and optimize response quality so that it aligns better with human preferences.
Samples with total length `<3000` were reorganized into `dpo.jsonl` (~0.9GB), which contains two fields: `chosen` holds the preferred response and `rejected` holds the dispreferred one.
The file `dpo.jsonl` has the following format:
```text
{
    "chosen": [
        {"content": "Q", "role": "user"},
        {"content": "good answer", "role": "assistant"}
    ],
    "rejected": [
        {"content": "Q", "role": "user"},
        {"content": "bad answer", "role": "assistant"}
    ]
}
```
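Preference-optimization code often wants a shared prompt plus the two competing answers. The sketch below splits one record of the format above into those three parts. It assumes both lists share every message except the final assistant reply, which holds for the example but is not guaranteed by this card; `split_preference_pair` is a hypothetical helper name.

```python
def split_preference_pair(record):
    """Return (prompt_messages, chosen_text, rejected_text) for one preference record."""
    chosen, rejected = record["chosen"], record["rejected"]
    # Everything before the last message is treated as the shared prompt.
    if chosen[:-1] != rejected[:-1]:
        raise ValueError("chosen and rejected do not share the same prompt")
    if chosen[-1]["role"] != "assistant" or rejected[-1]["role"] != "assistant":
        raise ValueError("the last message on each side must come from the assistant")
    return chosen[:-1], chosen[-1]["content"], rejected[-1]["content"]


record = {
    "chosen": [
        {"content": "Q", "role": "user"},
        {"content": "good answer", "role": "assistant"},
    ],
    "rejected": [
        {"content": "Q", "role": "user"},
        {"content": "bad answer", "role": "assistant"},
    ],
}

prompt, good, bad = split_preference_pair(record)
print(prompt, good, bad)
```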
## Ⅴ Reasoning Dataset
It has to be said that in February 2025, nothing was hotter than DeepSeek...
This also sparked my strong interest in RL-guided reasoning models. I have already reproduced R1-Zero using Qwen2.5.
If time allows and the results work (though there is a 99% chance the base model capability is insufficient), I will later update MiniMind with an RL-trained reasoning model rather than a distilled one.
Given limited time, the fastest low-cost solution is still direct distillation (black-box approach).
R1 became so popular that several R1 distillation datasets appeared within just a few days, such as
[R1-Llama-70B](https://www.modelscope.cn/datasets/Magpie-Align/Magpie-Reasoning-V2-250K-CoT-Deepseek-R1-Llama-70B),
[R1-Distill-SFT](https://www.modelscope.cn/datasets/AI-ModelScope/R1-Distill-SFT),
[Alpaca-Distill-R1](https://huggingface.co/datasets/shareAI/Alpaca-Distill-R1-ZH),
[deepseek_r1_zh](https://huggingface.co/datasets/jinliuxi/deepseek_r1_zh), etc. Pure Chinese data is relatively scarce.
These were ultimately merged and exported as `r1_mix_1024.jsonl`, with the same data format as `sft_X.jsonl`.
## Ⅵ More Datasets
Currently, [HqWu-HITCS/Awesome-Chinese-LLM](https://github.com/HqWu-HITCS/Awesome-Chinese-LLM)
is collecting and organizing open-source Chinese LLM-related models, applications, datasets, and tutorials, and continuously updating the latest progress in this area. Comprehensive and professional. Respect!
---
## Ⅶ Dataset Download
> [!NOTE]
> After 2025-02-05, all datasets used for the final training of open-source MiniMind are provided, so there is no need to preprocess large-scale datasets yourself, avoiding redundant data processing work.
MiniMind training datasets
([ModelScope](https://www.modelscope.cn/datasets/gongjy/minimind-dataset/files) | [HuggingFace](https://huggingface.co/datasets/jingyaogong))
> No need to clone everything; you can download only the files you need.
Place the downloaded dataset files in the `./dataset/` directory (✨ marks the recommended, required files):
```text
./dataset/
├── dpo.jsonl (909MB)
├── lora_identity.jsonl (22.8KB)
├── lora_medical.jsonl (34MB)
├── pretrain_hq.jsonl (1.6GB, ✨)
├── r1_mix_1024.jsonl (340MB)
├── sft_1024.jsonl (5.6GB)
├── sft_2048.jsonl (9GB)
├── sft_512.jsonl (7.5GB)
├── sft_mini_512.jsonl (1.2GB, ✨)
└── tokenizer_train.jsonl (1GB)
```
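Downloading only the files you need can be scripted with `hf_hub_download` from the `huggingface_hub` package. The sketch below is one possible approach, not the author's procedure: it assumes `huggingface_hub` is installed and that the files are fetched from the `harithoppil/minimind_dataset` repository this card describes; `download_files` is a hypothetical helper name.

```python
from pathlib import Path

REPO_ID = "harithoppil/minimind_dataset"  # assumption: the repository this card describes
RECOMMENDED_FILES = ["pretrain_hq.jsonl", "sft_mini_512.jsonl"]  # the files marked with ✨


def download_files(filenames, target_dir="./dataset"):
    """Download the selected files from the dataset repository into target_dir."""
    # Imported lazily so this module can be loaded without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    Path(target_dir).mkdir(parents=True, exist_ok=True)
    return [
        hf_hub_download(
            repo_id=REPO_ID,
            filename=name,
            repo_type="dataset",  # required because this is a dataset repo, not a model repo
            local_dir=target_dir,
        )
        for name in filenames
    ]
```

Calling `download_files(RECOMMENDED_FILES)` would fetch the two ✨ files, roughly 2.8GB in total according to the sizes listed above.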
<details style="color:rgb(128,128,128)">
<summary>Note: Dataset Descriptions</summary>
* `dpo.jsonl` — RLHF stage dataset
* `lora_identity.jsonl` — self-identity dataset (e.g., “Who are you?” “I am MiniMind…”), recommended for LoRA training (can also be used for full-parameter SFT)
* `lora_medical.jsonl` — medical Q&A dataset, recommended for LoRA training (can also be used for full-parameter SFT)
* `pretrain_hq.jsonl` ✨ — pretraining dataset, integrated from Jiangshu Technology
* `r1_mix_1024.jsonl` — DeepSeek-R1-1.5B distilled dataset, max length per sample is 1024 (set `max_seq_len=1024` during training)
* `sft_1024.jsonl` — integrated from Qwen2.5 distillation data (subset of sft_2048), max length per sample is 1024
* `sft_2048.jsonl` — integrated from Qwen2.5 distillation data, max length per sample is 2048
* `sft_512.jsonl` — integrated from Jiangshu Technology SFT data, max length per sample is 512
* `sft_mini_512.jsonl` ✨ — minimal integration of Jiangshu Technology SFT data + Qwen2.5 distillation data (for fast Zero-model training), max length per sample is 512
* `tokenizer_train.jsonl` — all from the `Jiangshu Large Model Dataset`; relatively secondary (retraining the tokenizer is not recommended, as explained above)
</details>
<details style="color:rgb(128,128,128)">
<summary>Notes & Recommended Training Plans</summary>
* All MiniMind2 Series models were trained on a total of ~20GB of corpus, about 4B tokens, corresponding to the data combination above (Cost: 💰💰💰💰💰💰💰💰, Effect: 😊😊😊😊😊😊)
* To achieve a Zero model from scratch as fast as possible, it is recommended to use the data combination `pretrain_hq.jsonl` + `sft_mini_512.jsonl`. Refer to the table below for specific cost and effectiveness (Cost: 💰, Effect: 😊😊)
* If you have ample compute or care most about performance, fully reproduce MiniMind2 with the complete data combination; if you have only a single GPU or want a quick reproduction, the fast `pretrain_hq.jsonl` + `sft_mini_512.jsonl` combination is strongly recommended
* [Compromise Plan] You may also freely combine medium-scale datasets such as `sft_mini_512.jsonl` and `sft_1024.jsonl` for training (Cost: 💰💰💰, Effect: 😊😊😊😊).
</details>
Provider: harithoppil



