
harithoppil/minimind_dataset

Source: Hugging Face · Updated 2026-01-30 · Indexed 2026-03-29
Download link: https://hf-mirror.com/datasets/harithoppil/minimind_dataset
---
configs:
- config_name: sft
  default: true
  description: "Supervised Fine-Tuning data (Conversational Schema)"
  data_files:
  - split: english
    path:
    - "sft_mini_512_en.jsonl"
    - "sft_512_en.jsonl"
    - "sft_1024_en.jsonl"
    - "sft_2048_en.jsonl"
    - "sft_data_en.jsonl"
  - split: chinese
    path:
    - "sft_mini_512.jsonl"
    - "sft_512.jsonl"
    - "sft_1024.jsonl"
    - "sft_2048.jsonl"
- config_name: pretrain
  description: "Pretraining data (Text Schema)"
  data_files:
  - split: english
    path: "pretrain_hq_en.jsonl"
  - split: chinese
    path: "pretrain_hq.jsonl"
- config_name: dpo
  description: "Preference Optimization data (Chosen/Rejected Schema)"
  data_files:
  - split: english
    path: "dpo_en.jsonl"
  - split: chinese
    path: "dpo.jsonl"
- config_name: reasoning
  description: "Reasoning / Chain-of-Thought data (Conversational Schema)"
  data_files:
  - split: english
    path: "r1_mix_1024_en.jsonl"
  - split: chinese
    path: "r1_mix_1024.jsonl"
- config_name: rlaif
  description: "RL from AI Feedback data"
  data_files:
  - split: chinese
    path: "rlaif-mini.jsonl"
- config_name: lora
  description: "Domain specific / Identity data for LoRA"
  data_files:
  - split: chinese
    path:
    - "lora_identity.jsonl"
    - "lora_medical.jsonl"
---

<div align="center">

![logo](./images/logo.png)

</div>

<div align="center">

![visitors](https://visitor-badge.laobi.icu/badge?page_id=jingyaogong/minimind)
[![GitHub Repo stars](https://img.shields.io/github/stars/jingyaogong/minimind?style=social)](https://github.com/jingyaogong/minimind/stargazers)
[![GitHub Code License](https://img.shields.io/github/license/jingyaogong/minimind)](LICENSE)
[![GitHub last commit](https://img.shields.io/github/last-commit/jingyaogong/minimind)](https://github.com/jingyaogong/minimind/commits/master)
[![GitHub pull request](https://img.shields.io/badge/PRs-welcome-blue)](https://github.com/jingyaogong/minimind/pulls)
[![Collection](https://img.shields.io/badge/🤗-MiniMind%20%20Collection-blue)](https://huggingface.co/collections/jingyaogong/minimind-66caf8d999f5c7fa64f399e5)

</div>

# 📌 Data Overview

## Ⅰ Tokenizer

A tokenizer maps words from natural language to numbers like `0, 1, 36` through a “vocabulary”. You can think of each number as the page index of a word in a “dictionary”.

You may choose to build your own vocabulary and train a tokenizer. The code can be found in `./scripts/train_tokenizer.py` (for learning reference only; MiniMind already ships with a tokenizer, so there is no need to retrain one unless you have a specific reason).

Alternatively, you can use the tokenizer of a well-known open-source large model, much like using a Xinhua or Oxford dictionary. The advantage is excellent token compression efficiency; the downside is a huge vocabulary, often hundreds of thousands of words and phrases.

For a self-trained tokenizer, the advantage is that the vocabulary size and content can be freely controlled. The disadvantage is very poor compression efficiency (for example, `"hello"` might be split into five separate tokens `"h e l l o"`), and rare words are difficult to cover.

The choice of “dictionary” is certainly important. An LLM's output is essentially a SoftMax multi-class classification over the N tokens in the vocabulary, which is then decoded back into natural language through the “dictionary”. Because MiniMind’s size must be strictly controlled, the vocabulary should be as small as possible to avoid a top-heavy model (one where embedding parameters take up too large a proportion of the LLM).
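
To make the “top-heavy” concern concrete, the sketch below estimates how many parameters the token-embedding table alone would need for a few of the vocabulary sizes quoted in this document. The hidden size of 512 is an assumption chosen purely for illustration (the text does not state it), and the helper name `embedding_params` is hypothetical.

```python
# Rough estimate of embedding-table size for different vocabulary sizes.
# ASSUMPTION: a hidden size of 512 is illustrative; it is not stated in the text.
HIDDEN_SIZE = 512


def embedding_params(vocab_size: int, hidden_size: int = HIDDEN_SIZE) -> int:
    """Number of parameters in a vocab_size x hidden_size embedding matrix."""
    return vocab_size * hidden_size


# Vocabulary sizes as quoted in this document.
VOCAB_SIZES = {
    "minimind": 6_400,
    "mistral": 32_000,
    "llama3": 128_000,
    "qwen2": 151_643,
}

for name, vocab in VOCAB_SIZES.items():
    millions = embedding_params(vocab) / 1e6
    print(f"{name:>8}: {millions:6.1f}M embedding parameters")
```

Under this assumed hidden size, a 151,643-token vocabulary would need roughly 77.6M embedding parameters, about three times the 25.8M total that the text reports for the smallest MiniMind, while 6,400 tokens need about 3.3M.
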
<details style="color:rgb(128,128,128)">
<summary>Tokenizer Introduction</summary>

The vocabulary sizes of some powerful third-party open-source model tokenizers such as Yi, Qwen, ChatGLM, Mistral, and Llama3 are as follows:

<table>
  <tr><th>Tokenizer Model</th><th>Vocabulary Size</th><th>Source</th></tr>
  <tr><td>yi tokenizer</td><td>64,000</td><td>01.ai (China)</td></tr>
  <tr><td>qwen2 tokenizer</td><td>151,643</td><td>Alibaba Cloud (China)</td></tr>
  <tr><td>glm tokenizer</td><td>151,329</td><td>Zhipu AI (China)</td></tr>
  <tr><td>mistral tokenizer</td><td>32,000</td><td>Mistral AI (France)</td></tr>
  <tr><td>llama3 tokenizer</td><td>128,000</td><td>Meta (USA)</td></tr>
  <tr><td>minimind tokenizer</td><td>6,400</td><td>Custom</td></tr>
</table>

> 👉 Update 2024-09-17: To avoid ambiguity in previous versions and control model size, all MiniMind models now use the minimind_tokenizer. All mistral_tokenizer versions have been deprecated.

```text
# Some self-talk
> Although minimind_tokenizer has a very small vocabulary size, its encoding/decoding efficiency is weaker than Chinese-friendly tokenizers such as qwen2 and glm.
> However, MiniMind chooses its self-trained minimind_tokenizer to keep overall parameters lightweight and avoid imbalance between the embedding layer and computation layers (a top-heavy model), because MiniMind’s vocabulary size is only 6,400.
> In actual testing, MiniMind has never encountered decoding failures for rare words, and the results are good.
> By compressing the custom vocabulary size to 6,400, the total parameter count of the LLM can be as low as 25.8M.
> The training data `tokenizer_train.jsonl` all comes from the `Jiangshu Large Model Dataset`. This part of the data is relatively secondary; if you need to train, you may freely choose your own dataset.
```

</details>

## Ⅱ Pretrain Data

After the lesson of MiniMind-V1, where low-quality pretraining data caused the model to produce nonsense, it was decided after `2025-02-05` to stop using large-scale unsupervised datasets for pretraining.

Instead, the Chinese portion of the [Jiangshu Large Model Dataset](https://www.modelscope.cn/datasets/deepctrl/deepctrl-sft-data) was extracted, cleaned, and filtered to samples with character length `<512`. About 1.6GB of corpus was concatenated directly into the pretraining dataset `pretrain_hq.jsonl`, where `hq` stands for high quality (though it is still not truly “high”; improving data quality is never finished).

The file `pretrain_hq.jsonl` has the following data format:

```json
{"text": "How can one get rid of procrastination? Curing procrastination is not easy, but the following suggestions may help..."}
```

## Ⅲ SFT Data

[Jiangshu Large Model SFT Dataset](https://www.modelscope.cn/datasets/deepctrl/deepctrl-sft-data)

“This is a complete, well-formatted, and safe resource for large-model training and research. It collects and organizes a large number of open-source datasets from public online sources, unifies their formats, and performs data cleaning. It contains a 10M-sample Chinese dataset and a 2M-sample English dataset.”

The above is the official description. After downloading, the total data volume is about 4B tokens, which is certainly suitable as SFT data for a Chinese LLM. However, the official data format is very messy, and using all of it for SFT would be too costly.

I performed a second round of cleaning on the official dataset, removing entries with symbol pollution and noise, and retained only content with total length `<512`. At this stage, the goal is to supplement the knowledge missing from the pretraining stage through a large number of conversations. The exported file is `sft_512.jsonl` (~7.5GB).
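
The length filtering described above can be sketched roughly as follows. This is a minimal illustration and not the author's actual cleaning script: it assumes records follow the `conversations` schema used by the `sft_X.jsonl` files, and it assumes “total length” means the summed character count of all turns (the text does not say whether length is measured in characters or tokens). The function names are hypothetical.

```python
import json
from typing import Iterable, Iterator


def total_length(sample: dict) -> int:
    """Sum of character counts over all turns in one conversation sample."""
    return sum(len(turn["content"]) for turn in sample["conversations"])


def filter_by_length(lines: Iterable[str], max_len: int = 512) -> Iterator[dict]:
    """Yield parsed JSONL records whose total length is below max_len."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        sample = json.loads(line)
        if total_length(sample) < max_len:
            yield sample


# Two in-memory JSONL lines: one short conversation, one over the limit.
demo = [
    json.dumps({"conversations": [
        {"role": "user", "content": "Hello"},
        {"role": "assistant", "content": "Hello!"},
    ]}),
    json.dumps({"conversations": [
        {"role": "user", "content": "x" * 600},
        {"role": "assistant", "content": "too long"},
    ]}),
]
kept = list(filter_by_length(demo))
print(len(kept))  # 1
```

To filter a real file, pass an open file handle as `lines`; the generator processes one record at a time, which matters for multi-gigabyte JSONL files.
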
[Magpie-SFT Dataset](https://www.modelscope.cn/organization/Magpie-Align)

This dataset collects ~1M high-quality conversations from Qwen2/2.5. I further cleaned this portion and exported samples with total length `<2048` to `sft_2048.jsonl` (~9GB). Samples with length `<1024` were exported to `sft_1024.jsonl` (~5.5GB). Using large-model dialogue data directly for SFT falls into the category of “black-box distillation”.

Further cleaning of the first two SFT datasets (keeping only content with a high proportion of Chinese characters) and filtering conversations with length `<512` yields `sft_mini_512.jsonl` (~1.2GB).

All SFT files `sft_X.jsonl` share the following data format:

```text
{
    "conversations": [
        {"role": "user", "content": "Hello"},
        {"role": "assistant", "content": "Hello!"},
        {"role": "user", "content": "Goodbye"},
        {"role": "assistant", "content": "Goodbye!"}
    ]
}
```

## Ⅳ RLHF Data

From the [Magpie-DPO Dataset](https://www.modelscope.cn/datasets/Magpie-Align/MagpieLM-DPO-Data-v0.1): approximately 200k preference samples (all in English) generated by Llama3.1-70B/8B, which can be used to train a reward model and optimize response quality to better align with human preferences.

Samples with total length `<3000` were reorganized into `dpo.jsonl` (~0.9GB), containing two fields: `chosen` and `rejected`. `chosen` is the preferred response, while `rejected` is the rejected response.

The file `dpo.jsonl` has the following format:

```text
{
  "chosen": [
    {"content": "Q", "role": "user"},
    {"content": "good answer", "role": "assistant"}
  ],
  "rejected": [
    {"content": "Q", "role": "user"},
    {"content": "bad answer", "role": "assistant"}
  ]
}
```

## Ⅴ Reasoning Dataset

It has to be said that in February 2025, nothing was hotter than DeepSeek... This also sparked my strong interest in RL-guided reasoning models. I have already reproduced R1-Zero using Qwen2.5.
If time allows and the results work (though there is a 99% chance the base model's capability is insufficient), I will later update MiniMind with an RL-trained reasoning model rather than a distilled one.

Given limited time, the fastest low-cost solution is still direct distillation (the black-box approach). R1 became so popular that within just a few days several R1 distillation datasets had already appeared, such as [R1-Llama-70B](https://www.modelscope.cn/datasets/Magpie-Align/Magpie-Reasoning-V2-250K-CoT-Deepseek-R1-Llama-70B), [R1-Distill-SFT](https://www.modelscope.cn/datasets/AI-ModelScope/R1-Distill-SFT), [Alpaca-Distill-R1](https://huggingface.co/datasets/shareAI/Alpaca-Distill-R1-ZH), [deepseek_r1_zh](https://huggingface.co/datasets/jinliuxi/deepseek_r1_zh), etc. Pure Chinese data is relatively scarce.

These were ultimately merged and exported as `r1_mix_1024.jsonl`, with the same data format as `sft_X.jsonl`.

## Ⅵ More Datasets

Currently, [HqWu-HITCS/Awesome-Chinese-LLM](https://github.com/HqWu-HITCS/Awesome-Chinese-LLM) collects and organizes open-source Chinese LLM-related models, applications, datasets, and tutorials, and keeps up with the latest progress in this area. Comprehensive and professional. Respect!

---

## Ⅶ Dataset Download

> [!NOTE]
> After 2025-02-05, all datasets used for the final training of open-source MiniMind are provided, so there is no need to preprocess large-scale datasets yourself, avoiding redundant data processing work.

MiniMind training datasets ([ModelScope](https://www.modelscope.cn/datasets/gongjy/minimind-dataset/files) | [HuggingFace](https://huggingface.co/datasets/jingyaogong))

> No need to clone everything; you can download only the files you need.
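
As a rough illustration of fetching individual files rather than cloning a whole repository, the sketch below builds direct-download URLs with the Hugging Face Hub `resolve` URL pattern and saves files into `./dataset/`. The repository id is taken from this page's title, and the `main` revision is an assumption; substitute whichever source you actually use. The helper names `resolve_url` and `download` are hypothetical, and the download call is left commented out because it needs network access.

```python
import urllib.request
from pathlib import Path

# ASSUMPTION: repository id taken from this page's title; adjust as needed.
REPO_ID = "harithoppil/minimind_dataset"


def resolve_url(filename: str, repo_id: str = REPO_ID, revision: str = "main") -> str:
    """Build the direct-download URL for one file in a Hub dataset repo."""
    return f"https://huggingface.co/datasets/{repo_id}/resolve/{revision}/{filename}"


def download(filename: str, target_dir: str = "./dataset") -> Path:
    """Download a single file into target_dir (requires network access)."""
    out_dir = Path(target_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / filename
    urllib.request.urlretrieve(resolve_url(filename), out_path)
    return out_path


# The two files this document marks as recommended.
for name in ("pretrain_hq.jsonl", "sft_mini_512.jsonl"):
    print(resolve_url(name))
    # download(name)  # uncomment to actually fetch the file
```

If the `huggingface_hub` package is installed, its `hf_hub_download` function with `repo_type="dataset"` is a common alternative that adds caching.
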
Place the downloaded dataset files into the `./dataset/` directory (✨ indicates recommended required items):

```bash
./dataset/
├── dpo.jsonl (909MB)
├── lora_identity.jsonl (22.8KB)
├── lora_medical.jsonl (34MB)
├── pretrain_hq.jsonl (1.6GB, ✨)
├── r1_mix_1024.jsonl (340MB)
├── sft_1024.jsonl (5.6GB)
├── sft_2048.jsonl (9GB)
├── sft_512.jsonl (7.5GB)
├── sft_mini_512.jsonl (1.2GB, ✨)
└── tokenizer_train.jsonl (1GB)
```

<details style="color:rgb(128,128,128)">
<summary>Note: Dataset Descriptions</summary>

* `dpo.jsonl`: RLHF stage dataset
* `lora_identity.jsonl`: self-identity dataset (e.g., “Who are you?” “I am MiniMind…”), recommended for LoRA training (can also be used for full-parameter SFT)
* `lora_medical.jsonl`: medical Q&A dataset, recommended for LoRA training (can also be used for full-parameter SFT)
* `pretrain_hq.jsonl` ✨: pretraining dataset, integrated from Jiangshu Technology
* `r1_mix_1024.jsonl`: DeepSeek-R1-1.5B distilled dataset, max length per sample is 1024 (set `max_seq_len=1024` during training)
* `sft_1024.jsonl`: integrated from Qwen2.5 distillation data (subset of sft_2048), max length per sample is 1024
* `sft_2048.jsonl`: integrated from Qwen2.5 distillation data, max length per sample is 2048
* `sft_512.jsonl`: integrated from Jiangshu Technology SFT data, max length per sample is 512
* `sft_mini_512.jsonl` ✨: minimal integration of Jiangshu Technology SFT data + Qwen2.5 distillation data (for fast Zero-model training), max length per sample is 512
* `tokenizer_train.jsonl`: all from the `Jiangshu Large Model Dataset`; relatively secondary (retraining the tokenizer is not recommended, as explained above)

</details>

<details style="color:rgb(128,128,128)">
<summary>Notes & Recommended Training Plans</summary>

* All MiniMind2 Series models were trained on a total of ~20GB of corpus, about 4B tokens, corresponding to the data combination above (Cost: 💰💰💰💰💰💰💰💰, Effect: 😊😊😊😊😊😊)
* To achieve a Zero model from scratch as fast as possible, it is recommended to use the data combination `pretrain_hq.jsonl` + `sft_mini_512.jsonl`. Refer to the table below for specific cost and effectiveness (Cost: 💰, Effect: 😊😊)
* Users with sufficient compute resources or who care more about performance are recommended to fully reproduce MiniMind2; users with only a single GPU or who care about fast reproduction in a short time are strongly recommended to use the latter approach
* [Compromise Plan] You may also freely combine medium-scale datasets such as `sft_mini_512.jsonl` and `sft_1024.jsonl` for training (Cost: 💰💰💰, Effect: 😊😊😊😊).

</details>
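
Before starting any of the plans above, it can help to confirm that each downloaded file matches the record schema described for its stage. The sketch below is one hypothetical way to do that: it classifies a parsed JSONL record by its top-level keys, using the three formats shown in this document (`text`, `conversations`, and `chosen`/`rejected`). The function name `detect_schema` and the returned labels are illustrative and not part of MiniMind.

```python
import json


def detect_schema(record: dict) -> str:
    """Classify a record as 'pretrain', 'sft', 'dpo', or 'unknown'."""
    if isinstance(record.get("text"), str):
        return "pretrain"
    if isinstance(record.get("conversations"), list):
        return "sft"  # also matches r1_mix_1024.jsonl, which shares this format
    if isinstance(record.get("chosen"), list) and isinstance(record.get("rejected"), list):
        return "dpo"
    return "unknown"


# One example line per format, mirroring the samples in this document.
samples = [
    '{"text": "How can one get rid of procrastination?"}',
    '{"conversations": [{"role": "user", "content": "Hello"}]}',
    '{"chosen": [{"role": "user", "content": "Q"}], '
    '"rejected": [{"role": "user", "content": "Q"}]}',
]
for line in samples:
    print(detect_schema(json.loads(line)))
```

Checking only the first few lines of each file is usually enough to catch a file placed in the wrong stage.
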
Provider: harithoppil