RexiaAI/rexia-synthetic-pretrain-1m

Name: RexiaAI/rexia-synthetic-pretrain-1m
Creator: RexiaAI
Published: 2026-03-08 15:50:34
License: Apache 2.0

Hugging Face | Updated 2026-03-08 | Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/RexiaAI/rexia-synthetic-pretrain-1m

Resource overview:

---
license: apache-2.0
language:
- en
tags:
- instruction-tuning
- chat
- synthetic
- conversational
- pretrain
size_categories:
- 1M<n<10M
task_categories:
- text-generation
dataset_info:
  features:
  - name: text
    dtype: string
  - name: source
    dtype: string
  splits:
  - name: train
    num_bytes: 1147064728
    num_examples: 984372
  download_size: 680967212
  dataset_size: 1147064728
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

# Rexia Synthetic Pretrain 1M

A synthetic instruction-following dataset of ~984k cleaned, deduplicated conversations generated using Ministral-3B (Apache 2.0 licence) across 9 diverse categories.

Designed as a large-scale pre-training / continual pre-training corpus for small language models, providing high-quality, stylistically varied data that avoids the GPT-4 stylistic bias common in datasets such as OpenHermes and SlimOrca.

This is the larger companion to [RexiaAI/rexia-synthetic-chat-500k](https://huggingface.co/datasets/RexiaAI/rexia-synthetic-chat-500k), roughly doubling the sample count.

---

## Dataset Details

| Category | Samples |
|---|---|
| Conversational / advice / opinions | ~205,000 |
| Factual Q&A | ~195,000 |
| Concept explanation | ~160,000 |
| Step-by-step reasoning | ~145,000 |
| Mathematics word problems | ~89,000 |
| Creative writing | ~79,000 |
| Comparison / analysis | ~70,000 |
| Multi-turn dialogue | ~40,000 |
| Coding (Python, JS, SQL, Bash, etc.) | ~1,800 |
| **Total** | **~984,000** |

---

## Format

Each sample contains a `text` field formatted for instruction tuning:

```
<|user|>
{question}
<|assistant|>
{answer}<|end|>
```

Multi-turn samples include multiple exchanges:

```
<|user|>
{question_1}
<|assistant|>
{answer_1}<|end|>
<|user|>
{question_2}
<|assistant|>
{answer_2}<|end|>
```

A `source` field identifies the category (e.g. `synthetic_coding`, `synthetic_factual`).
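The `text` layout described in the Format section can be split back into question/answer pairs with a small regular expression. The following is a minimal sketch, not part of the dataset tooling: the function name `parse_turns` and the sample string are hypothetical, and because the card does not specify the exact whitespace around the tags, the sketch simply strips it.

```python
import re

# Assumption: every turn follows <|user|> ... <|assistant|> ... <|end|>,
# as shown in the Format section. Whitespace around tags is stripped
# because the card does not state it precisely.
TURN_RE = re.compile(
    r"<\|user\|>(?P<user>.*?)<\|assistant\|>(?P<assistant>.*?)<\|end\|>",
    re.DOTALL,
)


def parse_turns(text):
    """Return a list of (question, answer) pairs found in one `text` sample."""
    return [
        (m.group("user").strip(), m.group("assistant").strip())
        for m in TURN_RE.finditer(text)
    ]


# Hypothetical two-turn sample, written by hand for illustration only.
sample = (
    "<|user|>\nWhat is 2 + 2?\n<|assistant|>\n4<|end|>\n"
    "<|user|>\nAnd doubled?\n<|assistant|>\n8<|end|>"
)

for question, answer in parse_turns(sample):
    print(f"Q: {question} | A: {answer}")
```

A single-turn sample yields a one-element list, and text without the tags yields an empty list, which makes malformed rows easy to spot.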
---

## Generation

- **Generator model:** `ministral-3b` via Ollama (Apache 2.0 licence)
- **Parallelism:** Multiple concurrent workers
- **Cleaning pipeline:**
  - Encoding artefact removal
  - Quality filtering (minimum length, refusal detection, alpha-character ratio, repetition check)
  - Exact hash deduplication
  - MinHash LSH near-deduplication (Jaccard threshold 0.82, 128 permutations, 5-gram shingles)

---

## Intended Use

Pre-training or continual pre-training of small language models (100M–3B parameters) for general instruction following and conversational ability. The large volume and diverse category coverage help models learn broad knowledge and varied response styles before targeted fine-tuning.

---

## Relationship to rexia-synthetic-chat-500k

| Dataset | Samples | Primary use |
|---|---|---|
| [rexia-synthetic-chat-500k](https://huggingface.co/datasets/RexiaAI/rexia-synthetic-chat-500k) | ~487k | Fine-tuning / SFT |
| **rexia-synthetic-pretrain-1m** | **~984k** | Pre-training / continual pre-training |

---

## Licence

Apache 2.0, generated from Ministral-3B, which is Apache 2.0 licensed.
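The two deduplication steps listed under Generation (exact hash, then MinHash near-duplicate detection) can be illustrated with a small pure-Python sketch. This is not the authors' pipeline: the card gives only the parameters (threshold 0.82, 128 permutations, 5-gram shingles), so word-level shingling, seeded BLAKE2b as a stand-in for permutations, and all function names are assumptions. The sketch also compares signatures pairwise instead of using LSH banding, which is only practical for small inputs.

```python
import hashlib
import re

NUM_PERM = 128     # permutations, as stated in the card
SHINGLE_SIZE = 5   # 5-gram shingles, as stated in the card
THRESHOLD = 0.82   # Jaccard threshold, as stated in the card


def shingles(text, n=SHINGLE_SIZE):
    """Word-level n-grams (word-level is an assumption; the card does not say)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash_signature(text, num_perm=NUM_PERM):
    """Keep the minimum hash per seed; seeded BLAKE2b stands in for permutations."""
    grams = [g.encode("utf-8") for g in shingles(text)]
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g, digest_size=8, salt=salt).digest(), "little"
            )
            for g in grams
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions estimates the Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def dedupe(texts, threshold=THRESHOLD):
    """Drop exact duplicates first, then near-duplicates (pairwise, no LSH banding)."""
    seen_hashes = set()
    kept = []  # (text, signature) pairs retained so far
    for text in texts:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate
        seen_hashes.add(digest)
        sig = minhash_signature(text)
        if any(estimated_jaccard(sig, other) >= threshold for _, other in kept):
            continue  # near duplicate of something already kept
        kept.append((text, sig))
    return [text for text, _ in kept]


# Hypothetical inputs: an exact repeat, a punctuation/case variant, and an unrelated text.
a = "The quick brown fox jumps over the lazy dog near the river bank today"
b = "the quick brown fox jumps over the lazy dog near the river bank today!"
c = "Completely different sentence about training small language models on synthetic data"
print(dedupe([a, a, b, c]))
```

At the scale of this dataset a real pipeline would bucket signatures with LSH bands so that only candidate pairs are compared; the pairwise loop here keeps the example short and makes the threshold check explicit.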

Provider:

RexiaAI
