RexiaAI/rexia-synthetic-pretrain-1m
Source: Hugging Face. Updated 2026-03-08; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/RexiaAI/rexia-synthetic-pretrain-1m
Resource description:
---
license: apache-2.0
language:
- en
tags:
- instruction-tuning
- chat
- synthetic
- conversational
- pretrain
size_categories:
- 1M<n<10M
task_categories:
- text-generation
dataset_info:
  features:
  - name: text
    dtype: string
  - name: source
    dtype: string
  splits:
  - name: train
    num_bytes: 1147064728
    num_examples: 984372
  download_size: 680967212
  dataset_size: 1147064728
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---
# Rexia Synthetic Pretrain 1M
A synthetic instruction-following dataset of ~984k cleaned, deduplicated conversations (984,372 examples in the `train` split), generated with Ministral-3B (Apache 2.0 licence) across nine categories.
It is designed as a large-scale pre-training / continual pre-training corpus for small language models, providing high-quality, stylistically varied data that avoids the GPT-4 stylistic bias common in datasets such as OpenHermes and SlimOrca.
This is the larger companion to [RexiaAI/rexia-synthetic-chat-500k](https://huggingface.co/datasets/RexiaAI/rexia-synthetic-chat-500k), roughly doubling the sample count.
---
## Dataset Details
| Category | Samples |
|---|---|
| Conversational / advice / opinions | ~205,000 |
| Factual Q&A | ~195,000 |
| Concept explanation | ~160,000 |
| Step-by-step reasoning | ~145,000 |
| Mathematics word problems | ~89,000 |
| Creative writing | ~79,000 |
| Comparison / analysis | ~70,000 |
| Multi-turn dialogue | ~40,000 |
| Coding (Python, JS, SQL, Bash, etc.) | ~1,800 |
| **Total** | **~984,000** |
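The per-category figures above are approximate. A minimal sketch of how they could be re-derived from the `source` column is shown below. It assumes the Hugging Face `datasets` library is installed; `load_rows`, `count_sources`, and `shares` are hypothetical helper names, and only two `source` labels are documented on this card, so the exact label set is not assumed here.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping


def load_rows(streaming: bool = True):
    """Open the train split; needs `pip install datasets` and network access."""
    # Imported lazily so the counting helpers below also work offline.
    from datasets import load_dataset

    return load_dataset(
        "RexiaAI/rexia-synthetic-pretrain-1m", split="train", streaming=streaming
    )


def count_sources(rows: Iterable[Mapping[str, str]]) -> Counter:
    """Tally rows per `source` label."""
    return Counter(row["source"] for row in rows)


def shares(counts: Counter) -> Dict[str, float]:
    """Convert raw counts into fractions of the total."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}
```

Calling `count_sources(load_rows())` would iterate over the whole split, so streaming mode is the default to avoid downloading everything up front.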
---
## Format
Each sample contains a `text` field formatted for instruction tuning:
```
<|user|>
{question}
<|assistant|>
{answer}<|end|>
```
Multi-turn samples include multiple exchanges:
```
<|user|>
{question_1}
<|assistant|>
{answer_1}<|end|>
<|user|>
{question_2}
<|assistant|>
{answer_2}<|end|>
```
A `source` field identifies the category (e.g. `synthetic_coding`, `synthetic_factual`).
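The template above can be serialised and parsed with a few lines of code. The sketch below assumes the markers are separated by single newlines exactly as shown and that the marker strings never occur inside a question or answer; `render` and `parse` are hypothetical helper names, not part of any published tooling for this dataset.

```python
from typing import List, Tuple

USER, ASSISTANT, END = "<|user|>", "<|assistant|>", "<|end|>"


def render(turns: List[Tuple[str, str]]) -> str:
    """Serialise (question, answer) pairs into the template shown above."""
    return "\n".join(f"{USER}\n{q}\n{ASSISTANT}\n{a}{END}" for q, a in turns)


def parse(text: str) -> List[Tuple[str, str]]:
    """Recover (question, answer) pairs from a `text` field."""
    turns = []
    for chunk in text.split(END):
        chunk = chunk.strip()
        if not chunk:
            continue  # trailing empty piece after the last <|end|>
        head, sep, answer = chunk.partition(ASSISTANT)
        if not sep or not head.startswith(USER):
            raise ValueError("chunk does not match the expected template")
        question = head[len(USER):]
        # Surrounding whitespace is dropped (an assumption about the data).
        turns.append((question.strip(), answer.strip()))
    return turns
```

Because `parse` strips surrounding whitespace, a round trip through `render` is exact only for questions and answers without leading or trailing whitespace.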
---
## Generation
- **Generator model:** `ministral-3b` via Ollama (Apache 2.0 licence)
- **Parallelism:** Multiple concurrent workers
- **Cleaning pipeline:**
  - Encoding artefact removal
  - Quality filtering (minimum length, refusal detection, alpha-character ratio, repetition check)
  - Exact hash deduplication
  - MinHash LSH near-deduplication (Jaccard threshold 0.82, 128 permutations, 5-gram shingles)
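The filtering and deduplication steps listed above can be illustrated with a standard-library sketch. It is not the pipeline that produced this dataset: only the Jaccard threshold, permutation count, and shingle size come from the card, while the quality thresholds, refusal markers, word-level shingling, and all function names are assumptions made for illustration. It also compares MinHash signatures pairwise, where a real LSH index would bucket signatures to avoid the linear scan.

```python
import hashlib
import re
from typing import Iterable, List, Set

NUM_PERM = 128      # permutations, as listed above
SHINGLE = 5         # 5-gram shingles (word-level here: an assumption)
THRESHOLD = 0.82    # Jaccard threshold, as listed above
MIN_CHARS = 40      # hypothetical minimum length
MIN_ALPHA = 0.6     # hypothetical alpha-character ratio
MIN_UNIQUE = 0.3    # hypothetical unique-word ratio (repetition check)
REFUSALS = ("i cannot", "i can't", "as an ai")  # hypothetical refusal markers


def passes_quality(text: str) -> bool:
    """Length, refusal, alpha-ratio, and repetition checks (made-up limits)."""
    lowered = text.lower()
    words = lowered.split()
    if len(text) < MIN_CHARS or not words:
        return False
    if any(marker in lowered for marker in REFUSALS):
        return False
    if sum(ch.isalpha() for ch in text) / len(text) < MIN_ALPHA:
        return False
    return len(set(words)) / len(words) >= MIN_UNIQUE


def shingles(text: str, n: int = SHINGLE) -> Set[str]:
    """Lower-cased word n-grams; short texts yield a single shingle."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash(items: Set[str], num_perm: int = NUM_PERM) -> List[int]:
    """One minimum per salted hash; the salts stand in for permutations."""
    sig = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in items
        ))
    return sig


def estimate_jaccard(a: List[int], b: List[int]) -> float:
    """Fraction of signature positions that agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def dedup(texts: Iterable[str]) -> List[str]:
    """Quality filter, then exact-hash and MinHash near-duplicate removal."""
    kept, seen, signatures = [], set(), []
    for text in texts:
        if not passes_quality(text):
            continue
        digest = hashlib.sha256(text.encode()).hexdigest()
        if digest in seen:
            continue  # exact duplicate
        sig = minhash(shingles(text))
        if any(estimate_jaccard(sig, s) >= THRESHOLD for s in signatures):
            continue  # near duplicate of something already kept
        seen.add(digest)
        signatures.append(sig)
        kept.append(text)
    return kept
```

The first occurrence of each cluster is kept, so the result depends on input order. For corpora of this size the pairwise scan would be far too slow, which is the reason an LSH index is used in practice.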
---
## Intended Use
The dataset is intended for pre-training or continual pre-training of small language models (100M–3B parameters) for general instruction following and conversational ability. The large volume and diverse category coverage help models learn broad knowledge and varied response styles before targeted fine-tuning.
---
## Relationship to rexia-synthetic-chat-500k
| Dataset | Samples | Primary use |
|---|---|---|
| [rexia-synthetic-chat-500k](https://huggingface.co/datasets/RexiaAI/rexia-synthetic-chat-500k) | ~487k | Fine-tuning / SFT |
| **rexia-synthetic-pretrain-1m** | **~984k** | Pre-training / continual pre-training |
---
## Licence
Apache 2.0. The data was generated with Ministral-3B, which is itself Apache 2.0 licensed.
Provider:
RexiaAI



