RexiaAI/rexia-synthetic-chat-500k
Source: Hugging Face. Updated 2026-02-27; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/RexiaAI/rexia-synthetic-chat-500k
Description:
---
license: apache-2.0
language:
- en
tags:
- instruction-tuning
- chat
- synthetic
- conversational
size_categories:
- 100K<n<1M
---
# Rexia Synthetic Chat 500k
A synthetic instruction-following dataset of ~487k cleaned, deduplicated
conversations generated using [Ministral-3B](https://huggingface.co/mistralai/Ministral-3-3B-Instruct-2410)
(Apache 2.0 licence) across nine categories.
It is designed to provide high-quality, stylistically varied fine-tuning data
for small language models while avoiding the GPT-4 stylistic bias common in
datasets like OpenHermes and SlimOrca.
## Dataset Details
| Category | Samples |
|---|---|
| Factual Q&A | ~80,000 |
| Coding (Python, JS, SQL, bash, etc.) | ~78,000 |
| Conversational / advice / opinions | ~70,000 |
| Concept explanation | ~60,000 |
| Step-by-step reasoning | ~58,000 |
| Mathematics word problems | ~42,000 |
| Creative writing | ~40,000 |
| Comparison / analysis | ~30,000 |
| Multi-turn dialogue | ~30,000 |
| **Total** | **~487,000** |
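To see how the categories are weighted relative to each other, the table can be turned into fractions. The sketch below copies the approximate counts from the table; the shortened `Coding` key and the `category_shares` helper are illustrative names, not part of the dataset. Because every row is rounded, the rows sum to 488,000 rather than the stated ~487,000, so the shares are approximate.

```python
# Approximate per-category counts copied from the table above.
CATEGORY_COUNTS = {
    "Factual Q&A": 80_000,
    "Coding": 78_000,
    "Conversational / advice / opinions": 70_000,
    "Concept explanation": 60_000,
    "Step-by-step reasoning": 58_000,
    "Mathematics word problems": 42_000,
    "Creative writing": 40_000,
    "Comparison / analysis": 30_000,
    "Multi-turn dialogue": 30_000,
}


def category_shares(counts):
    """Return each category's fraction of the summed (rounded) counts."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


shares = category_shares(CATEGORY_COUNTS)
for name, share in shares.items():
    print(f"{name:<38}{share:6.1%}")
```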
## Format
Each sample contains a `text` field formatted for instruction tuning:
```
<|user|>
{question}
<|assistant|>
{answer}<|end|>
```
Multi-turn samples include multiple exchanges:
```
<|user|>
{question_1}
<|assistant|>
{answer_1}<|end|>
<|user|>
{question_2}
<|assistant|>
{answer_2}<|end|>
```
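The two templates above can be produced by one small function. This is a minimal sketch, assuming that the newlines are placed exactly as the templates show them and that consecutive exchanges are joined by a single newline; `render_sample` is a hypothetical helper, not code shipped with the dataset.

```python
USER_TAG = "<|user|>"
ASSISTANT_TAG = "<|assistant|>"
END_TAG = "<|end|>"


def render_sample(turns):
    """Render (question, answer) pairs into the `text` layout shown above.

    Whitespace placement is an assumption read off the templates.
    """
    if not turns:
        raise ValueError("a sample needs at least one exchange")
    blocks = [
        f"{USER_TAG}\n{question}\n{ASSISTANT_TAG}\n{answer}{END_TAG}"
        for question, answer in turns
    ]
    return "\n".join(blocks)


print(render_sample([("What is 2 + 2?", "4"), ("And doubled?", "8")]))
```

A single-turn sample is the same call with one pair in the list.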
A `source` field identifies the category (e.g. `synthetic_coding`, `synthetic_factual`).
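For tooling that expects role/content messages rather than a flat string, the `text` field can be split back into turns, and rows can be tallied by `source`. The sketch below is one possible approach: it assumes the tag and newline layout shown in the templates above and treats each row as a plain dict with `text` and `source` keys. `parse_sample` and `count_sources` are hypothetical helpers.

```python
import re
from collections import Counter

# One exchange: user tag, question, assistant tag, answer, end tag.
# Assumes the newline placement shown in the format templates.
_EXCHANGE = re.compile(
    r"<\|user\|>\n(.*?)\n<\|assistant\|>\n(.*?)<\|end\|>",
    re.DOTALL,
)


def parse_sample(text):
    """Split a `text` field into a list of role/content messages."""
    messages = []
    for question, answer in _EXCHANGE.findall(text):
        messages.append({"role": "user", "content": question})
        messages.append({"role": "assistant", "content": answer})
    return messages


def count_sources(rows):
    """Tally rows by their `source` field."""
    return Counter(row["source"] for row in rows)
```

An answer that itself contains one of the literal tags would be split incorrectly by this pattern, so it is worth spot-checking parsed samples before relying on it.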
## Generation
- **Generator model:** `ministral-3:3b` via Ollama
- **Parallelism:** 6 concurrent workers
- **Cleaning:** encoding artefact removal, quality filtering (min length,
refusal detection, alpha ratio, repetition check)
- **Deduplication:** exact hash dedup + MinHash LSH near-dedup
(Jaccard threshold 0.82, 128 permutations, 5-gram shingles)
- **Total removed:** ~12,500 samples (2.5%)
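The deduplication step listed above can be illustrated with a small pure-Python sketch. It is not the pipeline that produced this dataset: the card does not name the library used, and a library such as `datasketch` would be the more practical choice at this scale. The permutation count, threshold, and shingle size come from the list above; character-level shingles, the 16 x 8 band layout, seeded BLAKE2b standing in for random permutations, and every function name are assumptions made for illustration.

```python
import hashlib
from collections import defaultdict

NUM_PERM = 128      # from the card
THRESHOLD = 0.82    # from the card
SHINGLE_SIZE = 5    # from the card; character-level is an assumption
BANDS = 16          # assumption: 16 bands x 8 rows = 128 slots
ROWS = NUM_PERM // BANDS


def shingles(text, k=SHINGLE_SIZE):
    """Set of character k-grams over lowercased, whitespace-normalised text."""
    norm = " ".join(text.lower().split())
    if len(norm) <= k:
        return {norm}
    return {norm[i:i + k] for i in range(len(norm) - k + 1)}


def _hash(seed, token):
    """Seeded 64-bit hash; each seed stands in for one permutation."""
    digest = hashlib.blake2b(
        token.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash(text):
    """MinHash signature: for each seed, the minimum hash over all shingles."""
    grams = shingles(text)
    return tuple(min(_hash(seed, g) for g in grams) for seed in range(NUM_PERM))


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature slots that agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(texts):
    """Return indices of texts kept after exact and near-duplicate removal."""
    seen_exact = set()
    buckets = defaultdict(list)  # (band index, band values) -> kept indices
    signatures = {}
    kept = []
    for idx, text in enumerate(texts):
        # Pass 1: exact duplicates via a content hash.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen_exact:
            continue
        seen_exact.add(digest)
        # Pass 2: LSH banding proposes candidates, the signature verifies them.
        sig = minhash(text)
        keys = [(b, sig[b * ROWS:(b + 1) * ROWS]) for b in range(BANDS)]
        candidates = {j for key in keys for j in buckets.get(key, ())}
        if any(estimated_jaccard(sig, signatures[j]) >= THRESHOLD
               for j in candidates):
            continue
        signatures[idx] = sig
        kept.append(idx)
        for key in keys:
            buckets[key].append(idx)
    return kept
```

The first text in any group of duplicates is the one kept. Banding only proposes candidate pairs; the 0.82 threshold is applied to the estimated Jaccard similarity, so pairs that share a band by chance are not removed.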
## Intended Use
Fine-tuning small language models (100M–1B parameters) for instruction
following and conversational ability. The diverse category coverage and
varied response styles help prevent models from collapsing to narrow
stylistic patterns.
## Licence
Apache 2.0. The data was generated with Ministral-3B, which is itself Apache 2.0 licensed.
Provider:
RexiaAI



