
RexiaAI/rexia-synthetic-chat-500k

Hugging Face · Updated 2026-02-27 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/RexiaAI/rexia-synthetic-chat-500k
Resource description:
---
license: apache-2.0
language:
- en
tags:
- instruction-tuning
- chat
- synthetic
- conversational
size_categories:
- 100K<n<1M
---

# Rexia Synthetic Chat 500k

A synthetic instruction-following dataset of ~487k cleaned, deduplicated conversations generated using [Ministral-3B](https://huggingface.co/mistralai/Ministral-3-3B-Instruct-2410) (Apache 2.0 licence) across 9 diverse categories.

Designed to provide high-quality, stylistically varied fine-tuning data for small language models, avoiding the GPT-4 stylistic bias common in datasets like OpenHermes and SlimOrca.

## Dataset Details

| Category | Samples |
|---|---|
| Factual Q&A | ~80,000 |
| Coding (Python, JS, SQL, bash, etc.) | ~78,000 |
| Conversational / advice / opinions | ~70,000 |
| Concept explanation | ~60,000 |
| Step-by-step reasoning | ~58,000 |
| Mathematics word problems | ~42,000 |
| Creative writing | ~40,000 |
| Comparison / analysis | ~30,000 |
| Multi-turn dialogue | ~30,000 |
| **Total** | **~487,000** |

## Format

Each sample contains a `text` field formatted for instruction tuning:

```
<|user|>
{question}
<|assistant|>
{answer}<|end|>
```

Multi-turn samples include multiple exchanges:

```
<|user|>
{question_1}
<|assistant|>
{answer_1}<|end|>
<|user|>
{question_2}
<|assistant|>
{answer_2}<|end|>
```

A `source` field identifies the category (e.g. `synthetic_coding`, `synthetic_factual`).

## Generation

- **Generator model:** `ministral-3:3b` via Ollama
- **Parallelism:** 6 concurrent workers
- **Cleaning:** encoding artefact removal, quality filtering (min length, refusal detection, alpha ratio, repetition check)
- **Deduplication:** exact hash dedup + MinHash LSH near-dedup (Jaccard threshold 0.82, 128 permutations, 5-gram shingles)
- **Total removed:** ~12,500 samples (2.5%)

## Intended Use

Fine-tuning small language models (100M–1B parameters) for instruction following and conversational ability. The diverse category coverage and varied response styles help prevent models from collapsing to narrow stylistic patterns.

## Licence

Apache 2.0. The data was generated from Ministral-3B, which is Apache 2.0 licensed.
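
## Example: parsing the `text` field

The `text` format described under Format can be split back into individual exchanges with a small regular expression. The following is a minimal sketch, not part of the original card: the helper names `parse_turns` and `to_messages`, the role/content message layout, and the sample string are all hypothetical, and because the card does not specify the exact whitespace around the markers, the sketch simply strips it.

```python
import re
from typing import Dict, List, Tuple

# Role and end-of-turn markers, as given in the Format section.
USER, ASSISTANT, END = "<|user|>", "<|assistant|>", "<|end|>"

# One exchange: user marker, question, assistant marker, answer, end marker.
# DOTALL lets questions and answers span multiple lines.
_TURN = re.compile(
    re.escape(USER) + r"(.*?)" + re.escape(ASSISTANT) + r"(.*?)" + re.escape(END),
    re.DOTALL,
)


def parse_turns(text: str) -> List[Tuple[str, str]]:
    """Split a `text` field into (question, answer) pairs, in order."""
    return [(q.strip(), a.strip()) for q, a in _TURN.findall(text)]


def to_messages(text: str) -> List[Dict[str, str]]:
    """Convert a `text` field into a role/content message list (assumed layout)."""
    messages: List[Dict[str, str]] = []
    for question, answer in parse_turns(text):
        messages.append({"role": "user", "content": question})
        messages.append({"role": "assistant", "content": answer})
    return messages


# Hypothetical two-turn sample, not taken from the dataset.
sample = (
    "<|user|>\nWhat is 2 + 2?\n<|assistant|>\n4<|end|>\n"
    "<|user|>\nAnd doubled?\n<|assistant|>\n8<|end|>"
)

for question, answer in parse_turns(sample):
    print(f"Q: {question} | A: {answer}")
```

A single-turn sample yields one pair and a multi-turn sample yields one pair per exchange, so the same helper can serve all nine categories. Text without a complete user/assistant/end triple produces an empty list, which may be a convenient signal for skipping malformed rows.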
Provider:
RexiaAI