Verdugie/opus-4.6-training-catalog

Name: Verdugie/opus-4.6-training-catalog
Creator: Verdugie
Published: 2026-03-24 04:27:26
License: No description provided

Source: Hugging Face | Updated: 2026-03-24 | Indexed: 2026-03-29

Download link:

https://hf-mirror.com/datasets/Verdugie/opus-4.6-training-catalog

Description:

---
license: apache-2.0
task_categories:
- text-generation
language:
- en
tags:
- claude
- opus-4.6
- reasoning
- coding
- conversation
- distillation
- synthetic
size_categories:
- 100K<n<1M
---

# Opus 4.6 Community Training Catalog

A curated, cleaned, and deduplicated collection of community-created Claude Opus 4.6 distillation datasets from HuggingFace. Contains reasoning traces, coding examples, and conversational data generated by `claude-opus-4-6`.

## Dataset Summary

| Metric | Value |
|--------|-------|
| **Total conversations** | 168,301 |
| **Format** | JSONL, ShareGPT-style (messages array with role/content) |
| **Size** | ~154 MB |
| **Language** | English |

## Splits

| Split | Rows | Size | Description |
|-------|------|------|-------------|
| `reasoning.jsonl` | 166,698 | 133 MB | Reasoning, math, logic, chain-of-thought traces |
| `coding.jsonl` | 662 | 19 MB | Programming, software engineering, high-reasoning coding |
| `conversation.jsonl` | 941 | 1.8 MB | Relational conversation, stance distillation |

## Source Datasets (6 verified)

Every dataset was verified by reading its HuggingFace README to confirm Claude Opus 4.6 was explicitly stated as the generation model.

| # | Source | Rows | Split | License |
|---|--------|------|-------|---------|
| 1 | [owenisas/opus46-reasoning-mix-full](https://huggingface.co/datasets/owenisas/opus46-reasoning-mix-full) | 156,293 | reasoning | Unspecified |
| 2 | [Roman1111111/claude-opus-4.6-10000x](https://huggingface.co/datasets/Roman1111111/claude-opus-4.6-10000x) | 7,464 | reasoning | MIT |
| 3 | [LEGENDQ/Claude-Opus-4.6-Reasoning-Dataset](https://huggingface.co/datasets/LEGENDQ/Claude-Opus-4.6-Reasoning-Dataset) | 2,056 | reasoning | Apache-2.0 |
| 4 | [TeichAI/Claude-Opus-4.6-Reasoning-927x](https://huggingface.co/datasets/TeichAI/Claude-Opus-4.6-Reasoning-927x) | 885 | reasoning | Apache-2.0 |
| 5 | [dalisoft/claude-opus-4.6-high-reasoning-700x](https://huggingface.co/datasets/dalisoft/claude-opus-4.6-high-reasoning-700x) | 662 | coding | Apache-2.0 |
| 6 | [aptgetupdate/Claude-Opus-4.6-stance-distilled-RELATIONAL](https://huggingface.co/datasets/aptgetupdate/Claude-Opus-4.6-stance-distilled-RELATIONAL) | 941 | conversation | CC-BY-SA-4.0 |

## Data Format

Each row is a JSON object with a unified schema:

```json
{
  "source": "owenisas",
  "topic": "reasoning",
  "messages": [
    {"role": "system", "content": "..."},
    {"role": "user", "content": "..."},
    {"role": "assistant", "content": "..."}
  ]
}
```

The `source` field identifies the original dataset creator. The `topic` field indicates the content category. Messages follow the standard ShareGPT format compatible with most fine-tuning frameworks (Axolotl, Unsloth, LLaMA-Factory, etc.).

## Cleaning Methodology

1. **Collection**: Exhaustive HuggingFace Hub search using queries: `claude opus 4.6`, `opus-4-6`, `claude-opus-4-6`, `opus46`, etc.
2. **Verification**: Each dataset's README manually checked to confirm Opus 4.6 as the generation model. 17 candidates rejected (forks, mixed-model, unconfirmed, or duplicates).
3. **Format normalization**: All datasets converted to a unified `{source, topic, messages}` schema.
4. **Deduplication**: Cross-dataset deduplication to remove exact and near-duplicate conversations.
5. **Quality filtering**: Empty rows, malformed messages, and garbage content removed.

## Intended Use

Fine-tuning and distillation of open-source language models on high-quality Opus 4.6 reasoning traces, coding examples, and conversational patterns. Suitable for LoRA/QLoRA training on 7B–72B parameter models targeting improved reasoning and instruction-following.

## Limitations

- All data is synthetically generated by Claude Opus 4.6. It inherits any biases or limitations of the source model.
- The dataset is English-only.
- Licensing varies by source dataset; check individual source licenses for specific use cases.

## Citation

Curated by [Verdugie](https://huggingface.co/Verdugie). Original data created by the respective dataset authors listed above.
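## Example: Reading and Validating Rows

The unified `{source, topic, messages}` schema described under Data Format can be checked with a few lines of standard-library Python. The sketch below is illustrative only: the helper names (`is_valid_row`, `iter_rows`, `count_by_topic`) are hypothetical, and the set of allowed roles is an assumption taken from the example row on this card, not a documented constraint of the dataset.

```python
import json
from collections import Counter
from typing import Iterable, Iterator

# Assumption: only the three roles shown in the card's example row occur.
ALLOWED_ROLES = {"system", "user", "assistant"}


def is_valid_row(row: object) -> bool:
    """Return True if `row` matches the {source, topic, messages} schema."""
    if not isinstance(row, dict):
        return False
    if not isinstance(row.get("source"), str):
        return False
    if not isinstance(row.get("topic"), str):
        return False
    messages = row.get("messages")
    if not isinstance(messages, list) or not messages:
        return False
    # Every message needs a known role and string content.
    return all(
        isinstance(m, dict)
        and m.get("role") in ALLOWED_ROLES
        and isinstance(m.get("content"), str)
        for m in messages
    )


def iter_rows(lines: Iterable[str]) -> Iterator[dict]:
    """Yield valid rows from JSONL lines, skipping blank or malformed ones."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            row = json.loads(line)
        except json.JSONDecodeError:
            continue
        if is_valid_row(row):
            yield row


def count_by_topic(lines: Iterable[str]) -> Counter:
    """Count valid rows per `topic` value."""
    return Counter(row["topic"] for row in iter_rows(lines))
```

Because `iter_rows` accepts any iterable of strings, an open file handle works directly, for example `with open("reasoning.jsonl", encoding="utf-8") as f: print(count_by_topic(f))`. Reading line by line keeps memory use flat even for the 133 MB reasoning split.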
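## Example: Deduplication Sketch

The Cleaning Methodology section mentions cross-dataset removal of exact and near-duplicate conversations but does not say how it was done. The sketch below shows one plausible approach, not the curator's actual method: hash each conversation after collapsing whitespace and lowercasing, so rows that differ only in spacing or case collapse to one. The function names `conversation_key` and `dedupe` are hypothetical, and true fuzzy matching (for example MinHash) is deliberately left out.

```python
import hashlib
import json
import re
from typing import Iterable, Iterator

_WHITESPACE = re.compile(r"\s+")


def conversation_key(row: dict) -> str:
    """Hash the normalized (role, content) pairs; `source` and `topic` are ignored."""
    normalized = [
        (m["role"], _WHITESPACE.sub(" ", m["content"]).strip().lower())
        for m in row["messages"]
    ]
    payload = json.dumps(normalized, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def dedupe(rows: Iterable[dict]) -> Iterator[dict]:
    """Yield the first row seen for each conversation key, preserving order."""
    seen = set()
    for row in rows:
        key = conversation_key(row)
        if key not in seen:
            seen.add(key)
            yield row
```

Ignoring `source` in the key is what makes the pass cross-dataset: the same conversation contributed by two upstream datasets is kept only once, under whichever source appears first.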
