MinimaML/Alexandria-100K

Name: MinimaML/Alexandria-100K
Creator: MinimaML
Published: 2025-12-16 15:57:19
License: No description available

Source: Hugging Face. Updated: 2025-12-16. Indexed: 2025-12-20.

Download link:

https://hf-mirror.com/datasets/MinimaML/Alexandria-100K

Description:

Alexandria 100k is a high-quality, reasoning-dense dataset designed to train "Thinker" models. Unlike standard instruction datasets that focus on short answers, Alexandria explicitly targets internal reasoning, chain-of-thought, and detailed explanations. It was generated by Qwen3-Next-80B-A3B-Thinking, which distills raw knowledge into structured, pedagogical lessons. The dataset contains 100,000 samples in JSONL format (Prompt, Completion, Category), strictly divided into five "Buckets," each designed to train a specific capability of the Student model: Academic (35%), Instruction (25%), Creative (15%), Code (10%), and Memory (15%). It is suited to "Instruction Pre-training" (training from scratch), particularly for Phi/Orca-style models, because it teaches the model both the knowledge and the thought process at the same time.
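As a rough illustration of the record layout and bucket split described above, the following sketch counts the share of each category in a set of JSONL lines. The lowercase key names (`prompt`, `completion`, `category`), the category label spellings, and the sample records are assumptions made for the example, not confirmed details of the published files; `category_shares` is a hypothetical helper.

```python
import json
from collections import Counter
from typing import Iterable


def category_shares(lines: Iterable[str], key: str = "category") -> dict[str, float]:
    """Return the fraction of records per category from JSONL lines.

    The key name "category" is an assumption; adjust it to the real schema.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        counts[record.get(key, "<missing>")] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: n / total for name, n in counts.items()}


# Made-up records standing in for the real file; the values are placeholders.
sample = [
    '{"prompt": "p1", "completion": "c1", "category": "Academic"}',
    '{"prompt": "p2", "completion": "c2", "category": "Code"}',
    '{"prompt": "p3", "completion": "c3", "category": "Academic"}',
    '{"prompt": "p4", "completion": "c4", "category": "Memory"}',
]
shares = category_shares(sample)
print(shares)
```

Run against the downloaded file (for example by passing an open file handle as `lines`), the resulting shares could be compared with the stated 35/25/15/10/15 split.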

Provider:

MinimaML
