MinimaML/Alexandria-100K
Source: Hugging Face. Updated 2025-12-16. Indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/MinimaML/Alexandria-100K
Description:
Alexandria 100K is a reasoning-dense dataset designed to train "Thinker" models. Unlike standard instruction datasets that focus on short answers, Alexandria explicitly targets internal reasoning, chain-of-thought, and detailed explanations. It was generated by Qwen3-Next-80B-A3B-Thinking, which distills raw knowledge into structured, pedagogical lessons. The dataset contains 100,000 samples in JSONL format, each with three fields (Prompt, Completion, Category), and is strictly divided into five "Buckets," each intended to train a specific capability of the student model: Academic (35%), Instruction (25%), Creative (15%), Code (10%), and Memory (15%). It is intended for "Instruction Pre-training", that is, training from scratch, and is particularly suited to Phi/Orca-style models because it teaches the model both the knowledge and the thought process at the same time.
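The record layout and bucket split described above can be checked with a few lines of Python. This is a minimal sketch, not the dataset's official tooling: the exact field name `Category`, the category label spellings, and the two sample records are assumptions made for illustration, so inspect a real record before relying on them.

```python
import io
import json
from collections import Counter

# Bucket shares as stated in the description (fractions of the 100,000 samples).
EXPECTED_SHARES = {
    "Academic": 0.35,
    "Instruction": 0.25,
    "Creative": 0.15,
    "Code": 0.10,
    "Memory": 0.15,
}


def category_shares(lines, category_key="Category"):
    """Return the fraction of records per category from an iterable of JSONL lines.

    `category_key` is an assumed field name; adjust it if the real file differs.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        counts[record[category_key]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: n / total for name, n in counts.items()}


# Two made-up records standing in for the real file (hypothetical content).
sample = io.StringIO(
    '{"Prompt": "Explain entropy.", "Completion": "...", "Category": "Academic"}\n'
    '{"Prompt": "Write a loop.", "Completion": "...", "Category": "Code"}\n'
)
shares = category_shares(sample)
print(shares)
```

To run this against the actual data, pass an open file handle for the downloaded JSONL file instead of `sample`, then compare the result with `EXPECTED_SHARES`.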
Provider:
MinimaML



