qwen3-coder-480b-distill-mini

Name: qwen3-coder-480b-distill-mini
Creator: maas
Published: 2026-01-06 16:44:12
License: no description provided

ModelScope community · Updated 2026-01-06 · Indexed 2025-08-30

Download link:

https://modelscope.cn/datasets/AI-ModelScope/qwen3-coder-480b-distill-mini


Resource description:

### **qwen3-coder-480b-distill-mini**

---

### **Short Description**

This dataset is **distilled using Qwen3-Coder-480B-A35B-Instruct**. We extracted **10,000 code questions** from **microsoft/rStar-Coder** as seed problems, distilled them with **32K context**, and after cleaning and filtering, **9,543 samples remain**. **License: Apache-2.0.**

---

### **Dataset Overview**

- **Seed Source:** 10,000 code reasoning problems sampled from microsoft/rStar-Coder.
- **Distillation Model:** Qwen3-Coder-480B-A35B-Instruct (**480B parameters, 35B active**).
- **Context Length:** Up to **32K tokens** used during distillation.
- **Final Count:** **9,543 cleaned samples**.
- **License:** Apache-2.0.

---

### **Why These Sources and Models?**

**Why rStar-Coder?**

- A **large-scale programming dataset** with **418K problems** and **580K reasoning solutions**.
- Includes **input-output validation** with extensive test cases.
- **High difficulty and diversity** make it ideal for constructing **reliable datasets**.

**Why Qwen3-Coder-480B-A35B-Instruct?**

- One of the **strongest Qwen code models**, based on a **Mixture-of-Experts (MoE) design**.
- **Competitive with frontier models** in **agentic coding** and **tool usage**.
- Supports **long context reasoning** (native **256K**, extendable to **1M tokens**).
- **Apache-2.0 license** enables both **research** and **commercial applications**.

---

### **Data Cleaning Process**

1. **Deduplication:** Keep only unique questions.
2. **Integrity Check:** Remove entries missing **question/answer/reasoning**.
3. **Format Validation:** Filter out **unbalanced brackets** and **truncated reasoning**.
4. **Content Filtering:** Remove **overly templated** or **highly repetitive** reasoning traces.
5. **Final Output:** **9,543 high-quality samples**.

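
The cleaning steps listed above could be approximated along the following lines. This is only a minimal sketch, not the authors' actual pipeline: the function names (`brackets_balanced`, `is_repetitive`, `clean`), the 0.5 repetition threshold, and the choice to apply the checks to the `Input` and `code_output` fields are all assumptions made for illustration. Truncation detection is not covered.

```python
from collections import Counter

# Closing bracket -> matching opening bracket.
PAIRS = {")": "(", "]": "[", "}": "{"}


def brackets_balanced(text: str) -> bool:
    """Naive stack check; it does not skip brackets inside strings or comments."""
    stack = []
    for ch in text:
        if ch in "([{":
            stack.append(ch)
        elif ch in PAIRS:
            if not stack or stack.pop() != PAIRS[ch]:
                return False
    return not stack


def is_repetitive(text: str, max_ratio: float = 0.5) -> bool:
    """Hypothetical heuristic: flag text in which one line makes up most of the content."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) < 4:
        return False
    top_count = Counter(lines).most_common(1)[0][1]
    return top_count / len(lines) > max_ratio


def clean(records):
    """Apply deduplication, integrity, format, and content filters in that order."""
    seen = set()
    kept = []
    for rec in records:
        question = (rec.get("Input") or "").strip()
        answer = rec.get("code_output") or ""
        # 1. Deduplication: keep only the first occurrence of each question.
        if question in seen:
            continue
        seen.add(question)
        # 2. Integrity check: drop records with an empty question or answer.
        if not question or not answer.strip():
            continue
        # 3. Format validation: drop answers with unbalanced brackets.
        if not brackets_balanced(answer):
            continue
        # 4. Content filtering: drop highly repetitive answers.
        if is_repetitive(answer):
            continue
        kept.append(rec)
    return kept
```

Because a question is marked as seen before the later checks run, a duplicate is dropped even when the first copy fails a filter. A real pipeline might prefer to keep the first copy that passes every check instead.
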
---

### **Field Schema**

Each JSONL record contains:

- **generator:** Qwen3-Coder-480B-A35B-Instruct
- **category:** code
- **Input:** Problem extracted from rStar-Coder
- **code_output:** Final distilled answer
- **qid:** Unique ID (hash or UUID)

---

### **License**

This dataset is released under the **Apache-2.0 License**.
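
To make the field schema described above concrete, here is a small sketch of reading JSONL records and checking them against the five listed fields. It assumes the keys are spelled exactly as on the card (including the capitalized `Input`). The helper name `parse_jsonl` and the sample record are invented for illustration and are not taken from the dataset.

```python
import json

# Field names as listed in the schema; exact spelling is an assumption.
EXPECTED_FIELDS = {"generator", "category", "Input", "code_output", "qid"}


def parse_jsonl(lines):
    """Yield (record, missing_fields) for every non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        missing = sorted(EXPECTED_FIELDS - set(record))
        yield record, missing


# Illustrative record, made up for this example.
sample = json.dumps({
    "generator": "Qwen3-Coder-480B-A35B-Instruct",
    "category": "code",
    "Input": "Read two integers and print their sum.",
    "code_output": "a, b = map(int, input().split())\nprint(a + b)",
    "qid": "example-0001",
})

for record, missing in parse_jsonl([sample]):
    print(record["qid"], "missing fields:", missing)
```

In practice the same generator can be fed an open file handle, for example `parse_jsonl(open(path, encoding="utf-8"))`, where `path` points to a locally downloaded JSONL file.
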


Provider:

maas

Created:

2025-08-25
