Cukinator/cpu1-ablation-dataset

Name: Cukinator/cpu1-ablation-dataset
Creator: Cukinator
Published: 2026-04-11 18:36:31
License: MIT (as declared in the dataset card)

Hugging Face | Updated 2026-04-11 | Indexed 2026-04-12

Download link:

https://hf-mirror.com/datasets/Cukinator/cpu1-ablation-dataset

Resource description:

---
license: mit
task_categories:
- text-generation
language:
- en
size_categories:
- 100M<n<1B
---

# CPU-1 Ablation Dataset (Knowledge Distillation)

This dataset contains pre-computed teacher log probabilities and hidden states extracted from **Qwen/Qwen2.5-3B** over a subset of HuggingFaceFW/fineweb, specifically designed for knowledge distillation into the CPU-1 Byte-Level architecture.

## Overview

- **Source Text:** HuggingFaceFW/fineweb (Sample-10BT)
- **Teacher Model:** Qwen/Qwen2.5-3B
- **Document Count:** ~850,000 documents
- **Distillation Method:** BPE-to-Byte marginalization & Raw BPE logits.

## Dataset Structure

The dataset comprises two synchronized sub-datasets. Each sequence has a length of 5000 items (BPE tokens or 4-byte patches). Shards hold exactly 50 sequences to align with the multi-processor streaming logic.

### 1. `byte_marginalized/`

Used for training the byte-level MLGRU models.

- **patches** `[seq_len, 4] uint8`: Flattened list. The 4-byte input patches.
- **targets** `[seq_len, 4] uint8`: Flattened list. The next 4-byte patches, shifted by 1.
- **teacher_probs** `[seq_len, 256] float32`: Flattened list. The marginalized probability distribution for the *first byte* of the target patch.
- **teacher_mask** `[seq_len] bool`: Positional mask indicating if a BPE boundary aligns with the patch, representing valid teacher signals.
- **teacher_hidden** `[seq_len, teacher_dim] float32` *(Optional)*: Flattened list. The projected internal hidden state of the teacher model, used for embedding alignment loss.
- **teacher_dim** `int32` *(Optional)*: The projected dimension size (e.g. 128).

### 2. `bpe_tokenized/`

Used for ablations on the BPE-level architectures.

- **input_ids** `[seq_len] int32`: The BPE token IDs.
- **labels** `[seq_len] int32`: The shifted BPE targets.
- **teacher_probs_bpe** `[seq_len, 128] float32`: Flattened list. The raw P(next_token) probabilities of the Top-128 predictions.
- **teacher_ids_bpe** `[seq_len, 128] int32`: Flattened list. The vocabulary IDs corresponding to the Top-128 predictions.
- **teacher_mask_bpe** `[seq_len] bool`: Indicates presence of teacher signal.

## Architecture Robustness (UTF-8)

This dataset is guaranteed to have **100% mathematically accurate BPE-to-Byte offsets**. Rather than reconstructing partial bytes from BPE tokens (which fails dramatically on multi-byte UTF-8 sequences such as emoji or kanji when tokens are split), the extraction engine uses the Rust Fast-Tokenizer's internal `offset_mapping` to slice the UTF-8 bytes of the original string directly. This eliminates offset drift and ensures the marginalized sequences are independent of BPE chunking patterns.

## Purpose

This dataset is tightly coupled with the CPU-1 Ablation Suite. It prevents running the heavy Qwen2.5-3B teacher forward pass more than once. All 15 CPU-1 ablation experimental runs stream from this unified dataset without re-computation.
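Because most fields are stored as flattened lists, a consumer has to restore the documented shapes before use. The following is a minimal sketch of that step for one `byte_marginalized/` record. It assumes row-major flattening and that a record is a plain dictionary keyed by the field names above; the helper names `rows` and `unflatten_byte_record` are hypothetical and not part of the dataset or the CPU-1 code.

```python
from typing import Any, Dict, List

PATCH_BYTES = 4   # bytes per patch, per the dataset card
BYTE_VOCAB = 256  # row width of teacher_probs, per the dataset card


def rows(flat: List[Any], width: int) -> List[List[Any]]:
    """Split a flat list into consecutive rows of `width` items (row-major assumed)."""
    if width <= 0 or len(flat) % width != 0:
        raise ValueError(f"length {len(flat)} is not a multiple of {width}")
    return [flat[i:i + width] for i in range(0, len(flat), width)]


def unflatten_byte_record(record: Dict[str, Any]) -> Dict[str, Any]:
    """Restore the documented 2-D shapes of one byte_marginalized record."""
    out = {
        "patches": rows(record["patches"], PATCH_BYTES),
        "targets": rows(record["targets"], PATCH_BYTES),
        "teacher_probs": rows(record["teacher_probs"], BYTE_VOCAB),
        "teacher_mask": list(record["teacher_mask"]),
    }
    # teacher_hidden is optional; its row width is stored in teacher_dim.
    if record.get("teacher_hidden") is not None:
        out["teacher_hidden"] = rows(record["teacher_hidden"], record["teacher_dim"])
    # Every field must describe the same number of positions as the mask.
    seq_len = len(out["teacher_mask"])
    for name, value in out.items():
        if len(value) != seq_len:
            raise ValueError(f"{name} has {len(value)} rows, expected {seq_len}")
    return out
```

The length check at the end catches records whose fields have drifted out of sync, which is the failure mode the synchronized layout is meant to rule out.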

Provider:

Cukinator

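Collected summary

Dataset introduction

Construction

In knowledge distillation research, the quality of the training data strongly affects the resulting model. The CPU-1 Ablation Dataset is built from the Sample-10BT subset of the HuggingFaceFW/fineweb corpus, from which about 850,000 documents were taken as source text. The pretrained model Qwen/Qwen2.5-3B runs a forward pass over this text, and the teacher's hidden states and log probabilities are extracted using two distillation methods: BPE-to-byte marginalization and raw BPE logits. Every sequence has a fixed length of 5,000 items, and every 50 sequences form one shard, which matches the multi-processor streaming logic and keeps the two sub-datasets aligned.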
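The fixed sequence length and shard size make the raw footprint of a shard easy to estimate from the field shapes in the dataset card. The sketch below is only back-of-the-envelope arithmetic: it assumes one byte per `bool`, counts uncompressed in-memory bytes, and says nothing about the on-disk format, which the card does not specify.

```python
SEQ_LEN = 5000        # items per sequence (dataset card)
SEQS_PER_SHARD = 50   # sequences per shard (dataset card)
PATCH_BYTES = 4       # bytes per patch
BYTE_VOCAB = 256      # width of teacher_probs
TOP_K = 128           # width of teacher_probs_bpe / teacher_ids_bpe

# Bytes per element for the documented dtypes (bool as 1 byte is an assumption).
UINT8, BOOL, INT32, FLOAT32 = 1, 1, 4, 4


def byte_sequence_bytes(teacher_dim: int = 0) -> int:
    """Uncompressed bytes of one byte_marginalized sequence."""
    patches = SEQ_LEN * PATCH_BYTES * UINT8
    targets = SEQ_LEN * PATCH_BYTES * UINT8
    probs = SEQ_LEN * BYTE_VOCAB * FLOAT32
    mask = SEQ_LEN * BOOL
    hidden = SEQ_LEN * teacher_dim * FLOAT32  # optional field, 0 if absent
    return patches + targets + probs + mask + hidden


def bpe_sequence_bytes() -> int:
    """Uncompressed bytes of one bpe_tokenized sequence."""
    input_ids = SEQ_LEN * INT32
    labels = SEQ_LEN * INT32
    probs = SEQ_LEN * TOP_K * FLOAT32
    top_ids = SEQ_LEN * TOP_K * INT32
    mask = SEQ_LEN * BOOL
    return input_ids + labels + probs + top_ids + mask


if __name__ == "__main__":
    print("items per shard:", SEQ_LEN * SEQS_PER_SHARD)
    print("byte shard, no hidden:", byte_sequence_bytes() * SEQS_PER_SHARD)
    print("byte shard, dim 128:", byte_sequence_bytes(128) * SEQS_PER_SHARD)
    print("bpe shard:", bpe_sequence_bytes() * SEQS_PER_SHARD)
```

Under these assumptions the teacher probabilities dominate: the 256-wide byte distribution accounts for 5,120,000 of the 5,165,000 bytes in a byte-level sequence without hidden states.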

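Usage

In distillation and architecture-ablation studies, the dataset is the core data source for the CPU-1 Ablation Suite and exists to avoid rerunning the heavy teacher forward pass. Users can stream either the byte-marginalized or the BPE-tokenized subset, to train byte-level MLGRU models or to run ablations on BPE-level architectures, respectively. The teacher probabilities, masks, and hidden states can be used directly to compute distillation losses such as the embedding alignment loss, so the 15 CPU-1 ablation runs can all read from one dataset without recomputing teacher outputs, which shortens research iteration.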
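As an illustration of how the teacher probabilities and mask might enter a distillation loss, the sketch below computes a masked cross-entropy between the teacher's first-byte distribution and a student's logits. This is a generic soft-target loss written in plain Python for readability, not the loss used by the CPU-1 Ablation Suite, which the card does not describe; the function names and the choice to average over valid positions are assumptions.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one position."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def masked_first_byte_kd_loss(
    student_logits: Sequence[Sequence[float]],  # [seq_len, 256]
    teacher_probs: Sequence[Sequence[float]],   # [seq_len, 256]
    teacher_mask: Sequence[bool],               # [seq_len]
) -> float:
    """Mean cross-entropy H(teacher, student) over positions where the mask is True."""
    total, count = 0.0, 0
    for logits, probs, valid in zip(student_logits, teacher_probs, teacher_mask):
        if not valid:
            continue  # no teacher signal at this patch
        log_q = log_softmax(logits)
        total += -sum(p * lq for p, lq in zip(probs, log_q) if p > 0.0)
        count += 1
    return total / count if count else 0.0
```

Positions where `teacher_mask` is false contribute nothing, so a sequence with no valid teacher signal yields a loss of zero instead of a division error.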

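Background and challenges

Background overview

In knowledge distillation, efficiently transferring the capabilities of large language models to lightweight architectures matters for edge deployment. The CPU-1 Ablation Dataset was built by its authors from the Qwen/Qwen2.5-3B model and a subset of HuggingFaceFW/fineweb to provide pre-computed teacher outputs for the CPU-1 Byte-Level architecture. With exact alignment at both the byte level and the BPE token level, it supports systematic ablation studies of the distillation process and gives model-compression and architecture experiments a common basis.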

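Current challenges

The dataset addresses the problem of aligning teacher signals with byte-level targets in knowledge distillation. For multi-byte UTF-8 sequences in particular, the offsets from BPE tokens to raw bytes must be exact, otherwise a token split inside a character distorts the targets. Building the dataset required an extraction engine that uses the offset mapping of the Rust Fast-Tokenizer to eliminate offset drift, keeps the large-scale sequence data synchronized and aligned, and preserves a consistent structure under distributed streaming.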
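The offset-based slicing described above can be sketched as follows. The example assumes the tokenizer reports each token's span as character offsets into the original string, as Hugging Face fast tokenizers do for `offset_mapping` in Python; the offsets in the usage note are hand-written for illustration, and the function names are hypothetical, not taken from the extraction engine.

```python
from typing import List, Tuple


def char_to_byte_offsets(
    text: str, char_offsets: List[Tuple[int, int]]
) -> List[Tuple[int, int]]:
    """Convert (char_start, char_end) spans into UTF-8 (byte_start, byte_end) spans."""
    # prefix[i] is the number of UTF-8 bytes in text[:i].
    prefix = [0]
    for ch in text:
        prefix.append(prefix[-1] + len(ch.encode("utf-8")))
    return [(prefix[start], prefix[end]) for start, end in char_offsets]


def token_byte_slices(text: str, char_offsets: List[Tuple[int, int]]) -> List[bytes]:
    """Slice the original UTF-8 bytes per token instead of decoding token pieces."""
    raw = text.encode("utf-8")
    return [raw[start:end] for start, end in char_to_byte_offsets(text, char_offsets)]
```

For the string `"hi 😀!"` with hypothetical token spans `[(0, 2), (2, 4), (4, 5)]`, the byte spans come out as `[(0, 2), (2, 7), (7, 8)]`: the emoji occupies four bytes, and because every slice is taken from the original byte string, concatenating the slices reproduces the input exactly.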

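Common scenarios

Classic use cases

In knowledge distillation, the CPU-1 Ablation Dataset supports research on byte-level language model architectures. Its pre-computed teacher log probabilities and hidden states are intended for transferring the knowledge of Qwen2.5-3B into the lightweight CPU-1 byte-level architecture. The classic use case is training byte-level MLGRU models, alongside ablation experiments on BPE-level architectures, so that the effect of different encoding strategies on model performance can be evaluated systematically.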
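For the BPE-level ablations, the teacher signal is sparse: only the Top-128 probabilities and their vocabulary IDs are stored per position. One common way to use such a signal is a cross-entropy restricted to the stored IDs, sketched below for a single position. Renormalizing the truncated teacher mass to 1 is an assumption made here for illustration; the card does not say how the CPU-1 Ablation Suite treats the missing tail, and `topk_kd_loss` is a hypothetical name.

```python
import math
from typing import Sequence


def topk_kd_loss(
    student_logits: Sequence[float],  # [vocab_size] for one position
    teacher_ids: Sequence[int],       # [k] vocabulary IDs of the teacher's top-k
    teacher_probs: Sequence[float],   # [k] teacher probabilities for those IDs
    renormalize: bool = True,
) -> float:
    """Cross-entropy between a top-k teacher distribution and the student softmax."""
    # log of the student's partition function, computed stably
    m = max(student_logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in student_logits))
    mass = sum(teacher_probs)
    if mass <= 0.0:
        return 0.0  # no usable teacher signal at this position
    scale = 1.0 / mass if renormalize else 1.0
    return -sum(
        scale * p * (student_logits[i] - log_z)
        for i, p in zip(teacher_ids, teacher_probs)
        if p > 0.0
    )
```

With `renormalize=False` the loss is weighted by the teacher mass that survived truncation, so positions where the teacher was uncertain contribute less; which behavior is preferable depends on the experiment.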

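Research problems addressed

The dataset addresses the high computational cost of the teacher forward pass in knowledge distillation. Because the teacher's output signals are extracted once and stored, researchers avoid rerunning the large teacher model, which lowers the cost of each experiment. The dataset also guarantees exact BPE-to-byte offsets, which avoids offset drift on multi-byte UTF-8 sequences such as emoji or Chinese characters and gives byte-level language modeling a reliable data foundation.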

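Practical applications

In practice, the CPU-1 Ablation Dataset serves the development and optimization of efficient, lightweight language models. It supports work toward natural language processing systems that run in resource-constrained settings such as embedded devices or edge computing. Through knowledge distillation, a small model can inherit part of the teacher's language understanding and approach the teacher's performance on tasks such as text generation and machine translation while needing far less compute and storage.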

Recent research on the dataset