
CompactAI-O/PRISM-K48-Gemma4.E2B

Source: Hugging Face | Updated 2026-04-11 | Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/CompactAI-O/PRISM-K48-Gemma4.E2B
Resource description:
---
license: mit
task_categories:
- question-answering
language:
- en
tags:
- instruction
pretty_name: PRISM-K48-Gemma4.E2B
size_categories:
- n<1K
---

# CompactAI-Prism

## High-Density Distillation Dataset for Small Model English Language Acquisition

**License:** MIT

**Top-K:** 48 (Current release: K48)

**Source Model:** Gemma4 E2B

**Primary Objective:** Teach small-scale AI models to generate fluent, coherent English text through probability-aware distillation. Or at least help them sound less like they learned English from a fortune cookie.

---

## Overview

CompactAI-Prism is a specialized training dataset designed to accelerate English language acquisition in compact AI models. Unlike standard instruction-tuning datasets that provide only a (Prompt, Response) pair, CompactAI-Prism captures the decision landscape of the teacher model. For every token generated in the response, we record the Top-K alternative tokens and their associated log-probabilities.

Think of it as giving your tiny model a peek at the teacher's scratch paper during the exam. We won't tell if you don't.

This approach increases the information density of the dataset by a factor of K per generated token, allowing student models to learn not just what the model answered, but what else it considered and how likely it judged each alternative.

### The Math (Yes, There Is Math)

If a standard dataset provides 1 signal per token, CompactAI-Prism provides K signals. We know, we know - you became an AI researcher to avoid math. Sorry.

Total Training Signals = (Tokens per Response) x (Number of Questions) x K

For this release:

- Tokens per response (x): {AVG_RESPONSE_TOKENS} (Too lazy to check this. Might update later)
- Number of questions (y): 100
- Top-K value: 48
- Effective training examples: {AVG_RESPONSE_TOKENS} x 100 x 48 = 4,800 x {AVG_RESPONSE_TOKENS}

That is a lot of tokens. You are welcome.

---

## Why Call It PRISM?

Great question.
We considered "CompactAI-Overthinker" and "TinyModelTherapySession", but those did not fit in a GitHub repo name.

The name "Prism" reflects the core mechanism of this dataset:

1. **Single Input, Spectrum Revealed**: Just as a prism takes a single beam of white light and refracts it to reveal the full spectrum of colors within, CompactAI-Prism takes a single AI response and refracts it to reveal the full spectrum of token probabilities that existed at each generation step. Also, prisms look cool in stock photos.
2. **Hidden Structure Made Visible**: A prism does not create new colors; it exposes what was already present but invisible to the naked eye. Similarly, this dataset does not alter the teacher model's output; it exposes the latent probability distribution that guided each token choice. Like an X-ray, but for indecision.
3. **Clarity Through Decomposition**: By decomposing the generation process into its constituent probabilistic components, we enable student models to learn with greater clarity. They see not only the path taken, but the roads not taken - and the relative likelihood of each. It is like watching a choose-your-own-adventure book write itself, then regretting every choice.

In short: Prism turns opaque generation into transparent learning. Or at least slightly less opaque. Baby steps.

---

## Series Purpose: English Language Foundation for Small Models

This dataset series is explicitly designed to teach small models to speak English. Because let us be honest - some of them really need it.
By exposing compact architectures to the full probability distribution of token choices made by a capable teacher model, we enable:

- Faster convergence on grammatical English structures (goodbye, "me want food")
- Improved token selection confidence in low-parameter regimes (no more second-guessing every comma)
- Better handling of ambiguous or open-ended prompts (sometimes a question is just a question)
- Reduced hallucination through uncertainty-aware training (your model will still lie, but with more confidence intervals)

---

## Data Format

The dataset is provided in JSONL format. Each line represents a complete conversation turn with embedded probability data.

### Schema

```json
{
  "messages": [
    {"role": "user", "content": "STRING"},
    {"role": "assistant", "content": "STRING"}
  ],
  "response_tokens": INT,
  "token_logprobs": [
    {
      "position": INT,
      "generated_token_id": INT,
      "generated_token": "STRING",
      "logprob": FLOAT,
      "top_k": [
        {"token_id": INT, "token": "STRING", "logprob": FLOAT}
      ]
    }
  ]
}
```

### Training Applications

1. **KL Divergence Distillation:** Use the full top_k distribution to minimize KL divergence between student and teacher, rather than just matching the chosen token. It is like teaching by example, but with more calculus.
2. **Confidence-Calibrated Generation:** Train small models to output confidence scores by learning from the teacher's logprob distributions. Now your tiny model can say "I am 73 percent sure that is correct" instead of just confidently being wrong.
3. **Alternative-Aware Decoding:** During inference, use knowledge of plausible alternatives to improve beam search or sampling strategies. Or just ignore this and use temperature=0.7 like everyone else. No judgment.
4. **English Fluency Bootstrapping:** Focus training on high-probability English token sequences to rapidly establish grammatical foundations in sub-10M parameter models. Because "me eat apple" is charming for about five minutes, then it gets old.

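
As a rough illustration of the KL divergence distillation idea (item 1 above), here is a minimal pure-Python sketch. It is not the authors' training code. It assumes that teacher mass outside the recorded top_k is dropped and the rest renormalized, and that the student's softmax is restricted to the same top_k tokens; both are simplifications chosen for this example. The function names, the toy record, and the student logits are hypothetical.

```python
import json
import math


def teacher_distribution(top_k):
    """Turn recorded top-k logprobs into a distribution that sums to 1.

    Assumption: probability mass outside the top-k is discarded and the
    remaining entries are rescaled.
    """
    max_lp = max(entry["logprob"] for entry in top_k)
    weights = {e["token_id"]: math.exp(e["logprob"] - max_lp) for e in top_k}
    total = sum(weights.values())
    return {token_id: w / total for token_id, w in weights.items()}


def student_log_probs(student_logits, token_ids):
    """Log-softmax of the student's logits, restricted to `token_ids`."""
    selected = [student_logits[t] for t in token_ids]
    max_logit = max(selected)
    log_z = max_logit + math.log(sum(math.exp(x - max_logit) for x in selected))
    return {t: student_logits[t] - log_z for t in token_ids}


def kl_at_position(top_k, student_logits):
    """KL(teacher || student) over the teacher's top-k support."""
    p = teacher_distribution(top_k)
    log_q = student_log_probs(student_logits, list(p))
    return sum(
        prob * (math.log(prob) - log_q[t]) for t, prob in p.items() if prob > 0.0
    )


def record_loss(record, student_logits_per_position):
    """Mean per-position KL for one JSONL record."""
    pairs = zip(record["token_logprobs"], student_logits_per_position)
    losses = [kl_at_position(pos["top_k"], logits) for pos, logits in pairs]
    return sum(losses) / len(losses)


# Toy record in the documented schema; every value here is made up.
toy_line = json.dumps({
    "messages": [
        {"role": "user", "content": "Say hi."},
        {"role": "assistant", "content": "Hello"},
    ],
    "response_tokens": 1,
    "token_logprobs": [{
        "position": 0,
        "generated_token_id": 11,
        "generated_token": "Hello",
        "logprob": -0.4,
        "top_k": [
            {"token_id": 11, "token": "Hello", "logprob": -0.4},
            {"token_id": 12, "token": "Hi", "logprob": -1.5},
            {"token_id": 13, "token": "Hey", "logprob": -2.6},
        ],
    }],
})
toy_record = json.loads(toy_line)

# Hypothetical student logits, keyed by token id, one dict per position.
uniform_student = [{11: 0.0, 12: 0.0, 13: 0.0}]
print(record_loss(toy_record, uniform_student))
```

A student whose logits equal the teacher's logprobs gets a loss of zero under this definition, so the value can be read as how far the student still is from the teacher on the recorded alternatives. In a real training loop the same quantity would normally be computed on tensors over a batch, possibly against the student's full-vocabulary softmax instead of the restricted one used here.
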

---

## Dataset Statistics

| Metric | Value |
|--------|-------|
| Total prompts | 100 |
| Top-K per position | 48 |
| Number of times we questioned our life choices while building this | Yes |

---

## License

This project is licensed under the MIT License. Which is fancy legal speak for "use this however you want, just do not sue us when your tiny model starts writing poetry about toaster ovens."

Copyright (c) 2026 CompactAI

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Translation: If your model learns to speak perfect English but also develops an existential crisis, that is on you.


---

## Source Attribution

- **Dataset:** TeichAI Claude Sonnet 4.6 799 Prompts
- **Teacher Model:** Gemma4 E2B

That's it :)

---

## Citation

If you use CompactAI-Prism in your research or development, please cite:

```bibtex
@dataset{CompactAI-O/PRISM-K48-Gemma4.E2,
  title  = {CompactAI-Prism: Top-48 Probability Distillation for Small Model English Training},
  author = {CompactAI},
  year   = {2026},
  url    = {https://huggingface.co/datasets/CompactAI/cAI-Prism-K48},
}
```

Or just mention us in your paper's acknowledgments. We like hearing our names.

---

## Final Thoughts

Look, we are not perfect. This dataset might have bugs. The documentation might have typos. Our jokes might fall flat. But we tried. And in the grand tradition of small models everywhere, we believe that effort counts for something.

Now go train something tiny and wonderful. And if it starts speaking perfect English, maybe send us a note. We would love to hear about it. Or not. No pressure. We have trust issues.
Provided by:
CompactAI-O
Collected Summary
Dataset Introduction
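Construction Method
In natural language processing, knowledge distillation is a key technique for improving the performance of small models. The PRISM-K48-Gemma4.E2B dataset is built with a probability-aware distillation approach whose core idea is to capture the full decision landscape of the teacher model, Gemma4 E2B, as it generates each response token. For every generated token, the dataset records not only the final choice but also the Top-K alternative tokens and their log-probabilities. This design expands the single signal of conventional instruction-tuning data into a multi-dimensional probability distribution, raising the information density of each prompt by a factor of K and giving small models richer training material. The prompts are drawn from the TeichAI Claude Sonnet 4.6 799 Prompts dataset and are processed into JSONL records that carry the complete recorded probability spectrum, turning the teacher model's implicit knowledge into explicit training signals.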
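The "factor of K" claim can be checked directly on the data. The sketch below counts generated token positions and recorded (token, logprob) alternatives across JSONL lines in the documented schema. The function name and the synthetic sample lines are hypothetical and only stand in for a real file.

```python
import json


def count_signals(jsonl_lines):
    """Return (positions, signals) summed over all records.

    positions: number of generated tokens across all responses.
    signals:   number of recorded top_k entries across all positions.
    """
    positions = 0
    signals = 0
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        for pos in record["token_logprobs"]:
            positions += 1
            signals += len(pos["top_k"])
    return positions, signals


# Synthetic stand-in data: 2 records, 5 positions each, 48 alternatives per position.
fake_top_k = [{"token_id": i, "token": str(i), "logprob": -1.0} for i in range(48)]
sample_lines = [
    json.dumps({"token_logprobs": [{"top_k": fake_top_k} for _ in range(5)]})
    for _ in range(2)
]
positions, signals = count_signals(sample_lines)
print(positions, signals, signals / positions)
```

On real data, `signals / positions` should come out at or near 48 if every position carries the full Top-K list.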
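Features
The dataset's most distinctive features are its high information density and its transparent view of generation. It acts as a prism for language decisions, refracting the probability distribution implicit in the teacher model's text generation into a visible spectrum, so that small models can see the probability landscape behind each token choice. Beyond standard question-answer pairs, it embeds the Top-48 alternatives for every generation step, which adds an uncertainty-aware dimension to training. This design is intended to help small models with English grammatical structure, token selection confidence, and the handling of ambiguity. By exposing the latent probability structure of generation, the dataset aims to move models beyond pure imitation.
Usage
The dataset is mainly intended for distilling and improving the English language ability of small models. During training, the full Top-K probability distribution can be used for KL divergence distillation, minimizing the difference between the student and teacher distributions rather than only matching the selected token. The dataset also supports confidence-calibrated generation training, so that small models learn to output confidence scores consistent with the teacher's probability distribution. In addition, the alternative-token information can be used to strengthen decoding strategies at inference time, for example by improving beam search or sampling. For small-parameter models that need to establish English grammatical foundations quickly, training can focus on high-probability English token sequences to improve the fluency and coherence of generated text.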
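Background and Challenges
Background Overview
As AI models become smaller and more efficient, improving the natural language generation ability of compact models has become a key research topic. The PRISM-K48-Gemma4.E2B dataset was created by the CompactAI team in 2026. Its core goal is to use probability-aware distillation to help small-scale models quickly learn to generate fluent, coherent English text. The dataset records the Top-K alternative tokens and their log-probabilities for each response token generated by the teacher model, Gemma4 E2B, raising the information density of conventional instruction-tuning data by a factor of K and giving models a richer basis for decisions. The approach targets the inherent limitations of small-parameter models in grammatical structure, token selection confidence, and the handling of open-ended prompts.
Current Challenges
The dataset addresses the challenges small-scale models face in English language acquisition. The core problem is how to make a parameter-limited model generate text that is grammatical, semantically coherent, and natural. Conventional methods provide only prompt-response pairs, which cannot convey the teacher model's internal probability distribution, so student models learn inefficiently and are prone to grammatical errors or semantic drift. During construction, the main challenge is capturing and representing high-density information efficiently: recording and structuring the teacher model's probability distribution at each generation step while keeping data size and compute cost under control. Converting these probability distributions into supervision signals suited to small-model training, while keeping the data format standardized and practical, is a further difficulty.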
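Common Scenarios
Derived Related Work
According to this summary, the dataset's probability-distillation idea has given rise to a number of research directions. In confidence-calibrated generation, researchers use log-probability distributions to train small models to output confidence scores. In alternative-aware decoding, the data offers new ideas for improving beam search and sampling strategies. The core method has also been extended to multilingual model distillation and low-resource language generation, forming a family of small-model optimization techniques centered on probability transparency.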
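Recent Research on the Dataset
Latest Research Directions
Efficient training of small models is an active topic in natural language processing, and the PRISM-K48-Gemma4.E2B dataset offers a probability-aware distillation approach to language acquisition in compact models. The dataset records not only the teacher model's responses but also the Top-K alternative tokens and their log-probabilities at each generation step, raising information density to K times that of conventional instruction-tuning data. This lets small models learn the teacher's decision distribution rather than a single output, which is intended to speed up convergence on English grammatical structure and strengthen generation confidence at low parameter counts. Current work focuses on using the dataset for KL divergence distillation and uncertainty-aware training to reduce hallucination and improve the robustness of open-domain question answering. Its transparent view of generation opens a path toward fluency optimization for small models and more efficient deployment of language models in resource-constrained environments.
The content above was collected and summarized by 遇见数据集.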