
ORKG Comparison Small Language Models (SLMs)

Source: DataCite Commons. Updated 2024-10-21; indexed 2025-04-16.
Download link:
https://orkg.org/comparison/R753434
Description:
This comparison presents a state-of-the-art (SOTA) analysis and statistics of Small Language Models (SLMs) as of October 2024. It is based on the survey "SMALL LANGUAGE MODELS: SURVEY, MEASUREMENTS, AND INSIGHTS" by Zhenyan Lu, Xiang Li, Dongqi Cai, Rongjie Yi, Fangming Liu, Xiwen Zhang, Nicholas D. Lane, and Mengwei Xu, and includes the detailed table presented as Table 1 of that article. The SLMs considered are transformer-based, decoder-only models with 100M–5B parameters, designed for resource-efficient deployment on devices with constrained hardware, such as commodity laptops and high-end mobile devices (smartphones, tablets, etc.). The comparison analyzes 53 SOTA SLMs, focusing on their architectures, training datasets, and training algorithms. The key insights provided in the article are the following:

1) SLM Architecture: Group-query attention, gated FFN with SiLU activation, RMS normalization, and larger vocabulary sizes are becoming standard. Innovations such as parameter sharing and layer-wise parameter scaling significantly affect runtime performance.
2) Training Datasets: Data quality outweighs quantity, and model-based filtering improves dataset quality. Recent SLMs are "over-trained": they are trained on more tokens than the Chinchilla law recommends.
3) Training Algorithms: Techniques such as Maximal Update Parameterization, Knowledge Distillation, and Two-Stage Pre-training improve model capability.
4) SLM Capabilities: SLMs improved significantly from 2022 to 2024, closing the gap with LLMs. Larger models generally perform better, but smaller models can excel in specific tasks. SLMs trained on open-source datasets are closing the gap with those trained on closed-source datasets in commonsense tasks.
5) In-context Learning: Most SLMs benefit from in-context learning, especially in complex tasks. Larger models exhibit stronger in-context learning capabilities.
6) Runtime Cost: Model architecture affects latency, especially during the prefill stage. Quantization benefits are more significant during the decode stage. GPUs significantly outperform CPUs in the prefill stage. Context length is crucial for runtime memory usage, with the KV cache taking up a large portion of memory.
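The link between context length, attention design, and KV cache memory can be illustrated with a back-of-the-envelope estimate. The Python sketch below is not taken from the survey: the function name `kv_cache_bytes`, the helper `compare`, and the layer, head, and dimension values are hypothetical, chosen only to resemble a roughly 1B-parameter model. It assumes 16-bit cache values and no cache quantization or paging.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len,
                   batch_size=1, bytes_per_value=2):
    """Estimate KV cache size for a decoder-only transformer.

    Every layer stores one key vector and one value vector of length
    head_dim per KV head per token, hence the leading factor of 2.
    """
    return (2 * n_layers * n_kv_heads * head_dim
            * context_len * batch_size * bytes_per_value)


# Hypothetical configuration (illustrative only, not a model from the survey).
N_LAYERS, N_HEADS, HEAD_DIM = 22, 32, 64


def compare(context_len, n_kv_heads_gqa=4):
    """Return (multi-head, group-query) cache sizes in bytes."""
    mha = kv_cache_bytes(N_LAYERS, N_HEADS, HEAD_DIM, context_len)
    gqa = kv_cache_bytes(N_LAYERS, n_kv_heads_gqa, HEAD_DIM, context_len)
    return mha, gqa


if __name__ == "__main__":
    for ctx in (1024, 8192, 32768):
        mha, gqa = compare(ctx)
        print(f"context={ctx:>6}  "
              f"MHA={mha / 2**20:8.1f} MiB  GQA={gqa / 2**20:8.1f} MiB")
```

Under these assumptions the cache grows linearly with context length, and sharing key/value heads across query heads (here 4 KV heads for 32 query heads) shrinks it by the same factor of 8. Actual figures depend on the model and the inference engine.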

Provider:
Open Research Knowledge Graph
Created:
2024-10-21