ORKG Comparison Small Language Models (SLMs)
Source: DataCite Commons | Updated: 2024-10-24 | Indexed: 2025-04-16
Download link:
https://orkg.org/comparison/R753854
Resource description:
This comparison covers the state of the art (SOTA), analysis, and statistics of Small Language Models (SLMs) as of October 2024. It is based on the survey "SMALL LANGUAGE MODELS: SURVEY, MEASUREMENTS, AND INSIGHTS" by Zhenyan Lu, Xiang Li, Dongqi Cai, Rongjie Yi, Fangming Liu, Xiwen Zhang, Nicholas D. Lane, and Mengwei Xu, and includes the detailed table presented as Table 1 of that article. The SLMs considered are transformer-based, decoder-only models with 100M–5B parameters, designed for resource-efficient deployment on devices with constrained hardware, such as commodity laptops and high-end mobile devices (smartphones, tablets). This comparison analyzes 53 SOTA SLMs, focusing on their architectures, training datasets, and algorithms. The key insights provided in the article include the following:
1) SLM Architecture: Grouped-query attention, gated FFN with SiLU activation, RMS normalization, and larger vocabulary sizes are becoming standard. Innovations like parameter sharing and layer-wise parameter scaling impact runtime performance significantly.
2) Training Datasets: Data quality outweighs quantity; model-based filtering improves dataset quality. Recent SLMs are "over-trained" on far more tokens than the Chinchilla scaling law recommends.
3) Training Algorithms: Techniques like Maximal Update Parameterization, Knowledge Distillation, and Two-Stage Pre-training improve model capability.
4) SLM Capabilities: SLMs have shown significant performance improvements from 2022 to 2024, closing the gap with LLMs. Larger models generally perform better, but smaller models can excel in specific tasks. Open-source datasets are closing the gap with closed-source datasets in commonsense tasks.
5) In-context Learning: Most SLMs benefit from in-context learning, especially in complex tasks. Larger models exhibit stronger in-context learning capabilities.
6) Runtime Cost: Model architecture impacts latency, especially during the prefill stage. Quantization benefits are more significant during the decode stage. GPUs outperform CPUs significantly in the prefill stage. Context length is crucial for runtime memory usage, with the KV cache taking up a large portion of memory.
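
Two of the insights above (the over-training in item 2 and the KV cache memory in item 6) lend themselves to back-of-the-envelope arithmetic. The following is a minimal sketch, not taken from the survey. It assumes the standard KV cache size formula (keys plus values, per layer, per KV head) and the commonly quoted rule of thumb of roughly 20 training tokens per parameter for Chinchilla-optimal training. The function names and the example model configuration (22 layers, head dimension 64, 1B parameters, 3T tokens) are hypothetical and chosen only for illustration.

```python
# Rough estimates only; all names and example numbers are illustrative assumptions.

# Approximate Chinchilla rule of thumb: about 20 training tokens per parameter.
CHINCHILLA_TOKENS_PER_PARAM = 20


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len,
                   batch_size=1, bytes_per_value=2):
    """Estimate KV cache size in bytes.

    The leading factor 2 accounts for storing both keys and values.
    bytes_per_value=2 corresponds to 16-bit (fp16/bf16) cache entries.
    """
    return (2 * n_layers * n_kv_heads * head_dim
            * context_len * batch_size * bytes_per_value)


def overtraining_factor(train_tokens, n_params):
    """How many times the training tokens exceed the ~20 tokens/param rule."""
    return (train_tokens / n_params) / CHINCHILLA_TOKENS_PER_PARAM


# Hypothetical decoder-only SLM: 22 layers, head_dim 64, 2048-token context.
# Multi-head attention keeps one KV head per query head (32 here);
# grouped-query attention shares KV heads across query heads (4 here).
mha_bytes = kv_cache_bytes(n_layers=22, n_kv_heads=32, head_dim=64, context_len=2048)
gqa_bytes = kv_cache_bytes(n_layers=22, n_kv_heads=4, head_dim=64, context_len=2048)

# Hypothetical training budget: 1B parameters trained on 3T tokens.
factor = overtraining_factor(train_tokens=3_000_000_000_000, n_params=1_000_000_000)

print(f"MHA KV cache: {mha_bytes / 2**20:.0f} MiB")
print(f"GQA KV cache: {gqa_bytes / 2**20:.0f} MiB")
print(f"Over-training factor vs. rule of thumb: {factor:.0f}x")
```

In this sketch the cache grows linearly with context length, and reducing the number of KV heads from 32 to 4 shrinks it by a factor of 8, which is one reason grouped-query attention and context length both matter for runtime memory.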
Provider:
Open Research Knowledge Graph
Created:
2024-10-24



