StentorLabs/Portimbria-150M-Vs.-SmolLM2-135M

Name: StentorLabs/Portimbria-150M-Vs.-SmolLM2-135M
Creator: StentorLabs
Published: 2026-04-23 17:32:42
License: No description provided

Source: Hugging Face. Updated 2026-04-23; indexed 2026-04-26.

Download link:

https://hf-mirror.com/datasets/StentorLabs/Portimbria-150M-Vs.-SmolLM2-135M

Resource summary:

This dataset contains the full text of a secondary comparative analysis of two sub-200M-parameter language models. Portimbria-150M (StentorLabs, 2026) is a 151M-parameter decoder-only model trained on roughly 6B tokens at zero financial cost using Kaggle's free-tier TPU v5e-8. SmolLM2-135M (Hugging Face, 2025) is a 135M-parameter model trained on 2 trillion tokens. The analysis covers seven dimensions: architectural design, training data curation, training infrastructure and compute cost, optimizer and learning rate schedule configuration, scaling law positioning, benchmark evaluation across eight tasks, and deployment characteristics. Key findings: SmolLM2-135M leads on the majority of standard benchmarks by 10 to 15 percentage points, while Portimbria-150M leads on TruthfulQA MC2. The dataset is intended to support reproducible NLP research and to serve as a reference for research and document retrieval.
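The "scaling law positioning" dimension can be made concrete with a rough tokens-per-parameter calculation. The Python sketch below is an illustration only: the parameter and token counts are taken from the summary above (the ~6B figure is approximate), while the reference ratio of about 20 tokens per parameter is the commonly quoted compute-optimal rule of thumb and is an assumption brought in here, not a figure from the dataset. The names `MODELS`, `REFERENCE_RATIO`, and `summarize` are hypothetical.

```python
# Rough tokens-per-parameter comparison for the two models in the summary.
# Parameter and token counts come from the listing text; REFERENCE_RATIO is
# an assumed rule of thumb (~20 tokens per parameter), not a value from the
# dataset itself.

MODELS = {
    "Portimbria-150M": {"params": 151e6, "tokens": 6e9},   # ~6B tokens (approximate)
    "SmolLM2-135M": {"params": 135e6, "tokens": 2e12},     # 2 trillion tokens
}

REFERENCE_RATIO = 20.0  # assumption: commonly cited compute-optimal heuristic


def summarize(models, reference_ratio=REFERENCE_RATIO):
    """Return tokens per parameter and its multiple of the reference ratio."""
    rows = {}
    for name, spec in models.items():
        ratio = spec["tokens"] / spec["params"]
        rows[name] = {
            "tokens_per_param": ratio,
            "multiple_of_reference": ratio / reference_ratio,
        }
    return rows


RATIOS = summarize(MODELS)

if __name__ == "__main__":
    for name, row in RATIOS.items():
        print(
            f"{name}: {row['tokens_per_param']:,.1f} tokens/param "
            f"({row['multiple_of_reference']:,.1f}x the assumed reference)"
        )
```

Under these assumptions, Portimbria-150M sits within a small multiple of the reference ratio, while SmolLM2-135M is trained hundreds of times past it. That gap is one plausible lens for reading the reported benchmark differences, though the analysis itself should be consulted for its actual scaling-law argument.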

Provided by:

StentorLabs
