Polygl0t/portuguese-eval-logs-olmo2-smollm3

Name: Polygl0t/portuguese-eval-logs-olmo2-smollm3
Creator: Polygl0t
Published: 2026-03-05 09:02:04
License: apache-2.0

Source: Hugging Face. Updated 2026-03-05; indexed 2026-04-05.

Download link:

https://hf-mirror.com/datasets/Polygl0t/portuguese-eval-logs-olmo2-smollm3


Description:

---
dataset_info:
  features:
  - name: Checkpoint
    dtype: string
  - name: Step
    dtype: int64
  - name: ASSIN2 RTE
    dtype: float64
  - name: ASSIN2 STS
    dtype: float64
  - name: BLUEX
    dtype: float64
  - name: ENEM
    dtype: float64
  - name: FAQUAD NLI
    dtype: float64
  - name: HateBR
    dtype: float64
  - name: OAB
    dtype: float64
  - name: PT Hate Speech
    dtype: float64
  - name: TweetSentBR
    dtype: float64
  - name: ARC Challenge
    dtype: float64
  - name: ASSIN2 ENT
    dtype: float64
  - name: ASSIN2 PAR
    dtype: float64
  - name: BELEBELE
    dtype: float64
  - name: CALAME
    dtype: float64
  - name: Global PIQA
    dtype: float64
  - name: HellaSwag
    dtype: float64
  - name: LAMBADA
    dtype: float64
  - name: MMLU
    dtype: float64
  splits:
  - name: olmo2_1b
    num_bytes: 7039
    num_examples: 38
  - name: olmo2_7b
    num_bytes: 7915
    num_examples: 43
  - name: smollm3_3b
    num_bytes: 21346
    num_examples: 122
  download_size: 56104
  dataset_size: 36300
configs:
- config_name: default
  data_files:
  - split: smollm3_3b
    path: data/smollm3_3b-*
  - split: olmo2_1b
    path: data/olmo2_1b-*
  - split: olmo2_7b
    path: data/olmo2_7b-*
license: apache-2.0
language:
- pt
---

# Evaluation Logs on Portuguese Benchmarks for OLMo-2 and SmolLM3

These logs contain benchmark results across a suite of Portuguese-language tasks. They record the performance of three models at different checkpoints throughout their pretraining runs:

- [SmolLM3](https://huggingface.co/HuggingFaceTB/SmolLM3-3B-checkpoints)
- [OLMo-2-0425-1B](https://huggingface.co/allenai/OLMo-2-0425-1B)
- [OLMo-2-1124-7B](https://huggingface.co/allenai/OLMo-2-1124-7B)

## Splits

Each split (`smollm3_3b`, `olmo2_1b`, `olmo2_7b`) contains one row per model checkpoint and one column per benchmark score (e.g., ASSIN2 RTE, ENEM, BLUEX, OAB).

## Data Format

The `Checkpoint` column indicates the branch from which the checkpoint was taken (e.g., `stage1-step-40000`). The `Step` column indicates the training step at which the checkpoint was saved. The remaining columns show the model's performance on each benchmark.

```json
{
  "Checkpoint": "stage1-step-40000",
  "Step": 40000,
  "ASSIN2 RTE": 0.5369919728357186,
  "ASSIN2 STS": 0.1356046511596823,
  "BLUEX": 0.2002781641168289,
  "ENEM": 0.1980405878236529,
  "FAQUAD NLI": 0.5373977569778228,
  "HateBR": 0.5875147596226018,
  "OAB": 0.2323462414578587,
  "PT Hate Speech": 0.5555178900597524,
  "TweetSentBR": 0.3303917299738176,
  "ARC Challenge": 0.288034188034188,
  "ASSIN2 ENT": 0.58075,
  "ASSIN2 PAR": 0.6365,
  "BELEBELE": 0.23,
  "CALAME": 0.5130057803468208,
  "Global PIQA": 0.65,
  "HellaSwag": 0.3650449669519991,
  "LAMBADA": 0.4723462060935377,
  "MMLU": 0.2284599219453617
}
```

Consecutive checkpoints are roughly 100 billion tokens apart. Because checkpoint-saving frequency and batch size differ, the actual token count between checkpoints may vary slightly.

## How to Use

```python
from datasets import load_dataset

# Load the SmolLM3 split
ds = load_dataset("Polygl0t/portuguese-eval-logs-olmo2-smollm3", split="smollm3_3b")
```

## Benchmarks

The benchmarks are sourced from [lm-evaluation-harness](https://github.com/Polygl0t/lm-evaluation-harness) (branch: `polyglot_harness_portuguese`) and [lm-evaluation-harness-pt](https://github.com/eduagarcia/lm-evaluation-harness-pt), which provide a standardized set of Portuguese-language tasks for LLM evaluation. The following benchmarks are included in this log:

- **ENEM**: Brazilian high-school exam, Q&A format ([dataset](https://huggingface.co/datasets/eduagarcia/enem_challenge)).
- **BLUEX**: University entrance exam questions (Unicamp/Fuvest), Q&A format ([dataset](https://huggingface.co/datasets/eduagarcia-temp/BLUEX_without_images)).
- **OAB Exams** (column `OAB`): Brazilian Bar Association exam questions, Q&A format ([dataset](https://huggingface.co/datasets/eduagarcia/oab_exams)).
- **ASSIN2 RTE**: Textual entailment / natural language inference ([dataset](https://huggingface.co/datasets/eduagarcia/portuguese_benchmark/viewer/assin2-rte)).
- **ASSIN2 STS**: Semantic textual similarity ([dataset](https://huggingface.co/datasets/eduagarcia/portuguese_benchmark/viewer/assin2-sts)).
- **FAQUAD NLI**: Entailment task based on Portuguese reading comprehension ([dataset](https://huggingface.co/datasets/ruanchaves/faquad-nli)).
- **HateBR**: Abusive language detection in Brazilian Portuguese social media ([dataset](https://huggingface.co/datasets/eduagarcia/portuguese_benchmark)).
- **PT Hate Speech**: Hate speech detection in Portuguese tweets ([dataset](https://huggingface.co/datasets/eduagarcia/portuguese_benchmark/viewer/Portuguese_Hate_Speech_binary)).
- **TweetSentBR**: Sentiment analysis on Brazilian Portuguese tweets ([dataset](https://huggingface.co/datasets/eduagarcia/tweetsentbr_fewshot)).
- **ARC Challenge**: Multiple-choice grade-school science questions (Portuguese translation) ([dataset](https://huggingface.co/datasets/Polygl0t/ARC-poly)).
- **ASSIN2 ENT**: Textual entailment (natural language inference), non-generative ([dataset](https://huggingface.co/datasets/nilc-nlp/assin2)).
- **ASSIN2 PAR**: Paraphrase detection from the ASSIN2 dataset ([dataset](https://huggingface.co/datasets/nilc-nlp/assin2)).
- **BELEBELE**: Multilingual reading comprehension (Portuguese subset) ([dataset](https://huggingface.co/datasets/facebook/belebele)).
- **CALAME**: Predict the last word of a passage, Portuguese version (similar to LAMBADA) ([dataset](https://huggingface.co/datasets/Polygl0t/CALAME-PT)).
- **Global PIQA**: Physical commonsense reasoning (Brazilian Portuguese subset) ([dataset](https://huggingface.co/datasets/mrlbenchmarks/global-piqa-nonparallel)).
- **HellaSwag**: Commonsense inference (Portuguese translation) ([dataset](https://huggingface.co/datasets/Polygl0t/Hellaswag-poly)).
- **LAMBADA**: Predict the last word of a passage (Portuguese translation) ([dataset](https://huggingface.co/datasets/Polygl0t/LAMBADA-poly)).
- **MMLU**: Multitask language understanding (Portuguese translation) ([dataset](https://huggingface.co/datasets/Polygl0t/MMLU-poly)).

## Usage and Purpose

- **Benchmark Analysis:** Track how model performance evolves during pretraining.
- **Evaluation Research:** Assess the reliability and signal quality of different benchmarks.
- **Model Comparison:** Compare Portuguese language understanding across different LLMs and training regimes.

**For example, SmolLM3-3B performance on the ENEM benchmark across checkpoints:**

![SmolLM3-3B ENEM Performance](./.plots/smolLM3_ENEM.png)

Other plots can be found in the [`.plots`](https://huggingface.co/datasets/Polygl0t/portuguese-eval-logs-olmo2-smollm3/tree/main/.plots) directory.

### Citation Information

```latex
@misc{correa2026tucano2cool,
  title={{Tucano 2 Cool: Better Open Source LLMs for Portuguese}},
  author={Nicholas Kluge Corr{\^e}a and Aniket Sen and Shiza Fatimah and Sophia Falk and Lennard Landgraf and Julia Kastner and Lucie Flek},
  year={2026},
  eprint={2603.03543},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2603.03543},
}
```

### Acknowledgments

Polyglot is a project funded by the Federal Ministry of Education and Research (BMBF) and the Ministry of Culture and Science of the State of North Rhine-Westphalia (MWK) as part of TRA Sustainable Futures (University of Bonn) and the Excellence Strategy of the federal and state governments.

We also gratefully acknowledge the access granted to the [Marvin cluster](https://www.hpc.uni-bonn.de/en/systems/marvin), hosted by the [University of Bonn](https://www.uni-bonn.de/en), along with the support provided by its High Performance Computing & Analytics Lab.

## License

All data in this dataset is licensed under the [Apache License 2.0](LICENSE).

