Polygl0t/portuguese-eval-logs-olmo2-smollm3
Source: Hugging Face. Updated 2026-03-05; indexed 2026-04-05.
Download link:
https://hf-mirror.com/datasets/Polygl0t/portuguese-eval-logs-olmo2-smollm3
Resource description:
---
dataset_info:
  features:
  - name: Checkpoint
    dtype: string
  - name: Step
    dtype: int64
  - name: ASSIN2 RTE
    dtype: float64
  - name: ASSIN2 STS
    dtype: float64
  - name: BLUEX
    dtype: float64
  - name: ENEM
    dtype: float64
  - name: FAQUAD NLI
    dtype: float64
  - name: HateBR
    dtype: float64
  - name: OAB
    dtype: float64
  - name: PT Hate Speech
    dtype: float64
  - name: TweetSentBR
    dtype: float64
  - name: ARC Challenge
    dtype: float64
  - name: ASSIN2 ENT
    dtype: float64
  - name: ASSIN2 PAR
    dtype: float64
  - name: BELEBELE
    dtype: float64
  - name: CALAME
    dtype: float64
  - name: Global PIQA
    dtype: float64
  - name: HellaSwag
    dtype: float64
  - name: LAMBADA
    dtype: float64
  - name: MMLU
    dtype: float64
  splits:
  - name: olmo2_1b
    num_bytes: 7039
    num_examples: 38
  - name: olmo2_7b
    num_bytes: 7915
    num_examples: 43
  - name: smollm3_3b
    num_bytes: 21346
    num_examples: 122
  download_size: 56104
  dataset_size: 36300
configs:
- config_name: default
  data_files:
  - split: smollm3_3b
    path: data/smollm3_3b-*
  - split: olmo2_1b
    path: data/olmo2_1b-*
  - split: olmo2_7b
    path: data/olmo2_7b-*
license: apache-2.0
language:
- pt
---
# Evaluation Logs on Portuguese Benchmarks for OLMo-2 and SmolLM3
These logs contain benchmark results across a suite of Portuguese-language tasks. They record the performance of three models at different checkpoints throughout their pretraining runs:
- [SmolLM3](https://huggingface.co/HuggingFaceTB/SmolLM3-3B-checkpoints)
- [OLMo-2-0425-1B](https://huggingface.co/allenai/OLMo-2-0425-1B)
- [OLMo-2-1124-7B](https://huggingface.co/allenai/OLMo-2-1124-7B)
## Splits
Each split (`smollm3_3b`, `olmo2_1b`, `olmo2_7b`) contains rows for model checkpoints and columns for benchmark scores (e.g., ASSIN2 RTE, ENEM, BLUEX, OAB, etc.).
## Data Format
The `Checkpoint` column gives the name of the branch from which the checkpoint was taken (e.g., `stage1-step-40000`), and `Step` gives the training step at which the checkpoint was saved. The remaining columns hold the model's score on each benchmark.
```json
{
  "Checkpoint": "stage1-step-40000",
  "Step": 40000,
  "ASSIN2 RTE": 0.5369919728357186,
  "ASSIN2 STS": 0.1356046511596823,
  "BLUEX": 0.2002781641168289,
  "ENEM": 0.1980405878236529,
  "FAQUAD NLI": 0.5373977569778228,
  "HateBR": 0.5875147596226018,
  "OAB": 0.2323462414578587,
  "PT Hate Speech": 0.5555178900597524,
  "TweetSentBR": 0.3303917299738176,
  "ARC Challenge": 0.288034188034188,
  "ASSIN2 ENT": 0.58075,
  "ASSIN2 PAR": 0.6365,
  "BELEBELE": 0.23,
  "CALAME": 0.5130057803468208,
  "Global PIQA": 0.65,
  "HellaSwag": 0.3650449669519991,
  "LAMBADA": 0.4723462060935377,
  "MMLU": 0.2284599219453617
}
```
Consecutive checkpoints are roughly 100 billion tokens apart. Because checkpoint-saving frequency and batch size differ, the actual token count between checkpoints may vary slightly.
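As a rough illustration of the row format, the sketch below separates the two metadata fields from the benchmark scores and attaches an approximate token offset to each checkpoint. The helper names (`split_record`, `approx_tokens_since_first`) and the constant `TOKENS_PER_CHECKPOINT` are hypothetical, and the offset is only as accurate as the nominal 100-billion-token spacing, so treat it as an estimate rather than a logged value.

```python
import json

# Nominal spacing between checkpoints; the real spacing varies by model.
TOKENS_PER_CHECKPOINT = 100_000_000_000
META_COLUMNS = ("Checkpoint", "Step")


def split_record(record):
    """Separate the metadata fields from the benchmark scores of one row."""
    meta = {key: record[key] for key in META_COLUMNS}
    scores = {key: value for key, value in record.items() if key not in META_COLUMNS}
    return meta, scores


def approx_tokens_since_first(rows):
    """Map each checkpoint name to a rough token offset from the earliest row.

    Assumption: rows are evenly spaced once sorted by Step.
    """
    ordered = sorted(rows, key=lambda row: row["Step"])
    return {row["Checkpoint"]: i * TOKENS_PER_CHECKPOINT for i, row in enumerate(ordered)}


# Abbreviated copy of the example record above (only two benchmark columns kept).
row = json.loads(
    '{"Checkpoint": "stage1-step-40000", "Step": 40000, '
    '"ENEM": 0.1980405878236529, "BLUEX": 0.2002781641168289}'
)
meta, scores = split_record(row)
```

Here `meta` holds `Checkpoint` and `Step`, while `scores` holds only the benchmark columns, which is convenient when iterating over all benchmarks without hard-coding their names.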
## How to Use
```python
from datasets import load_dataset
# Loads the SmolLM3 split
ds = load_dataset("Polygl0t/portuguese-eval-logs-olmo2-smollm3", split="smollm3_3b")
```
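Once a split is loaded, a common next step is to pull out the learning curve for a single benchmark. The following is a minimal sketch of that step, not part of the dataset's tooling: `score_curve` and `best_checkpoint` are hypothetical helpers, and the sample rows are placeholders (only the `stage1-step-40000` ENEM value comes from the example record above) so that the block runs without network access. Skipping `None` scores is a precaution; whether any scores are actually missing is an assumption.

```python
def score_curve(rows, benchmark):
    """Return (step, score) pairs for one benchmark, ordered by training step."""
    pairs = [(row["Step"], row[benchmark]) for row in rows if row[benchmark] is not None]
    return sorted(pairs)


def best_checkpoint(rows, benchmark):
    """Return the name of the checkpoint with the highest score on a benchmark."""
    scored = [row for row in rows if row[benchmark] is not None]
    return max(scored, key=lambda row: row[benchmark])["Checkpoint"]


# Placeholder rows for illustration; checkpoint names and the 0.30 / 0.25
# scores are made up and do NOT come from the dataset.
sample_rows = [
    {"Checkpoint": "stage1-step-80000", "Step": 80000, "ENEM": 0.30},
    {"Checkpoint": "stage1-step-40000", "Step": 40000, "ENEM": 0.1980405878236529},
    {"Checkpoint": "stage1-step-120000", "Step": 120000, "ENEM": 0.25},
]

curve = score_curve(sample_rows, "ENEM")
best = best_checkpoint(sample_rows, "ENEM")
```

With the `ds` object loaded above, `score_curve(ds, "ENEM")` should work the same way, since iterating over a `datasets.Dataset` yields one dict per row.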
## Benchmarks
The benchmarks are sourced from [lm-evaluation-harness](https://github.com/Polygl0t/lm-evaluation-harness) (branch: `polyglot_harness_portuguese`) and [lm-evaluation-harness-pt](https://github.com/eduagarcia/lm-evaluation-harness-pt), which provide a standardized set of Portuguese-language tasks for LLM evaluation.
The following benchmarks are included in this log:
- **ENEM**: Brazilian high-school exam, Q&A format ([dataset](https://huggingface.co/datasets/eduagarcia/enem_challenge)).
- **BLUEX**: University entrance exam questions (Unicamp/Fuvest), Q&A format ([dataset](https://huggingface.co/datasets/eduagarcia-temp/BLUEX_without_images)).
- **OAB Exams**: Brazilian Bar Association exam questions, Q&A format ([dataset](https://huggingface.co/datasets/eduagarcia/oab_exams)).
- **ASSIN2 RTE**: Textual entailment / natural language inference ([dataset](https://huggingface.co/datasets/eduagarcia/portuguese_benchmark/viewer/assin2-rte)).
- **ASSIN2 STS**: Semantic textual similarity ([dataset](https://huggingface.co/datasets/eduagarcia/portuguese_benchmark/viewer/assin2-sts)).
- **FAQUAD NLI**: Entailment task based on Portuguese reading comprehension ([dataset](https://huggingface.co/datasets/ruanchaves/faquad-nli)).
- **HateBR**: Abusive language detection in Brazilian Portuguese social media ([dataset](https://huggingface.co/datasets/eduagarcia/portuguese_benchmark)).
- **PT Hate Speech**: Hate speech detection in Portuguese tweets ([dataset](https://huggingface.co/datasets/eduagarcia/portuguese_benchmark/viewer/Portuguese_Hate_Speech_binary)).
- **TweetSentBR**: Sentiment analysis on Brazilian Portuguese tweets ([dataset](https://huggingface.co/datasets/eduagarcia/tweetsentbr_fewshot)).
- **ARC Challenge**: Multiple-choice grade-school science questions (Portuguese translation) ([dataset](https://huggingface.co/datasets/Polygl0t/ARC-poly)).
- **ASSIN2 ENT**: Textual entailment (natural language inference), not generative ([dataset](https://huggingface.co/datasets/nilc-nlp/assin2)).
- **ASSIN2 PAR**: Paraphrase detection from the ASSIN2 dataset ([dataset](https://huggingface.co/datasets/nilc-nlp/assin2)).
- **BELEBELE**: Multilingual reading comprehension (Portuguese subset) ([dataset](https://huggingface.co/datasets/facebook/belebele)).
- **CALAME**: Predict the last word of a passage — Portuguese version (similar to LAMBADA) ([dataset](https://huggingface.co/datasets/Polygl0t/CALAME-PT)).
- **Global PIQA**: Physical commonsense reasoning (Brazilian Portuguese subset) ([dataset](https://huggingface.co/datasets/mrlbenchmarks/global-piqa-nonparallel)).
- **HellaSwag**: Commonsense inference (Portuguese translation) ([dataset](https://huggingface.co/datasets/Polygl0t/Hellaswag-poly)).
- **LAMBADA**: Predict the last word of a passage (Portuguese translation) ([dataset](https://huggingface.co/datasets/Polygl0t/LAMBADA-poly)).
- **MMLU**: Multitask language understanding (Portuguese translation) ([dataset](https://huggingface.co/datasets/Polygl0t/MMLU-poly)).
## Usage and Purpose
- **Benchmark Analysis:** Track how model performance evolves during pretraining.
- **Evaluation Research:** Assess the reliability and signal quality of different benchmarks.
- **Model Comparison:** Compare Portuguese language understanding across different LLMs and training regimes.
**E.g., SmolLM3-3B performance on the ENEM benchmark across checkpoints:**

Other plots can be found in the [`.plots`](https://huggingface.co/datasets/Polygl0t/portuguese-eval-logs-olmo2-smollm3/tree/main/.plots) directory.
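One simple way to probe a benchmark's signal quality is to check how consistently its score rises with training step, for instance with Spearman's rank correlation between `Step` and the score. The sketch below is one possible proxy and not necessarily the analysis used by the dataset authors; `ranks` and `spearman` are hypothetical helpers written in plain Python so that no extra dependency is needed.

```python
def ranks(values):
    """Return 1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one series is constant
    return cov / (var_x * var_y) ** 0.5
```

Given the `Step` values and one benchmark column of a split, a correlation near 1 suggests a score that grows steadily during pretraining, while a value near 0 suggests that the benchmark is mostly noise at this model scale.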
### Citation Information
```latex
@misc{correa2026tucano2cool,
title={{Tucano 2 Cool: Better Open Source LLMs for Portuguese}},
author={Nicholas Kluge Corr{\^e}a and Aniket Sen and Shiza Fatimah and Sophia Falk and Lennard Landgraf and Julia Kastner and Lucie Flek},
year={2026},
eprint={2603.03543},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2603.03543},
}
```
### Acknowledgments
Polyglot is a project funded by the Federal Ministry of Education and Research (BMBF) and the Ministry of Culture and Science of the State of North Rhine-Westphalia (MWK) as part of TRA Sustainable Futures (University of Bonn) and the Excellence Strategy of the federal and state governments.
We also gratefully acknowledge access to the [Marvin cluster](https://www.hpc.uni-bonn.de/en/systems/marvin), hosted by the [University of Bonn](https://www.uni-bonn.de/en), and the support provided by its High Performance Computing & Analytics Lab.
## License
All data in this dataset is licensed under the [Apache License 2.0](LICENSE).
Provider:
Polygl0t



