Polygl0t/portuguese-eval-logs-olmo2-smollm3

Hugging Face, updated 2026-03-05, indexed 2026-04-05
Download link:
https://hf-mirror.com/datasets/Polygl0t/portuguese-eval-logs-olmo2-smollm3
Resource description:
---
dataset_info:
  features:
  - name: Checkpoint
    dtype: string
  - name: Step
    dtype: int64
  - name: ASSIN2 RTE
    dtype: float64
  - name: ASSIN2 STS
    dtype: float64
  - name: BLUEX
    dtype: float64
  - name: ENEM
    dtype: float64
  - name: FAQUAD NLI
    dtype: float64
  - name: HateBR
    dtype: float64
  - name: OAB
    dtype: float64
  - name: PT Hate Speech
    dtype: float64
  - name: TweetSentBR
    dtype: float64
  - name: ARC Challenge
    dtype: float64
  - name: ASSIN2 ENT
    dtype: float64
  - name: ASSIN2 PAR
    dtype: float64
  - name: BELEBELE
    dtype: float64
  - name: CALAME
    dtype: float64
  - name: Global PIQA
    dtype: float64
  - name: HellaSwag
    dtype: float64
  - name: LAMBADA
    dtype: float64
  - name: MMLU
    dtype: float64
  splits:
  - name: olmo2_1b
    num_bytes: 7039
    num_examples: 38
  - name: olmo2_7b
    num_bytes: 7915
    num_examples: 43
  - name: smollm3_3b
    num_bytes: 21346
    num_examples: 122
  download_size: 56104
  dataset_size: 36300
configs:
- config_name: default
  data_files:
  - split: smollm3_3b
    path: data/smollm3_3b-*
  - split: olmo2_1b
    path: data/olmo2_1b-*
  - split: olmo2_7b
    path: data/olmo2_7b-*
license: apache-2.0
language:
- pt
---

# Evaluation Logs on Portuguese Benchmarks for OLMo-2 and SmolLM3

These logs contain benchmark results across a suite of Portuguese-language tasks. They record the performance of three models at different checkpoints throughout their pretraining runs:

- [SmolLM3](https://huggingface.co/HuggingFaceTB/SmolLM3-3B-checkpoints)
- [OLMo-2-0425-1B](https://huggingface.co/allenai/OLMo-2-0425-1B)
- [OLMo-2-1124-7B](https://huggingface.co/allenai/OLMo-2-1124-7B)

## Splits

Each split (`smollm3_3b`, `olmo2_1b`, `olmo2_7b`) contains one row per model checkpoint and one column per benchmark score (e.g., ASSIN2 RTE, ENEM, BLUEX, OAB).

## Data Format

The `Checkpoint` column indicates the branch from which the checkpoint was taken (e.g., `stage1-step-40000`). The `Step` column indicates the training step at which the checkpoint was saved.
The remaining columns show the model's score on each benchmark.

```json
{
  "Checkpoint": "stage1-step-40000",
  "Step": 40000,
  "ASSIN2 RTE": 0.5369919728357186,
  "ASSIN2 STS": 0.1356046511596823,
  "BLUEX": 0.2002781641168289,
  "ENEM": 0.1980405878236529,
  "FAQUAD NLI": 0.5373977569778228,
  "HateBR": 0.5875147596226018,
  "OAB": 0.2323462414578587,
  "PT Hate Speech": 0.5555178900597524,
  "TweetSentBR": 0.3303917299738176,
  "ARC Challenge": 0.288034188034188,
  "ASSIN2 ENT": 0.58075,
  "ASSIN2 PAR": 0.6365,
  "BELEBELE": 0.23,
  "CALAME": 0.5130057803468208,
  "Global PIQA": 0.65,
  "HellaSwag": 0.3650449669519991,
  "LAMBADA": 0.4723462060935377,
  "MMLU": 0.2284599219453617
}
```

Consecutive checkpoints are roughly 100 billion tokens apart. Because checkpoint saving frequency and batch size differ between models, the actual token count between checkpoints may vary slightly.

## How to Use

```python
from datasets import load_dataset

# Load the SmolLM3 split
ds = load_dataset("Polygl0t/portuguese-eval-logs-olmo2-smollm3", split="smollm3_3b")
```

## Benchmarks

The benchmarks are sourced from [lm-evaluation-harness](https://github.com/Polygl0t/lm-evaluation-harness) (branch: `polyglot_harness_portuguese`) and [lm-evaluation-harness-pt](https://github.com/eduagarcia/lm-evaluation-harness-pt), which provide a standardized set of Portuguese-language tasks for LLM evaluation.

The following benchmarks are included in this log:

- **ENEM**: Brazilian high-school exam, Q&A format ([dataset](https://huggingface.co/datasets/eduagarcia/enem_challenge)).
- **BLUEX**: University entrance exam questions (Unicamp/Fuvest), Q&A format ([dataset](https://huggingface.co/datasets/eduagarcia-temp/BLUEX_without_images)).
- **OAB Exams**: Brazilian Bar Association exam questions, Q&A format ([dataset](https://huggingface.co/datasets/eduagarcia/oab_exams)).
- **ASSIN2 RTE**: Textual entailment / natural language inference ([dataset](https://huggingface.co/datasets/eduagarcia/portuguese_benchmark/viewer/assin2-rte)).
- **ASSIN2 STS**: Semantic textual similarity ([dataset](https://huggingface.co/datasets/eduagarcia/portuguese_benchmark/viewer/assin2-sts)).
- **FAQUAD NLI**: Entailment task based on Portuguese reading comprehension ([dataset](https://huggingface.co/datasets/ruanchaves/faquad-nli)).
- **HateBR**: Abusive language detection in Brazilian Portuguese social media ([dataset](https://huggingface.co/datasets/eduagarcia/portuguese_benchmark)).
- **PT Hate Speech**: Hate speech detection in Portuguese tweets ([dataset](https://huggingface.co/datasets/eduagarcia/portuguese_benchmark/viewer/Portuguese_Hate_Speech_binary)).
- **TweetSentBR**: Sentiment analysis on Brazilian Portuguese tweets ([dataset](https://huggingface.co/datasets/eduagarcia/tweetsentbr_fewshot)).
- **ARC Challenge**: Multiple-choice grade-school science questions (Portuguese translation) ([dataset](https://huggingface.co/datasets/Polygl0t/ARC-poly)).
- **ASSIN2 ENT**: Textual entailment (natural language inference), not generative ([dataset](https://huggingface.co/datasets/nilc-nlp/assin2)).
- **ASSIN2 PAR**: Paraphrase detection from the ASSIN2 dataset ([dataset](https://huggingface.co/datasets/nilc-nlp/assin2)).
- **BELEBELE**: Multilingual reading comprehension (Portuguese subset) ([dataset](https://huggingface.co/datasets/facebook/belebele)).
- **CALAME**: Predict the last word of a passage, Portuguese version (similar to LAMBADA) ([dataset](https://huggingface.co/datasets/Polygl0t/CALAME-PT)).
- **Global PIQA**: Physical commonsense reasoning (Brazilian Portuguese subset) ([dataset](https://huggingface.co/datasets/mrlbenchmarks/global-piqa-nonparallel)).
- **HellaSwag**: Commonsense inference (Portuguese translation) ([dataset](https://huggingface.co/datasets/Polygl0t/Hellaswag-poly)).
- **LAMBADA**: Predict the last word of a passage (Portuguese translation) ([dataset](https://huggingface.co/datasets/Polygl0t/LAMBADA-poly)).
- **MMLU**: Multitask language understanding (Portuguese translation) ([dataset](https://huggingface.co/datasets/Polygl0t/MMLU-poly)).

## Usage and Purpose

- **Benchmark Analysis:** Track how model performance evolves during pretraining.
- **Evaluation Research:** Assess the reliability and signal quality of different benchmarks.
- **Model Comparison:** Compare Portuguese language understanding across different LLMs and training regimes.

**For example, SmolLM3-3B performance on the ENEM benchmark across checkpoints:**

![SmolLM3-3B ENEM Performance](./.plots/smolLM3_ENEM.png)

Other plots can be found in the [`.plots`](https://huggingface.co/datasets/Polygl0t/portuguese-eval-logs-olmo2-smollm3/tree/main/.plots) directory.

### Citation Information

```latex
@misc{correa2026tucano2cool,
  title={{Tucano 2 Cool: Better Open Source LLMs for Portuguese}},
  author={Nicholas Kluge Corr{\^e}a and Aniket Sen and Shiza Fatimah and Sophia Falk and Lennard Landgraf and Julia Kastner and Lucie Flek},
  year={2026},
  eprint={2603.03543},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2603.03543},
}
```

### Acknowledgments

Polyglot is a project funded by the Federal Ministry of Education and Research (BMBF) and the Ministry of Culture and Science of the State of North Rhine-Westphalia (MWK) as part of TRA Sustainable Futures (University of Bonn) and the Excellence Strategy of the federal and state governments.

We also gratefully acknowledge access to the [Marvin cluster](https://www.hpc.uni-bonn.de/en/systems/marvin) hosted by the [University of Bonn](https://www.uni-bonn.de/en), along with the support provided by its High Performance Computing & Analytics Lab.

## License

All data in this dataset is licensed under the [Apache License 2.0](LICENSE).
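
## Example: Reading a Score Trajectory

The data format described in this card (one row per checkpoint, with `Checkpoint`, `Step`, and one column per benchmark) lends itself to simple trajectory analysis. The following is a minimal sketch, not part of the dataset's own tooling: the helper names `score_trajectory` and `best_checkpoint` are hypothetical, and skipping rows whose score is `None` is an assumption about how missing values might appear.

```python
from typing import Any, Iterable, Mapping


def score_trajectory(
    rows: Iterable[Mapping[str, Any]], benchmark: str
) -> list[tuple[int, float]]:
    """Return (Step, score) pairs for one benchmark column, ordered by step.

    Rows with a missing (None) score are skipped; this is an assumption,
    not documented behaviour of the dataset.
    """
    pairs = [
        (row["Step"], row[benchmark])
        for row in rows
        if row.get(benchmark) is not None
    ]
    return sorted(pairs)


def best_checkpoint(
    rows: Iterable[Mapping[str, Any]], benchmark: str
) -> tuple[str, int, float]:
    """Return (Checkpoint, Step, score) for the highest-scoring row."""
    scored = [row for row in rows if row.get(benchmark) is not None]
    top = max(scored, key=lambda row: row[benchmark])
    return top["Checkpoint"], top["Step"], top[benchmark]
```

Both helpers accept any iterable of dict-like rows, so they should work on a split loaded with `load_dataset` (iterating a split yields one dict per row), for example `score_trajectory(ds, "ENEM")`, as well as on a plain list of records shaped like the JSON sample in the Data Format section.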

Provided by: Polygl0t