HPLT/2505-deduplication-evals

Name: HPLT/2505-deduplication-evals
Creator: HPLT
Published: 2025-11-24 15:52:20
License: Apache 2.0

Source: Hugging Face. Updated: 2025-11-24. Indexed: 2026-01-03.

Download link:

https://hf-mirror.com/datasets/HPLT/2505-deduplication-evals


Description:

---
license: apache-2.0
dataset_info:
- config_name: cat_Latn
  features:
  - name: corpus
    dtype: string
  - name: category
    dtype: string
  - name: dataset
    dtype: string
  - name: task
    dtype: string
  - name: prompt
    dtype: string
  - name: model
    dtype: string
  - name: ckpt_num
    dtype: int64
  - name: score
    dtype: float64
  splits:
  - name: results
    num_bytes: 1786251
    num_examples: 8004
  download_size: 99314
  dataset_size: 1786251
- config_name: ces_Latn
  features:
  - name: corpus
    dtype: string
  - name: category
    dtype: string
  - name: dataset
    dtype: string
  - name: task
    dtype: string
  - name: prompt
    dtype: string
  - name: model
    dtype: string
  - name: ckpt_num
    dtype: int64
  - name: score
    dtype: float64
  splits:
  - name: results
    num_bytes: 2030164
    num_examples: 8816
  download_size: 81012
  dataset_size: 2030164
- config_name: eus_Latn
  features:
  - name: corpus
    dtype: string
  - name: category
    dtype: string
  - name: dataset
    dtype: string
  - name: task
    dtype: string
  - name: prompt
    dtype: string
  - name: model
    dtype: string
  - name: ckpt_num
    dtype: int64
  - name: score
    dtype: float64
  splits:
  - name: results
    num_bytes: 1598144
    num_examples: 5568
  download_size: 58635
  dataset_size: 1598144
- config_name: fin_Latn
  features:
  - name: corpus
    dtype: string
  - name: category
    dtype: string
  - name: dataset
    dtype: string
  - name: task
    dtype: string
  - name: prompt
    dtype: string
  - name: model
    dtype: string
  - name: ckpt_num
    dtype: int64
  - name: score
    dtype: float64
  splits:
  - name: results
    num_bytes: 2861252
    num_examples: 11600
  download_size: 134137
  dataset_size: 2861252
- config_name: fra_Latn
  features:
  - name: corpus
    dtype: string
  - name: category
    dtype: string
  - name: dataset
    dtype: string
  - name: task
    dtype: string
  - name: prompt
    dtype: string
  - name: model
    dtype: string
  - name: ckpt_num
    dtype: int64
  - name: score
    dtype: float64
  splits:
  - name: results
    num_bytes: 770167
    num_examples: 3129
  download_size: 30495
  dataset_size: 770167
- config_name: glg_Latn
  features:
  - name: corpus
    dtype: string
  - name: category
    dtype: string
  - name: dataset
    dtype: string
  - name: task
    dtype: string
  - name: prompt
    dtype: string
  - name: model
    dtype: string
  - name: ckpt_num
    dtype: int64
  - name: score
    dtype: float64
  splits:
  - name: results
    num_bytes: 784095
    num_examples: 3480
  download_size: 34565
  dataset_size: 784095
- config_name: nor_Latn
  features:
  - name: corpus
    dtype: string
  - name: category
    dtype: string
  - name: dataset
    dtype: string
  - name: task
    dtype: string
  - name: prompt
    dtype: string
  - name: model
    dtype: string
  - name: ckpt_num
    dtype: int64
  - name: score
    dtype: float64
  splits:
  - name: results
    num_bytes: 2243240
    num_examples: 8120
  download_size: 99819
  dataset_size: 2243240
- config_name: spa_Latn
  features:
  - name: corpus
    dtype: string
  - name: category
    dtype: string
  - name: dataset
    dtype: string
  - name: task
    dtype: string
  - name: prompt
    dtype: string
  - name: model
    dtype: string
  - name: ckpt_num
    dtype: int64
  - name: score
    dtype: float64
  splits:
  - name: results
    num_bytes: 1263252
    num_examples: 5568
  download_size: 60432
  dataset_size: 1263252
- config_name: ukr_Cyrl
  features:
  - name: corpus
    dtype: string
  - name: category
    dtype: string
  - name: dataset
    dtype: string
  - name: task
    dtype: string
  - name: prompt
    dtype: string
  - name: model
    dtype: string
  - name: ckpt_num
    dtype: int64
  - name: score
    dtype: float64
  splits:
  - name: results
    num_bytes: 575270
    num_examples: 1972
  download_size: 17386
  dataset_size: 575270
configs:
- config_name: cat_Latn
  data_files:
  - split: results
    path: cat_Latn/results-*
- config_name: ces_Latn
  data_files:
  - split: results
    path: ces_Latn/results-*
- config_name: eus_Latn
  data_files:
  - split: results
    path: eus_Latn/results-*
- config_name: fin_Latn
  data_files:
  - split: results
    path: fin_Latn/results-*
- config_name: fra_Latn
  data_files:
  - split: results
    path: fra_Latn/results-*
- config_name: glg_Latn
  data_files:
  - split: results
    path: glg_Latn/results-*
- config_name: nor_Latn
  data_files:
  - split: results
    path: nor_Latn/results-*
- config_name: spa_Latn
  data_files:
  - split: results
    path: spa_Latn/results-*
- config_name: ukr_Cyrl
  data_files:
  - split: results
    path: ukr_Cyrl/results-*
language:
- es
- fr
- gl
- eu
- nb
- nn
- ca
- cs
- fi
- uk
---

# HPLT 3.0: Deduplication Strategy Comparison Results

### Dataset Description

This dataset contains fine-grained results from our HPLT 3.0 pre-release evaluations, which compare data deduplication strategies for the pre-HPLT 3.0 corpora against the previous HPLT 2.0 version. We compare the following strategies to guide our design choices and to guard against data quality regression relative to HPLT 2.0: **pre-HPLT 3.0 CD** (per-crawl deduplication), **pre-HPLT 3.0 GD** (global deduplication), and **pre-HPLT 3.0 GDR** (global deduplication & rehydration).

We pretrain 2.2B Llama-style decoder models on 30B tokens for each selected language and evaluate them using [HPLT-E](https://github.com/hplt-project/hplt-e/tree/main), a multilingual evaluation framework for comprehensive multi-prompt *k*-shot evaluation across 124 tasks and 500+ prompts in nine typologically diverse languages: Spanish (`spa_Latn`), French (`fra_Latn`), Czech (`ces_Latn`), Ukrainian (`ukr_Cyrl`), Finnish (`fin_Latn`), Catalan (`cat_Latn`), Galician (`glg_Latn`), Basque (`eus_Latn`), and Norwegian (Bokmål and Nynorsk; `nor_Latn`).

- **Curated by:** [High Performance Language Technologies (HPLT)](https://hplt-project.org)
- **Languages:** Spanish, French, Czech, Ukrainian, Finnish, Catalan, Galician, Basque, Norwegian Bokmål, and Norwegian Nynorsk
- **Paper:** [arxiv.org/abs/2511.01066](https://arxiv.org/abs/2511.01066)
- **Repository:** [github.com/hplt-project/hplt-e](https://github.com/hplt-project/hplt-e/tree/main)
- **License:** Apache 2.0

Please find more details in our paper and GitHub repository.

## Uses

This dataset is intended for reproducibility and research purposes.
The example below shows how to access the results:

```python
from datasets import load_dataset

# Load the Spanish results and convert them to a pandas DataFrame.
dataset = load_dataset("HPLT/2505-deduplication-evals", "spa_Latn", split="results").to_pandas()
```

## Dataset Structure

### Dataset Instances

Each dataset instance looks as follows:

```python
{
    'corpus': 'HPLT 2.0',
    'category': 'Commonsense reasoning',
    'dataset': 'xstorycloze_es',
    'task': 'xstorycloze_es_p2',
    'prompt': "{{[input_sentence_1, input_sentence_2, input_sentence_3, input_sentence_4] | join(' ') }}\n¿Qué ocurre después?\nA. {{ sentence_quiz1}} \nB. {{sentence_quiz2}}\nRespuesta:",
    'model': '1B',
    'ckpt_num': 500,
    'score': 52.813
}
```

### Dataset Fields

- `corpus`: corpus name (`pre-HPLT 3.0 CD`, `pre-HPLT 3.0 GD`, `pre-HPLT 3.0 GDR`, `HPLT 2.0`)
- `category`: task category
- `dataset`: evaluation dataset name
- `task`: evaluation task (refers to a specific prompt)
- `prompt`: prompt used for evaluation
- `model`: number of pretraining tokens, in billions (e.g. `1B`)
- `ckpt_num`: numeric checkpoint identifier for `model`
- `score`: standard metric performance score

## Cite Us

```
@article{oepen2025hplt,
  title={HPLT\~{} 3.0: Very Large-Scale Multilingual Resources for LLM and MT. Mono-and Bi-lingual Data, Multilingual Evaluation, and Pre-Trained Models},
  author={Oepen, Stephan and Arefev, Nikolay and Aulamo, Mikko and Ba{\~n}{\'o}n, Marta and Buljan, Maja and Burchell, Laurie and Charpentier, Lucas and Chen, Pinzhen and Fedorova, Mariya and de Gibert, Ona and others},
  journal={arXiv preprint arXiv:2511.01066},
  year={2025}
}
```

## Contact Us

* Vladislav Mikhailov [vladism@ifi.uio.no](mailto:vladism@ifi.uio.no)
* Stephan Oepen [oe@ifi.uio.no](mailto:oe@ifi.uio.no)
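
The fields described under Dataset Fields lend themselves to simple per-corpus comparisons. The following is a minimal sketch only, not the aggregation used in the paper: the helper name `mean_score_by_corpus` is hypothetical, the rows are toy values made up for illustration, and averaging `score` across tasks assumes the metrics are comparable, which may not hold in general.

```python
from collections import defaultdict
from statistics import mean


def mean_score_by_corpus(rows, ckpt_num=None):
    """Average `score` per `corpus`, optionally restricted to one `ckpt_num`.

    `rows` is any iterable of dicts that use the field names from this card.
    """
    scores = defaultdict(list)
    for row in rows:
        if ckpt_num is not None and row["ckpt_num"] != ckpt_num:
            continue
        scores[row["corpus"]].append(row["score"])
    return {corpus: mean(values) for corpus, values in scores.items()}


# Toy rows with made-up scores (NOT actual results); task names are placeholders.
toy_rows = [
    {"corpus": "HPLT 2.0", "task": "task_a_p1", "ckpt_num": 500, "score": 50.0},
    {"corpus": "HPLT 2.0", "task": "task_a_p2", "ckpt_num": 500, "score": 54.0},
    {"corpus": "pre-HPLT 3.0 GD", "task": "task_a_p1", "ckpt_num": 500, "score": 51.0},
    {"corpus": "pre-HPLT 3.0 GD", "task": "task_a_p2", "ckpt_num": 1000, "score": 60.0},
]

print(mean_score_by_corpus(toy_rows, ckpt_num=500))
# {'HPLT 2.0': 52.0, 'pre-HPLT 3.0 GD': 51.0}
```

Because iterating over a `datasets.Dataset` yields one dict per row, the object returned by `load_dataset("HPLT/2505-deduplication-evals", "spa_Latn", split="results")` could be passed as `rows` directly. Filtering by `category` or `dataset` first would likely give more meaningful averages.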

Provider:

HPLT
