HPLT/2505-deduplication-evals
Source: Hugging Face | Updated 2025-11-24 | Indexed 2026-01-03
Download link: https://hf-mirror.com/datasets/HPLT/2505-deduplication-evals
Dataset summary:
---
license: apache-2.0
dataset_info:
- config_name: cat_Latn
  features:
  - name: corpus
    dtype: string
  - name: category
    dtype: string
  - name: dataset
    dtype: string
  - name: task
    dtype: string
  - name: prompt
    dtype: string
  - name: model
    dtype: string
  - name: ckpt_num
    dtype: int64
  - name: score
    dtype: float64
  splits:
  - name: results
    num_bytes: 1786251
    num_examples: 8004
  download_size: 99314
  dataset_size: 1786251
- config_name: ces_Latn
  features:
  - name: corpus
    dtype: string
  - name: category
    dtype: string
  - name: dataset
    dtype: string
  - name: task
    dtype: string
  - name: prompt
    dtype: string
  - name: model
    dtype: string
  - name: ckpt_num
    dtype: int64
  - name: score
    dtype: float64
  splits:
  - name: results
    num_bytes: 2030164
    num_examples: 8816
  download_size: 81012
  dataset_size: 2030164
- config_name: eus_Latn
  features:
  - name: corpus
    dtype: string
  - name: category
    dtype: string
  - name: dataset
    dtype: string
  - name: task
    dtype: string
  - name: prompt
    dtype: string
  - name: model
    dtype: string
  - name: ckpt_num
    dtype: int64
  - name: score
    dtype: float64
  splits:
  - name: results
    num_bytes: 1598144
    num_examples: 5568
  download_size: 58635
  dataset_size: 1598144
- config_name: fin_Latn
  features:
  - name: corpus
    dtype: string
  - name: category
    dtype: string
  - name: dataset
    dtype: string
  - name: task
    dtype: string
  - name: prompt
    dtype: string
  - name: model
    dtype: string
  - name: ckpt_num
    dtype: int64
  - name: score
    dtype: float64
  splits:
  - name: results
    num_bytes: 2861252
    num_examples: 11600
  download_size: 134137
  dataset_size: 2861252
- config_name: fra_Latn
  features:
  - name: corpus
    dtype: string
  - name: category
    dtype: string
  - name: dataset
    dtype: string
  - name: task
    dtype: string
  - name: prompt
    dtype: string
  - name: model
    dtype: string
  - name: ckpt_num
    dtype: int64
  - name: score
    dtype: float64
  splits:
  - name: results
    num_bytes: 770167
    num_examples: 3129
  download_size: 30495
  dataset_size: 770167
- config_name: glg_Latn
  features:
  - name: corpus
    dtype: string
  - name: category
    dtype: string
  - name: dataset
    dtype: string
  - name: task
    dtype: string
  - name: prompt
    dtype: string
  - name: model
    dtype: string
  - name: ckpt_num
    dtype: int64
  - name: score
    dtype: float64
  splits:
  - name: results
    num_bytes: 784095
    num_examples: 3480
  download_size: 34565
  dataset_size: 784095
- config_name: nor_Latn
  features:
  - name: corpus
    dtype: string
  - name: category
    dtype: string
  - name: dataset
    dtype: string
  - name: task
    dtype: string
  - name: prompt
    dtype: string
  - name: model
    dtype: string
  - name: ckpt_num
    dtype: int64
  - name: score
    dtype: float64
  splits:
  - name: results
    num_bytes: 2243240
    num_examples: 8120
  download_size: 99819
  dataset_size: 2243240
- config_name: spa_Latn
  features:
  - name: corpus
    dtype: string
  - name: category
    dtype: string
  - name: dataset
    dtype: string
  - name: task
    dtype: string
  - name: prompt
    dtype: string
  - name: model
    dtype: string
  - name: ckpt_num
    dtype: int64
  - name: score
    dtype: float64
  splits:
  - name: results
    num_bytes: 1263252
    num_examples: 5568
  download_size: 60432
  dataset_size: 1263252
- config_name: ukr_Cyrl
  features:
  - name: corpus
    dtype: string
  - name: category
    dtype: string
  - name: dataset
    dtype: string
  - name: task
    dtype: string
  - name: prompt
    dtype: string
  - name: model
    dtype: string
  - name: ckpt_num
    dtype: int64
  - name: score
    dtype: float64
  splits:
  - name: results
    num_bytes: 575270
    num_examples: 1972
  download_size: 17386
  dataset_size: 575270
configs:
- config_name: cat_Latn
  data_files:
  - split: results
    path: cat_Latn/results-*
- config_name: ces_Latn
  data_files:
  - split: results
    path: ces_Latn/results-*
- config_name: eus_Latn
  data_files:
  - split: results
    path: eus_Latn/results-*
- config_name: fin_Latn
  data_files:
  - split: results
    path: fin_Latn/results-*
- config_name: fra_Latn
  data_files:
  - split: results
    path: fra_Latn/results-*
- config_name: glg_Latn
  data_files:
  - split: results
    path: glg_Latn/results-*
- config_name: nor_Latn
  data_files:
  - split: results
    path: nor_Latn/results-*
- config_name: spa_Latn
  data_files:
  - split: results
    path: spa_Latn/results-*
- config_name: ukr_Cyrl
  data_files:
  - split: results
    path: ukr_Cyrl/results-*
language:
- es
- fr
- gl
- eu
- nb
- nn
- ca
- cs
- fi
- uk
---
# HPLT 3.0: Deduplication Strategy Comparison Results
### Dataset Description
This dataset contains fine-grained results from our HPLT 3.0 pre-release evaluations, which compare data deduplication strategies for the pre-HPLT 3.0 corpora against the previous HPLT 2.0 release. To guide our design choices and guard against data quality regressions relative to HPLT 2.0, we compare three strategies: **pre-HPLT 3.0 CD** (per-crawl deduplication), **pre-HPLT 3.0 GD** (global deduplication), and **pre-HPLT 3.0 GDR** (global deduplication and rehydration). For each selected language, we pretrain 2.2B Llama-style decoder models on 30B tokens and evaluate them using [HPLT-E](https://github.com/hplt-project/hplt-e/tree/main), a multilingual evaluation framework for comprehensive multi-prompt *k*-shot evaluation across 124 tasks and 500+ prompts in nine typologically diverse languages: Spanish (`spa_Latn`), French (`fra_Latn`), Czech (`ces_Latn`), Ukrainian (`ukr_Cyrl`), Finnish (`fin_Latn`), Catalan (`cat_Latn`), Galician (`glg_Latn`), Basque (`eus_Latn`), and Norwegian (Bokmål and Nynorsk; `nor_Latn`).
- **Curated by:** [High Performance Language Technologies (HPLT)](https://hplt-project.org)
- **Languages:** Spanish, French, Czech, Ukrainian, Finnish, Catalan, Galician, Basque, Norwegian Bokmål, and Norwegian Nynorsk
- **Paper:** [arxiv.org/abs/2511.01066](https://arxiv.org/abs/2511.01066)
- **Repository:** [github.com/hplt-project/hplt-e](https://github.com/hplt-project/hplt-e/tree/main)
- **License:** Apache 2.0
Please find more details in our paper and GitHub repository.
## Uses
This dataset is intended for reproducibility and research purposes. The following example shows how to load the results for one language:
```python
from datasets import load_dataset

# Load the Spanish results split and convert it to a pandas DataFrame.
dataset = load_dataset("HPLT/2505-deduplication-evals", "spa_Latn", split="results").to_pandas()
```
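To compare languages, the same call can be repeated for each configuration. The sketch below is one possible way to do that, not an official loader: the helper name `load_all_results` and its `loader` parameter are illustrative and are not part of this dataset or of the `datasets` library. It assumes the `datasets` package is installed and that iterating over a loaded split yields one dict per row.

```python
from typing import Callable, Dict, List, Optional, Sequence

REPO_ID = "HPLT/2505-deduplication-evals"

# Configuration names as listed in this dataset card.
LANGUAGES = (
    "cat_Latn", "ces_Latn", "eus_Latn", "fin_Latn", "fra_Latn",
    "glg_Latn", "nor_Latn", "spa_Latn", "ukr_Cyrl",
)


def load_all_results(
    languages: Sequence[str] = LANGUAGES,
    loader: Optional[Callable[..., object]] = None,
) -> Dict[str, List[dict]]:
    """Return {language: list of row dicts} for the requested configurations.

    `loader` is a hypothetical hook for swapping out `datasets.load_dataset`
    (for example in offline tests); by default the real function is used.
    """
    if loader is None:
        # Imported lazily so the module can be imported without `datasets`.
        from datasets import load_dataset
        loader = load_dataset
    results: Dict[str, List[dict]] = {}
    for lang in languages:
        split = loader(REPO_ID, lang, split="results")
        # Tag each row with its language so the tables can be merged later.
        results[lang] = [dict(row, language=lang) for row in split]
    return results
```

Calling `load_all_results()` with no arguments downloads all nine configurations; passing a shorter list such as `["spa_Latn", "fra_Latn"]` restricts the download to those languages.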
## Dataset Structure
### Dataset Instances
Each dataset instance looks as follows:
```python
{
    'corpus': 'HPLT 2.0',
    'category': 'Commonsense reasoning',
    'dataset': 'xstorycloze_es',
    'task': 'xstorycloze_es_p2',
    'prompt': "{{[input_sentence_1, input_sentence_2, input_sentence_3, input_sentence_4] | join(' ') }}\n¿Qué ocurre después?\nA. {{ sentence_quiz1}} \nB. {{sentence_quiz2}}\nRespuesta:",
    'model': '1B',
    'ckpt_num': 500,
    'score': 52.813
}
```
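Because every instance carries a `ckpt_num`, the rows for one corpus and task can be ordered into a score trajectory over pretraining. The following is a minimal sketch under the assumption that rows are available as plain dicts with the fields shown above; `score_curve` is an illustrative helper name, and the example rows are made up rather than taken from the dataset.

```python
from typing import Iterable, List, Tuple


def score_curve(rows: Iterable[dict], corpus: str, task: str) -> List[Tuple[int, float]]:
    """Return (ckpt_num, score) pairs for one corpus and task, ordered by checkpoint."""
    points = [
        (row["ckpt_num"], row["score"])
        for row in rows
        if row["corpus"] == corpus and row["task"] == task
    ]
    return sorted(points)


# Made-up rows for illustration only; these are not values from the dataset.
example_rows = [
    {"corpus": "HPLT 2.0", "task": "xstorycloze_es_p2", "ckpt_num": 1000, "score": 2.0},
    {"corpus": "HPLT 2.0", "task": "xstorycloze_es_p2", "ckpt_num": 500, "score": 1.0},
    {"corpus": "pre-HPLT 3.0 GD", "task": "xstorycloze_es_p2", "ckpt_num": 500, "score": 3.0},
]
print(score_curve(example_rows, "HPLT 2.0", "xstorycloze_es_p2"))
# prints [(500, 1.0), (1000, 2.0)]
```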
### Dataset Fields
- `corpus`: corpus name (`pre-HPLT 3.0 CD`, `pre-HPLT 3.0 GD`, `pre-HPLT 3.0 GDR`, `HPLT 2.0`)
- `category`: task category
- `dataset`: evaluation dataset name
- `task`: evaluation task (refers to a specific prompt)
- `prompt`: prompt used for evaluation
- `model`: number of pretraining tokens seen by the model, in billions (e.g., `1B`)
- `ckpt_num`: numeric checkpoint identifier corresponding to `model`
- `score`: standard metric performance score
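Since each `task` is one prompt variant of a `dataset`, a rough per-corpus summary can be obtained by keeping the best prompt per dataset and averaging over datasets. The sketch below is one simple aggregation chosen for illustration, not necessarily the one used in the paper or in HPLT-E. It assumes that higher scores are better and that scores are comparable across datasets, which may not hold for every metric; `mean_best_prompt_score` is a hypothetical helper name.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple


def mean_best_prompt_score(rows: Iterable[dict], ckpt_num: int) -> Dict[str, float]:
    """For one checkpoint, keep the best-scoring prompt per (corpus, dataset),
    then average those maxima over datasets for each corpus."""
    best: Dict[Tuple[str, str], float] = {}  # (corpus, dataset) -> max score over prompts
    for row in rows:
        if row["ckpt_num"] != ckpt_num:
            continue
        key = (row["corpus"], row["dataset"])
        best[key] = max(row["score"], best.get(key, float("-inf")))

    per_corpus: Dict[str, List[float]] = defaultdict(list)
    for (corpus, _dataset), score in best.items():
        per_corpus[corpus].append(score)
    return {corpus: mean(scores) for corpus, scores in per_corpus.items()}
```

Applied to rows from a single language at a fixed `ckpt_num`, this yields one number per corpus (`pre-HPLT 3.0 CD`, `pre-HPLT 3.0 GD`, `pre-HPLT 3.0 GDR`, `HPLT 2.0`), which is convenient for a quick side-by-side look before a more careful per-category analysis.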
## Cite Us
```bibtex
@article{oepen2025hplt,
  title={HPLT 3.0: Very Large-Scale Multilingual Resources for LLM and MT. Mono- and Bi-lingual Data, Multilingual Evaluation, and Pre-Trained Models},
  author={Oepen, Stephan and Arefev, Nikolay and Aulamo, Mikko and Ba{\~n}{\'o}n, Marta and Buljan, Maja and Burchell, Laurie and Charpentier, Lucas and Chen, Pinzhen and Fedorova, Mariya and de Gibert, Ona and others},
  journal={arXiv preprint arXiv:2511.01066},
  year={2025}
}
```
## Contact Us
* Vladislav Mikhailov [vladism@ifi.uio.no](mailto:vladism@ifi.uio.no)
* Stephan Oepen [oe@ifi.uio.no](mailto:oe@ifi.uio.no)
Provided by: HPLT



