HPLT/2508-wds-evals
Hugging Face · Updated 2025-11-24 · Indexed 2026-01-03
Download link: https://hf-mirror.com/datasets/HPLT/2508-wds-evals
---
license: apache-2.0
dataset_info:
- config_name: fra_Latn
  features:
  - name: corpus
    dtype: string
  - name: category
    dtype: string
  - name: dataset
    dtype: string
  - name: task
    dtype: string
  - name: prompt
    dtype: string
  - name: model
    dtype: string
  - name: ckpt_num
    dtype: int64
  - name: score
    dtype: float64
  splits:
  - name: results
    num_bytes: 909987
    num_examples: 3784
  download_size: 33013
  dataset_size: 909987
- config_name: spa_Latn
  features:
  - name: corpus
    dtype: string
  - name: category
    dtype: string
  - name: dataset
    dtype: string
  - name: task
    dtype: string
  - name: prompt
    dtype: string
  - name: model
    dtype: string
  - name: ckpt_num
    dtype: int64
  - name: score
    dtype: float64
  splits:
  - name: results
    num_bytes: 1511136
    num_examples: 6912
  download_size: 71232
  dataset_size: 1511136
configs:
- config_name: fra_Latn
  data_files:
  - split: results
    path: fra_Latn/results-*
- config_name: spa_Latn
  data_files:
  - split: results
    path: spa_Latn/results-*
language:
- es
- fr
---
# HPLT 3.0: Details on WDS-based Sampling Evaluation Results
### Dataset Description
This dataset contains fine-grained results from the HPLT 3.0 release evaluations, which compare the new HPLT 3.0 corpora sampled at different Web Document Scorer (WDS) thresholds for Spanish and French. We compare three configurations: `Top`, `Random`, and `Bottom`. `Random` sampling is the default approach and draws uniformly from the full corpus, while `Top` and `Bottom` exploit the corpus being sorted by WDS level and sequentially draw 100B training tokens from the corresponding end of the corpus. For each selected language, we pretrain 2.2B Llama-style decoder models on 100B tokens and evaluate them using [HPLT-E](https://github.com/hplt-project/hplt-e/tree/main), a multilingual evaluation framework for comprehensive multi-prompt *k*-shot evaluation across 124 tasks and 500+ prompts in nine typologically diverse languages: Spanish (`spa_Latn`), French (`fra_Latn`), Czech (`ces_Latn`), Ukrainian (`ukr_Cyrl`), Finnish (`fin_Latn`), Catalan (`cat_Latn`), Galician (`glg_Latn`), Basque (`eus_Latn`), and Norwegian (Bokmål and Nynorsk; `nor_Latn`).
- **Curated by:** [High Performance Language Technologies (HPLT)](https://hplt-project.org)
- **Languages:** Spanish and French
- **Paper:** [arxiv.org/abs/2511.01066](https://arxiv.org/abs/2511.01066)
- **Repository:** [github.com/hplt-project/hplt-e](https://github.com/hplt-project/hplt-e/tree/main)
- **License:** Apache 2.0
Please find more details in our paper and GitHub repository.
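The three sampling configurations described above can be illustrated with a minimal sketch. This is not the HPLT pipeline itself: the function name `sample_by_wds`, the record fields `id`, `wds`, and `tokens`, and the toy corpus are all hypothetical, and a seeded shuffle stands in for uniform sampling. It only shows the idea of drawing documents from either end of a WDS-sorted corpus until a token budget is reached.

```python
import random
from typing import Dict, List

Doc = Dict[str, float]  # hypothetical record: {"id": ..., "wds": ..., "tokens": ...}


def sample_by_wds(docs: List[Doc], budget: int, mode: str, seed: int = 0) -> List[Doc]:
    """Select documents until at least `budget` tokens have been collected."""
    if mode == "Top":
        # Highest WDS scores first.
        ordered = sorted(docs, key=lambda d: d["wds"], reverse=True)
    elif mode == "Bottom":
        # Lowest WDS scores first.
        ordered = sorted(docs, key=lambda d: d["wds"])
    elif mode == "Random":
        # Seeded shuffle as a stand-in for uniform sampling over the corpus.
        ordered = list(docs)
        random.Random(seed).shuffle(ordered)
    else:
        raise ValueError(f"unknown mode: {mode}")

    selected, total = [], 0
    for doc in ordered:
        if total >= budget:
            break
        selected.append(doc)
        total += doc["tokens"]
    return selected


# Toy corpus with placeholder scores; each document has 10 tokens.
corpus = [{"id": i, "wds": float(i), "tokens": 10} for i in range(10)]

top = sample_by_wds(corpus, budget=30, mode="Top")
bottom = sample_by_wds(corpus, budget=30, mode="Bottom")
rand = sample_by_wds(corpus, budget=30, mode="Random")
print([d["id"] for d in top], [d["id"] for d in bottom], [d["id"] for d in rand])
```

In the actual experiments the budget is 100B training tokens per configuration; the toy budget of 30 tokens is only there to keep the example small.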
## Uses
This dataset is intended for reproducibility and research purposes. The following example shows how to load the results for one language as a pandas DataFrame:
```python
from datasets import load_dataset
dataset = load_dataset("HPLT/2508-wds-evals", "spa_Latn", split="results").to_pandas()
```
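Once loaded, the results can be summarized with ordinary pandas operations. The sketch below is one possible summary, not the aggregation used in the paper: it averages `score` per `corpus` at the highest `ckpt_num` present in the frame. The helper name `mean_score_by_corpus` is hypothetical, and the toy rows are placeholders with made-up scores; in practice, pass the DataFrame returned by `load_dataset(...).to_pandas()`.

```python
import pandas as pd


def mean_score_by_corpus(df: pd.DataFrame) -> pd.Series:
    """Average `score` per `corpus` at the highest `ckpt_num` in `df`."""
    final = df[df["ckpt_num"] == df["ckpt_num"].max()]
    return final.groupby("corpus")["score"].mean()


# Placeholder rows that follow the dataset schema; the scores are made up.
toy = pd.DataFrame(
    [
        ("Top", "paws_es_p1", 12000, 50.0),
        ("Top", "paws_es_p1", 24000, 60.0),
        ("Top", "paws_es_p2", 24000, 40.0),
        ("Bottom", "paws_es_p1", 24000, 30.0),
        ("Bottom", "paws_es_p2", 24000, 50.0),
        ("Random", "paws_es_p1", 24000, 45.0),
    ],
    columns=["corpus", "task", "ckpt_num", "score"],
)

summary = mean_score_by_corpus(toy)
print(summary)
```

A plain mean mixes tasks whose metrics may not be directly comparable, so it is best treated as a quick sanity check. Filtering by `category` or `dataset` first gives more interpretable numbers.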
## Dataset Structure
### Dataset Instances
Each dataset instance looks as follows:
```python
{
'corpus': 'Bottom',
'category': 'Paraphrase detection',
'dataset': 'paws_es',
'task': 'paws_es_p2',
'prompt': 'Oración 1: {{sentence1}}\nOración 2: {{sentence2}}\nPregunta: ¿Las oraciones 1 y 2 expresan el mismo significado? ¿Sí o no?\nRespuesta:',
'model': '50B',
'ckpt_num': 24000,
'score': 45.35
}
```
### Dataset Fields
- `corpus`: corpus name (`Top`, `Random`, `Bottom`)
- `category`: task category
- `dataset`: evaluation dataset name
- `task`: evaluation task (refers to a specific prompt)
- `prompt`: prompt used for evaluation
- `model`: number of pretraining tokens seen by the checkpoint, in billions (e.g. `50B`)
- `ckpt_num`: numeric checkpoint identifier corresponding to `model`
- `score`: performance score on the task's standard metric
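Because each `task` corresponds to one prompt of a `dataset`, the fields above allow aggregating over prompts. The following sketch groups by `corpus`, `dataset`, and `ckpt_num` and reports the mean and maximum score across prompts. The helper name `aggregate_over_prompts`, the output column names, and the toy rows (with made-up scores) are assumptions for illustration; the paper may aggregate differently.

```python
import pandas as pd


def aggregate_over_prompts(df: pd.DataFrame) -> pd.DataFrame:
    """Mean and max `score` across prompts per (corpus, dataset, ckpt_num)."""
    return (
        df.groupby(["corpus", "dataset", "ckpt_num"])["score"]
        .agg(mean_score="mean", max_score="max", n_scores="count")
        .reset_index()
    )


# Placeholder rows that follow the dataset schema; the scores are made up.
toy = pd.DataFrame(
    [
        ("Bottom", "paws_es", "paws_es_p1", 24000, 40.0),
        ("Bottom", "paws_es", "paws_es_p2", 24000, 50.0),
        ("Top", "paws_es", "paws_es_p1", 24000, 60.0),
        ("Top", "paws_es", "paws_es_p2", 24000, 70.0),
    ],
    columns=["corpus", "dataset", "task", "ckpt_num", "score"],
)

per_dataset = aggregate_over_prompts(toy)
print(per_dataset)
```

Reporting both statistics is a deliberate choice: the mean reflects robustness to prompt wording, while the maximum reflects the best-case prompt.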
## Cite Us
```
@article{oepen2025hplt,
title={HPLT\~{} 3.0: Very Large-Scale Multilingual Resources for LLM and MT. Mono-and Bi-lingual Data, Multilingual Evaluation, and Pre-Trained Models},
author={Oepen, Stephan and Arefev, Nikolay and Aulamo, Mikko and Ba{\~n}{\'o}n, Marta and Buljan, Maja and Burchell, Laurie and Charpentier, Lucas and Chen, Pinzhen and Fedorova, Mariya and de Gibert, Ona and others},
journal={arXiv preprint arXiv:2511.01066},
year={2025}
}
```
## Contact Us
* Vladislav Mikhailov: [vladism@ifi.uio.no](mailto:vladism@ifi.uio.no)
* Stephan Oepen: [oe@ifi.uio.no](mailto:oe@ifi.uio.no)



