
HPLT/2508-wds-evals

Source: Hugging Face. Updated 2025-11-24; indexed 2026-01-03.
Download link:
https://hf-mirror.com/datasets/HPLT/2508-wds-evals
---
license: apache-2.0
dataset_info:
- config_name: fra_Latn
  features:
  - name: corpus
    dtype: string
  - name: category
    dtype: string
  - name: dataset
    dtype: string
  - name: task
    dtype: string
  - name: prompt
    dtype: string
  - name: model
    dtype: string
  - name: ckpt_num
    dtype: int64
  - name: score
    dtype: float64
  splits:
  - name: results
    num_bytes: 909987
    num_examples: 3784
  download_size: 33013
  dataset_size: 909987
- config_name: spa_Latn
  features:
  - name: corpus
    dtype: string
  - name: category
    dtype: string
  - name: dataset
    dtype: string
  - name: task
    dtype: string
  - name: prompt
    dtype: string
  - name: model
    dtype: string
  - name: ckpt_num
    dtype: int64
  - name: score
    dtype: float64
  splits:
  - name: results
    num_bytes: 1511136
    num_examples: 6912
  download_size: 71232
  dataset_size: 1511136
configs:
- config_name: fra_Latn
  data_files:
  - split: results
    path: fra_Latn/results-*
- config_name: spa_Latn
  data_files:
  - split: results
    path: spa_Latn/results-*
language:
- es
- fr
---

# HPLT 3.0: Details on WDS-based Sampling Evaluation Results

### Dataset Description

This dataset contains fine-grained results from our HPLT 3.0 release evaluations, which compare the new HPLT 3.0 corpora sampled at different Web Document Scorer (WDS) thresholds, focusing on Spanish and French. We compare three configurations: `Top`, `Random`, and `Bottom`. `Random` sampling is the default approach and draws uniformly from the full corpus, while `Top` and `Bottom` take advantage of the sorting by WDS levels and sequentially draw 100B training tokens from either end of the corpus.
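The three sampling configurations can be illustrated with a minimal sketch. This is not the HPLT pipeline: the `(wds_score, n_tokens)` document representation, the function names, and the assumption that a higher WDS score means a better document are all hypothetical and chosen only for illustration.

```python
import random

# Hypothetical document representation: (wds_score, n_tokens).
Doc = tuple[float, int]


def take_until_budget(docs: list[Doc], budget: int) -> list[Doc]:
    """Take documents in the given order until the token budget is reached."""
    picked: list[Doc] = []
    total = 0
    for doc in docs:
        if total >= budget:
            break
        picked.append(doc)
        total += doc[1]
    return picked


def sample_corpus(docs: list[Doc], budget: int, mode: str, seed: int = 0) -> list[Doc]:
    """Select documents for one of the `Top`, `Random`, or `Bottom` configurations."""
    if mode == "Top":
        # Assumption: higher WDS score = higher quality, so read from the high end.
        ordered = sorted(docs, key=lambda d: d[0], reverse=True)
    elif mode == "Bottom":
        ordered = sorted(docs, key=lambda d: d[0])
    elif mode == "Random":
        # Uniform draw over the full corpus, ignoring the WDS ordering.
        ordered = list(docs)
        random.Random(seed).shuffle(ordered)
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return take_until_budget(ordered, budget)


# Toy corpus with made-up scores and token counts.
docs = [(9.0, 40), (7.5, 30), (5.0, 50), (3.0, 20), (1.0, 60)]
top = sample_corpus(docs, budget=60, mode="Top")
bottom = sample_corpus(docs, budget=60, mode="Bottom")
rnd = sample_corpus(docs, budget=60, mode="Random")
```

In this toy setup the budget is 60 tokens, standing in for the 100B-token budget described above; `Top` and `Bottom` read sequentially from opposite ends of the sorted list, while `Random` ignores the ordering.
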

We pretrain 2.2B Llama-style decoder models on 100B tokens for each selected language and evaluate them using [HPLT-E](https://github.com/hplt-project/hplt-e/tree/main), a multilingual evaluation framework for comprehensive multi-prompt *k*-shot evaluation across 124 tasks and 500+ prompts in nine typologically diverse languages: Spanish (`spa_Latn`), French (`fra_Latn`), Czech (`ces_Latn`), Ukrainian (`ukr_Cyrl`), Finnish (`fin_Latn`), Catalan (`cat_Latn`), Galician (`glg_Latn`), Basque (`eus_Latn`), and Norwegian (Bokmål and Nynorsk; `nor_Latn`).

- **Curated by:** [High Performance Language Technologies (HPLT)](https://hplt-project.org)
- **Languages:** Spanish and French
- **Paper:** [arxiv.org/abs/2511.01066](https://arxiv.org/abs/2511.01066)
- **Repository:** [github.com/hplt-project/hplt-e](https://github.com/hplt-project/hplt-e/tree/main)
- **License:** Apache 2.0

Please find more details in our paper and GitHub repository.

## Uses

This dataset is intended for reproducibility and research purposes. The following example shows how to access the results:

```python
from datasets import load_dataset

# Load the Spanish results split and convert it to a pandas DataFrame.
dataset = load_dataset("HPLT/2508-wds-evals", "spa_Latn", split="results").to_pandas()
```

## Dataset Structure

### Dataset Instances

Each dataset instance looks as follows:

```python
{
    'corpus': 'Bottom',
    'category': 'Paraphrase detection',
    'dataset': 'paws_es',
    'task': 'paws_es_p2',
    'prompt': 'Oración 1: {{sentence1}}\nOración 2: {{sentence2}}\nPregunta: ¿Las oraciones 1 y 2 expresan el mismo significado? ¿Sí o no?\nRespuesta:',
    'model': '50B',
    'ckpt_num': 24000,
    'score': 45.35
}
```

### Dataset Fields

- `corpus`: corpus name (`Top`, `Random`, `Bottom`)
- `category`: task category
- `dataset`: evaluation dataset name
- `task`: evaluation task (refers to a specific prompt)
- `prompt`: prompt used for evaluation
- `model`: number of pretraining tokens, in billions (e.g. `50B`)
- `ckpt_num`: numeric checkpoint identifier for `model`
- `score`: standard metric performance score

## Cite Us

```
@article{oepen2025hplt,
  title={HPLT\~{} 3.0: Very Large-Scale Multilingual Resources for LLM and MT. Mono-and Bi-lingual Data, Multilingual Evaluation, and Pre-Trained Models},
  author={Oepen, Stephan and Arefev, Nikolay and Aulamo, Mikko and Ba{\~n}{\'o}n, Marta and Buljan, Maja and Burchell, Laurie and Charpentier, Lucas and Chen, Pinzhen and Fedorova, Mariya and de Gibert, Ona and others},
  journal={arXiv preprint arXiv:2511.01066},
  year={2025}
}
```

## Contact Us

* Vladislav Mikhailov [vladism@ifi.uio.no](mailto:vladism@ifi.uio.no)
* Stephan Oepen [oe@ifi.uio.no](mailto:oe@ifi.uio.no)
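
Returning to the fields listed under Dataset Fields, the sketch below shows one possible way to compare the `Top`, `Random`, and `Bottom` corpora: averaging `score` per `corpus` at a given `ckpt_num`. The helper name `mean_score_by_corpus` and the sample rows are hypothetical, and a plain mean across tasks is an assumption made for illustration; it is not necessarily the aggregation used in the paper, since tasks may use different metrics.

```python
from collections import defaultdict
from statistics import mean


def mean_score_by_corpus(rows, ckpt_num=None):
    """Average `score` per `corpus`, optionally restricted to one checkpoint.

    `rows` is any iterable of dicts with the dataset's field names, for
    example the rows of the `results` split or a plain list of dicts.
    """
    scores = defaultdict(list)
    for row in rows:
        if ckpt_num is None or row["ckpt_num"] == ckpt_num:
            scores[row["corpus"]].append(row["score"])
    return {corpus: mean(values) for corpus, values in scores.items()}


# Made-up rows for illustration only; these are not real evaluation results.
rows = [
    {"corpus": "Top", "task": "task_a_p1", "ckpt_num": 1000, "score": 50.0},
    {"corpus": "Top", "task": "task_a_p2", "ckpt_num": 1000, "score": 60.0},
    {"corpus": "Bottom", "task": "task_a_p1", "ckpt_num": 1000, "score": 40.0},
    {"corpus": "Bottom", "task": "task_a_p1", "ckpt_num": 2000, "score": 45.0},
]

summary = mean_score_by_corpus(rows, ckpt_num=1000)
print(summary)
```

Because the helper only relies on dictionary-style access, it should also accept the rows of the split returned by `load_dataset(...)` before the `.to_pandas()` conversion.
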
Provider: HPLT