orai-nlp/SAMSUM-eu
Source: Hugging Face (updated 2025-11-04)
Download link: https://hf-mirror.com/datasets/orai-nlp/SAMSUM-eu
Dataset description:
---
license: cc-by-nc-nd-4.0
task_categories:
- summarization
language:
- eu
---
# SAMSUM-eu dataset for Summarization in Basque.
SAMSUM-eu was created by automatically translating [SAMSum](https://aclanthology.org/D19-5409/), a human-annotated dialogue dataset for abstractive summarization, using a proprietary document-level MT system based on [Llama-eus-8B](https://huggingface.co/orai-nlp/Llama-eus-8B). We then filtered out examples with incomplete translations or non-Basque outputs. The translated test set was further refined by a native speaker to obtain 100 high-quality, manually curated test examples. In total, we obtained 11,313 training examples, 636 validation examples, and 100 manually curated test examples for evaluation.
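
The split sizes above can be checked after downloading the data. The following is a minimal sketch, not an official loading script: it assumes the dataset loads with the Hugging Face `datasets` library under the repository ID `orai-nlp/SAMSUM-eu` and that the splits are named `train`, `validation`, and `test`. Neither the split names nor the column names are stated in this card, so the helper names here are hypothetical and the loaded object should be inspected before relying on it.

```python
# Split sizes as stated in this dataset card.
EXPECTED_SIZES = {"train": 11313, "validation": 636, "test": 100}


def check_split_sizes(actual_sizes, expected=EXPECTED_SIZES):
    """Return {split: (expected, actual)} for every split whose size differs.

    An empty dict means all expected splits are present with the stated sizes.
    """
    return {
        split: (n, actual_sizes.get(split))
        for split, n in expected.items()
        if actual_sizes.get(split) != n
    }


def load_samsum_eu(repo_id="orai-nlp/SAMSUM-eu"):
    """Download the dataset from the Hugging Face Hub.

    Requires `pip install datasets` and network access. The split names in
    the returned DatasetDict are an assumption, not confirmed by the card.
    """
    # Imported lazily so check_split_sizes has no third-party dependency.
    from datasets import load_dataset

    return load_dataset(repo_id)


def main():
    dataset = load_samsum_eu()
    # Map each split name to its number of rows.
    sizes = {name: split.num_rows for name, split in dataset.items()}
    print(sizes)
    mismatches = check_split_sizes(sizes)
    print(mismatches if mismatches else "split sizes match the card")
```

Calling `main()` downloads the data and reports any split whose size differs from the figures in the card. Printing `dataset` itself shows the actual split and column names.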
📝 Paper: [Sub-1B Language Models for Low-Resource Languages: Training Strategies and Insights for Basque](https://aclanthology.org/2025.mrl-main.35/), accepted at the [5th Multilingual Representation Learning (MRL) Workshop 2025](https://sigtyp.github.io/ws2025-mrl.html) (EMNLP).
## Acknowledgments
The creation of this dataset has been partially funded by the Basque Government (ICL4LANG project, grant no. KK-2023/00094) and the European Union (EFA 104/01-LINGUATEC IA project, INTERREG POCTEFA 2021-2027 program). Finally, we thank Idoia Davila Uzkudun for her contributions to manual data curation and evaluation.
## Citation
If you use this dataset, please cite the following paper:
```bibtex
@inproceedings{urbizu2025sub,
title={Sub-1B Language Models for Low-Resource Languages: Training Strategies and Insights for {B}asque},
author={Urbizu, Gorka and Corral, Ander and Saralegi, Xabier and San Vicente, I{\~n}aki},
booktitle={Proceedings of the 5th Workshop on Multilingual Representation Learning (MRL 2025)},
pages={519--530},
year={2025}
}
```
## Contact
- Gorka Urbizu (g.urbizu@orai.eus)
- Xabier Saralegi (x.saralegi@orai.eus)
Provider: orai-nlp



