
orai-nlp/SAMSUM-eu

Source: Hugging Face. Updated: 2025-11-04. Indexed: 2026-01-03.
Download link:
https://hf-mirror.com/datasets/orai-nlp/SAMSUM-eu
Description:
---
license: cc-by-nc-nd-4.0
task_categories:
- summarization
language:
- eu
---

# SAMSUM-eu

dataset for Summarization in Basque.

SAMSUM-eu was created by automatically translating [SAMSum](https://aclanthology.org/D19-5409/), a human-annotated dialogue dataset for abstractive summarization, using a proprietary document-level MT system based on [Llama-eus-8B](https://huggingface.co/orai-nlp/Llama-eus-8B). We then filtered out examples with incomplete translations or non-Basque outputs. The translated test set was further refined by a native speaker to obtain 100 high-quality, manually curated test examples.

In total, we obtained 11,313 training examples, 636 validation examples, and 100 manually curated test examples for evaluation.

📝 Paper: [Sub-1B Language Models for Low-Resource Languages: Training Strategies and Insights for Basque](https://aclanthology.org/2025.mrl-main.35/) accepted in [5TH MULTILINGUAL REPRESENTATION LEARNING (MRL) WORKSHOP 2025](https://sigtyp.github.io/ws2025-mrl.html) (EMNLP)

## Acknowledgments

The creation of this dataset has been partially funded by the Basque Government (ICL4LANG project, grant no. KK-2023/00094) and the European Union (EFA 104/01-LINGUATEC IA project, INTERREG POCTEFA 2021-2027 program). Finally, we thank Idoia Davila Uzkudun for her contributions to manual data curation and evaluation.

## Citation

If you use this dataset please cite the following paper:

```bibtex
@inproceedings{urbizu2025sub,
  title={Sub-1B Language Models for Low-Resource Languages: Training Strategies and Insights for {B}asque},
  author={Urbizu, Gorka and Corral, Ander and Saralegi, Xabier and San Vicente, I{\~n}aki},
  booktitle={Proceedings of the 5th Workshop on Multilingual Representation Learning (MRL 2025)},
  pages={519--530},
  year={2025}
}
```

## Contact

- Gorka Urbizu (g.urbizu@orai.eus)
- Xabier Saralegi (x.saralegi@orai.eus)
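The split sizes reported in the description above (11,313 training, 636 validation, and 100 test examples) can be checked against a downloaded copy. The following is a minimal sketch, not an official loader. It assumes the Hugging Face `datasets` library is installed and that the repository exposes splits named `train`, `validation`, and `test`; the card does not state the split names. The helper names `load_samsum_eu` and `size_mismatches` are hypothetical and introduced only for illustration.

```python
# Split sizes as stated in the dataset description.
# Assumption: the splits are named "train", "validation" and "test".
EXPECTED_SIZES = {"train": 11313, "validation": 636, "test": 100}


def load_samsum_eu(repo_id="orai-nlp/SAMSUM-eu"):
    """Download the dataset from the Hugging Face Hub (needs network access)."""
    # Imported lazily so the size check below works without the library installed.
    from datasets import load_dataset

    return load_dataset(repo_id)


def size_mismatches(actual_sizes, expected_sizes=EXPECTED_SIZES):
    """Return {split: (expected, actual)} for every split whose size differs.

    A split that is missing from `actual_sizes` is reported with actual=None.
    """
    return {
        split: (expected, actual_sizes.get(split))
        for split, expected in expected_sizes.items()
        if actual_sizes.get(split) != expected
    }
```

With network access, `size_mismatches({name: split.num_rows for name, split in load_samsum_eu().items()})` should return an empty dictionary if the hosted splits match the description and the split-name assumption holds.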
Provider:
orai-nlp