
gplsi/uji_parallel_va_en

Source: Hugging Face. Updated 2026-03-09. Indexed 2026-04-05.
Download link:
https://hf-mirror.com/datasets/gplsi/uji_parallel_va_en
Resource description:
---
license: cc-by-4.0
task_categories:
- translation
language:
- va
- en
size_categories:
- 100K<n<1M
---

# UJI_PARALLEL_VA_EN Dataset

## Dataset Summary

**UJI_PARALLEL_VA_EN** is a parallel dataset for machine translation between **Valencian (VA)** and **English (EN)**. It consists of aligned sentence pairs, each recorded with the source file from which it was extracted. The dataset is intended for research in machine translation, cross-lingual NLP, and linguistic analysis.

## Dataset Structure

Each row in the dataset includes the following fields:

- **VA**: A sentence in Valencian.
- **EN**: The corresponding English translation.
- **Source**: The file from which the parallel sentence pair was obtained.

## Dataset Creation

### Curation Rationale

This dataset aims to promote the development of machine translation between Valencian and English, supporting research in multilingual NLP and facilitating the development of translation systems for this language pair.

### Source Data

The parallel data in this dataset is extracted from web news published by the [Universitat Jaume I (UJI)](https://www.uji.es/com/sumari/noticies/). All sentence pairs originate from publicly available articles on the university's official communication channels.

### Data Filtering and Normalization

All data underwent filtering and normalization in three steps:

- **Alignment filtering**: Sentence- and paragraph-level alignments were calculated with the [`gplsi translation-alignment` tool](https://github.com/gplsi/translation-alignment).
- **Language identification**: Valencian documents were filtered using a private discriminative tool to differentiate them from Catalan.
- **Deduplication**: The filtered datasets were deduplicated to remove redundant sentence pairs.

The filtered and normalized datasets were then concatenated to form the final corpus.
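The deduplication step and the three-field row layout can be illustrated with a minimal sketch. It is not the authors' pipeline: the whitespace normalization, the exact-match criterion on the `(VA, EN)` pair, and the sample sentences and file names are all assumptions made for illustration; only the field names `VA`, `EN`, and `Source` come from the dataset card.

```python
from collections import Counter

# Hypothetical sample rows using the documented fields (VA, EN, Source).
# The sentences and file names are invented for illustration only.
rows = [
    {"VA": "Bon dia.", "EN": "Good morning.", "Source": "news_001.txt"},
    {"VA": "Bon dia.", "EN": "Good morning.", "Source": "news_002.txt"},
    {"VA": "  Moltes gràcies. ", "EN": "Thank you very much.", "Source": "news_002.txt"},
]


def normalize(text):
    """Collapse runs of whitespace and trim both ends (assumed normalization)."""
    return " ".join(text.split())


def deduplicate(rows):
    """Keep the first occurrence of each normalized (VA, EN) pair."""
    seen = set()
    unique = []
    for row in rows:
        key = (normalize(row["VA"]), normalize(row["EN"]))
        if key not in seen:
            seen.add(key)
            # Store the normalized text, keeping the Source of the first occurrence.
            unique.append({**row, "VA": key[0], "EN": key[1]})
    return unique


unique_rows = deduplicate(rows)

# Count how many unique pairs each source file contributes.
pairs_per_source = Counter(row["Source"] for row in unique_rows)
print(len(unique_rows), dict(pairs_per_source))
```

To try this on the real data, the rows could presumably be obtained with the Hugging Face `datasets` library via `load_dataset("gplsi/uji_parallel_va_en")`. The card does not state the split names, so inspect the returned object before iterating.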
## Funding

This dataset is funded by the *Ministerio para la Transformación Digital y de la Función Pública* (funded by **EU – NextGenerationEU**), within the framework of the project *Desarrollo de Modelos ALIA*.

<!--
## Acknowledgments

We extend our gratitude to all individuals and institutions that contributed to the development of this resource. Special thanks to:

- [Data providers]
- [Technological support providers]

We also acknowledge the financial, scientific, and technical contributions of the *Ministerio para la Transformación Digital y de la Función Pública – Funded by EU – NextGenerationEU* within the framework of the *Desarrollo de Modelos ALIA* project.
-->

## Reference

Please cite this dataset using the following BibTeX entry:

```bibtex
@misc{uji_parallel_va_en_2025,
  author       = {Espinosa Zaragoza, Sergio and Sep{\'u}lveda Torres, Robiert and Mu{\~n}oz Guillena, Rafael and Consuegra-Ayala, Juan Pablo},
  title        = {UJI\_PARALLEL\_VA\_EN Dataset},
  year         = {2025},
  institution  = {Language and Information Systems Group (GPLSI) and Centro de Inteligencia Digital (CENID), University of Alicante (UA)},
  howpublished = {\url{https://huggingface.co/datasets/gplsi/uji_parallel_va_en}}
}
```

## Disclaimer

This dataset may contain biases or unintended artifacts. Any third party using or deploying systems based on this dataset is solely responsible for ensuring compliant, safe, and ethical use, including adherence to relevant AI and data protection regulations. The University of Alicante, as creator and owner of the dataset, assumes no liability for outcomes resulting from third-party use.

## License

This work is licensed under a [Creative Commons Attribution 4.0 International (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/) licence.
Provider:
gplsi