gplsi/uji_parallel_va_en

Name: gplsi/uji_parallel_va_en
Creator: gplsi
Published: 2026-03-09 12:25:18
License: CC BY 4.0 (as stated in the dataset card)

Source: Hugging Face. Updated 2026-03-09; indexed 2026-04-05.

Download link:

https://hf-mirror.com/datasets/gplsi/uji_parallel_va_en

Description:

---
license: cc-by-4.0
task_categories:
- translation
language:
- va
- en
size_categories:
- 100K<n<1M
---

# UJI_PARALLEL_VA_EN Dataset

## Dataset Summary

**UJI_PARALLEL_VA_EN** is a parallel dataset for machine translation between **Valencian (VA)** and **English (EN)**. It consists of aligned sentence pairs, each recorded together with the source file from which it was extracted. The dataset is intended for research in machine translation, cross-lingual NLP, and linguistic analysis.

## Dataset Structure

Each row in the dataset includes the following fields:

- **VA**: A sentence in Valencian.
- **EN**: The corresponding English translation.
- **Source**: The file from which the parallel sentence pair was obtained.

## Dataset Creation

### Curation Rationale

This dataset aims to promote the development of machine translation between Valencian and English, supporting research in multilingual NLP and facilitating the development of translation systems for this language pair.

### Source Data

The parallel data in this dataset is extracted from web news published by the [Universitat Jaume I (UJI)](https://www.uji.es/com/sumari/noticies/). All sentence pairs originate from publicly available articles on the university's official communication channels.

### Data Filtering and Normalization

All data was filtered and normalized in three steps:

- **Alignment filtering**: Sentence- and paragraph-level alignments were calculated with the [`gplsi translation-alignment` tool](https://github.com/gplsi/translation-alignment).
- **Language identification**: Valencian documents were filtered using a private discriminative tool that distinguishes Valencian from Catalan.
- **Deduplication**: The filtered datasets were deduplicated to remove redundant sentence pairs.

The filtered and normalized datasets were then concatenated to form the final corpus.
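The card does not say how deduplication was performed. As an illustration only, the sketch below assumes the simplest reading: two rows are duplicates when their VA and EN texts match after Unicode normalization, whitespace collapsing, and case folding. The helper names `normalize` and `deduplicate` and the sample rows are hypothetical and are not taken from the dataset or from the GPLSI tooling.

```python
import unicodedata
from typing import Dict, Iterable, List

Row = Dict[str, str]  # one dataset row with the fields VA, EN and Source


def normalize(text: str) -> str:
    """Apply NFC normalization, collapse whitespace runs, and case-fold."""
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.split()).casefold()


def deduplicate(rows: Iterable[Row]) -> List[Row]:
    """Keep the first occurrence of each (VA, EN) pair, compared after normalization.

    Assumption: exact matching on normalized text; the actual pipeline may differ.
    """
    seen = set()
    unique = []
    for row in rows:
        key = (normalize(row["VA"]), normalize(row["EN"]))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Invented sample rows, used only to exercise the function.
rows = [
    {"VA": "Bon dia.", "EN": "Good morning.", "Source": "a.txt"},
    {"VA": "Bon  dia.", "EN": "good morning.", "Source": "b.txt"},
    {"VA": "Gràcies.", "EN": "Thank you.", "Source": "a.txt"},
]
unique_rows = deduplicate(rows)
print(len(unique_rows))
```

Keeping the first occurrence preserves the `Source` value of the earliest row, which is a design choice of this sketch and not something the card specifies.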
## Funding

This dataset is funded by the *Ministerio para la Transformación Digital y de la Función Pública* (funded by **EU – NextGenerationEU**), within the framework of the project *Desarrollo de Modelos ALIA*.

## Reference

Please cite this dataset using the following BibTeX entry:

```bibtex
@misc{uji_parallel_va_en_2025,
  author       = {Espinosa Zaragoza, Sergio and Sep{\'u}lveda Torres, Robiert and Mu{\~n}oz Guillena, Rafael and Consuegra-Ayala, Juan Pablo},
  title        = {UJI\_PARALLEL\_VA\_EN Dataset},
  year         = {2025},
  institution  = {Language and Information Systems Group (GPLSI) and Centro de Inteligencia Digital (CENID), University of Alicante (UA)},
  howpublished = {\url{https://huggingface.co/datasets/gplsi/uji_parallel_va_en}}
}
```

## Disclaimer

This dataset may contain biases or unintended artifacts. Any third party using or deploying systems based on this dataset is solely responsible for ensuring compliant, safe, and ethical use, including adherence to relevant AI and data protection regulations. The University of Alicante, as creator and owner of the dataset, assumes no liability for outcomes resulting from third-party use.

## License

This work is licensed under a [Creative Commons Attribution 4.0 International (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/) licence.

Provider:

gplsi
