BSC-LT/Spanish-Valencian_Catalan_Parallel_Corpus
Source: Hugging Face. Updated 2026-03-04; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/BSC-LT/Spanish-Valencian_Catalan_Parallel_Corpus
Resource description:
---
language:
- es
- ca
multilinguality:
- multilingual
pretty_name: Spanish-Valencian Catalan Parallel Corpus
task_categories:
- translation
license: cc-by-nc-4.0
---
# Dataset Card for Spanish-Valencian Catalan Parallel Corpus
## Dataset Description
- **Point of Contact:** langtech@bsc.es
### Dataset Summary
A bilingual parallel corpus of sentences in Spanish and the Valencian variant of Catalan. Built by aggregating and filtering multiple public sources, along with data obtained through direct data sharing with external partners, it provides sentence-level alignments for training Machine Translation systems. The dataset includes both authentic parallel data and synthetic Spanish translations generated from Valencian Catalan monolingual data using [SalamandraTA 7B Instruct](https://huggingface.co/BSC-LT/salamandra-7b-instruct).
### Supported Tasks and Leaderboards
The dataset is primarily designed for Machine Translation between Spanish and Valencian Catalan. Typical uses include supervised MT training, fine-tuning multilingual models, and data augmentation.
### Languages
The dataset contains one parallel language pair: Spanish-Valencian Catalan (es-ca_va), totaling 2,162,451 sentence pairs.
| Language pair | Codes | Size (sentence pairs) |
|---------------|-------|-----------------------|
| Spanish-Valencian Catalan | es-ca_va | 2,162,451 |
### The Valencian Catalan variant
The Valencian variant of Catalan is spoken primarily in the Valencian Community (Spain) and, to a lesser extent, in the comarca of El Carche in the Region of Murcia. Recognized as a co-official language of the Valencian Community alongside Spanish since 1982, Valencian is regulated by the Acadèmia Valenciana de la Llengua (AVL), the official normative institution for the language. Linguistically, Valencian belongs to the Western Catalan dialectal group and has distinctive phonetic, lexical, and morphological features.
## Dataset Structure
### Data Instances
The dataset is provided in parquet format. Each row contains a parallel sentence pair with the following structure:
```json
{
"l1_sentence": "Example sentence in first language",
"l2_sentence": "Example sentence in second language",
"l1": "es",
"l2": "ca_va"
}
```
### Data Fields
- `l1_sentence`: The sentence in the first language (string)
- `l2_sentence`: The parallel sentence in the second language (string)
- `l1`: ISO 639-1 code for Spanish (string)
- `l2`: specific language code for Valencian Catalan (string)
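
Because each row carries its own language codes, a translation direction can be selected by code rather than by column position. The following is a minimal sketch of that idea; the helper name `to_translation_pair` is hypothetical and not part of the dataset or of any library.

```python
from typing import Dict, Tuple


def to_translation_pair(row: Dict[str, str], src: str, tgt: str) -> Tuple[str, str]:
    """Return (source_text, target_text) for the requested direction.

    `row` is expected to have the four fields described above:
    l1_sentence, l2_sentence, l1, l2.
    """
    # Map each language code found in the row to its sentence.
    by_lang = {row["l1"]: row["l1_sentence"], row["l2"]: row["l2_sentence"]}
    if src not in by_lang or tgt not in by_lang:
        raise ValueError(f"row does not contain the pair {src}-{tgt}")
    return by_lang[src], by_lang[tgt]


# Placeholder row mirroring the structure shown under "Data Instances".
example = {
    "l1_sentence": "Example sentence in first language",
    "l2_sentence": "Example sentence in second language",
    "l1": "es",
    "l2": "ca_va",
}

# Valencian Catalan -> Spanish direction.
source_text, target_text = to_translation_pair(example, src="ca_va", tgt="es")
```

Looking sentences up by code keeps the same helper usable for both es→ca_va and ca_va→es training.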
### Data Splits
The dataset contains a single split: `train`.
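
Since only `train` is provided, users who need a held-out set have to carve one out themselves. The sketch below shows one possible approach, assuming the Hugging Face `datasets` library is installed and that the repository ID matches the one in this card's URL. The hash-based assignment, the `dev_fraction` parameter, and the function names are illustrative assumptions, not something the curators prescribe.

```python
import hashlib

REPO_ID = "BSC-LT/Spanish-Valencian_Catalan_Parallel_Corpus"


def load_train():
    """Load the single `train` split (requires network access and `datasets`)."""
    from datasets import load_dataset  # imported lazily so the helper below has no dependency

    return load_dataset(REPO_ID, split="train")


def assign_split(l1_sentence: str, l2_sentence: str, dev_fraction: float = 0.01) -> str:
    """Deterministically assign a sentence pair to 'train' or 'dev'.

    The assignment depends only on the text of the pair, so it is stable
    across runs and machines, and identical pairs always share a split.
    """
    key = f"{l1_sentence}\t{l2_sentence}".encode("utf-8")
    # Interpret the first 8 bytes of the SHA-256 digest as an integer in [0, 2**64).
    bucket = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    threshold = int(dev_fraction * 2**64)
    return "dev" if bucket < threshold else "train"
```

With the dataset loaded, `load_train().filter(lambda r: assign_split(r["l1_sentence"], r["l2_sentence"]) == "dev")` would select the held-out rows. Because most of the corpus comes from administrative and legal text, a held-out set built this way measures in-domain performance only.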
## Dataset Creation
### Curation Rationale
As a Catalan variant and co-official language of Spain, Valencian lacks official representation in the ISO 639 standard for language name codes, where only the generic code for Catalan (CA) is available. While Glottolog does provide a specific code for Valencian ("val"), in the NLP and digital AI resource landscape, the generic CA code is predominantly used. This creates a significant challenge: the vast majority of publicly available resources (datasets and language models) fail to distinguish between Catalan variants, resulting in data that mixes different varieties and consequently exhibits poor linguistic quality and specificity. Similarly, machine translation models often produce outputs that conflate various Catalan variants.
With this dataset, we aim to promote deeper research into these linguistic variants and contribute to improving the quality of machine translation systems. By providing textual data resources specifically focused on the Valencian variant, we seek to enable more precise and linguistically accurate NLP applications. For this purpose, we label our data with a specific code ("ca_va") to explicitly distinguish Valencian from other Catalan varieties.
This dataset is therefore aimed at promoting the development of Machine Translation between Spanish and Valencian Catalan, supporting research in bilingual and multilingual NLP with proper linguistic granularity, and facilitating the development of translation systems that respect and preserve the unique characteristics of low-resource language varieties.
### Source Data
#### Initial Data Collection and Normalization
The corpus is a combination of authentic Spanish-Valencian Catalan parallel data and synthetic Spanish translations generated from Valencian Catalan monolingual data. Data was collected via direct data sharing agreements between the BSC and other parties, as well as from public web-based sources.
**Parallel source datasets:**
- [**BOUA parallel**](https://github.com/transducens/PILAR/tree/main/valencian/BOUA): parallel corpus created from the HTML version of the Butlletí Oficial Universitat d'Alacant (BOUA)
- [**DOGV parallel**](https://github.com/transducens/PILAR/tree/main/valencian/DOGV-html): parallel corpus created from the HTML version of the Diari Oficial de la Generalitat Valenciana (DOGV)
- [**BOUMH parallel**](https://github.com/transducens/PILAR/tree/main/valencian/BOUMH): parallel corpus created from the Boletín Oficial de la Universidad Miguel Hernández (BOUMH)
- [**Generalitat parallel**](https://github.com/transducens/PILAR/tree/main/valencian/Generalitat): parallel corpus extracted from different Valencian government-owned websites
**Monolingual source datasets:**
- **Les Corts Valencianes**: Data was collected from the transcriptions of the 2023 sessions of the Valencian Parliament, thanks to the [NEL-VIVES](https://vives.gplsi.es/) campaign, an initiative developed by [Cenid](https://cenid.es/), the Digital Intelligence Center of the University of Alicante.
**Synthetic Data Generation:**
For monolingual Valencian Catalan data, synthetic Spanish parallel data was created by translating the Valencian text into Spanish with [SalamandraTA 7B Instruct](https://huggingface.co/BSC-LT/salamandra-7b-instruct).
**Data Filtering and Normalization:**
The data underwent minimal filtering due to data scarcity:
- **Normalization**: Text was minimally normalized using [Bifixer](https://github.com/bitextor/bifixer) to ensure consistency and quality.
- **Deduplication**: The filtered datasets were deduplicated to remove redundant sentence pairs.
The filtered and normalized datasets were then concatenated to form the final corpus.
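
The deduplication step can be illustrated with a short sketch. This is not Bifixer and not the curators' actual pipeline, whose exact settings are not documented here. It only shows exact-match deduplication of sentence pairs after a light normalization (Unicode NFC and whitespace collapsing), which is an assumption made for the example.

```python
import unicodedata
from typing import Iterable, Iterator, Tuple


def normalize(text: str) -> str:
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def dedup_pairs(pairs: Iterable[Tuple[str, str]]) -> Iterator[Tuple[str, str]]:
    """Yield each sentence pair once, keeping the first occurrence.

    Two pairs count as duplicates when both sides match after `normalize`.
    The original (un-normalized) text of the first occurrence is yielded.
    """
    seen = set()
    for src, tgt in pairs:
        key = (normalize(src), normalize(tgt))
        if key in seen:
            continue
        seen.add(key)
        yield src, tgt


pairs = [
    ("Hola  mundo", "Hola món"),  # double space on the Spanish side
    ("Hola mundo", "Hola món"),   # duplicate of the first pair after normalization
    ("Adiós", "Adéu"),
]
unique_pairs = list(dedup_pairs(pairs))
```

Keying on the pair rather than on a single side keeps sentences that legitimately appear with more than one translation.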
#### Who are the source language producers?
- [Departament de Llenguatges i Sistemes Informàtics Universitat d’Alacant](http://transducens.dlsi.ua.es/)
- [Les Corts Valencianes](https://www.cortsvalencianes.es/): this data belongs to Les Corts Valencianes and is released in accordance with their [terms of use](https://www.cortsvalencianes.es/ca-va/avis-legal).
### Annotations
#### Annotation process
The dataset does not contain any manual annotations beyond the parallel alignments, which were either preserved from source datasets or validated through automated alignment scoring.
#### Who are the annotators?
[N/A]
### Personal and Sensitive Information
Given that this dataset is derived from pre-existing datasets that may contain crawled data, and that no specific anonymisation process has been applied, personal and sensitive information may be present in the data. This needs to be considered when using the data for training models.
## Considerations for Using the Data
### Social Impact of Dataset
By providing this resource specifically focused on the Valencian variant of Catalan, we aim to address a critical gap in NLP resources for co-official languages of Spain. The conflation of linguistic variants under generic language codes (such as using CA for all Catalan varieties) has historically resulted in lower-quality NLP tools that fail to respect the unique characteristics of individual language varieties. This has a direct impact on speaker communities, as translation systems and language technologies that mix variants can produce outputs that are linguistically inaccurate or culturally inappropriate.
### Discussion of Biases
No specific bias mitigation strategies were applied to this dataset beyond deduplication and minimal quality filtering. Inherent biases may exist within the data, reflecting the biases present in the source datasets, which include web-crawled content, subtitles, news articles, and other user-generated or institutionally produced text. Users should be aware that the dataset contains synthetically generated Spanish text, which may reflect biases present in the translation model used.
### Other Known Limitations
The dataset contains almost exclusively data from the administrative and legal domains. Application of this dataset in a general domain or in other domains such as biomedical, technical, or other specialized fields would be of limited use. Additionally, the synthetic Spanish data may not achieve the same quality or naturalness as naturally parallel data.
## Additional Information
### Dataset Curators
Language Technologies Unit at the Barcelona Supercomputing Center (langtech@bsc.es).
### Funding
This work has been promoted and financed by the Government of Catalonia through the [Aina Project](https://projecteaina.cat/).
This work is funded by the _Ministerio para la Transformación Digital y de la Función Pública_ - Funded by EU – NextGenerationEU within the framework of the [ILENIA Project](https://proyectoilenia.es/) with reference 2022/TL22/00215337.
### Acknowledgements
We gratefully acknowledge the following entities for their valuable contribution of data to this corpus:
- [Transducens, Departament de Llenguatges i Sistemes Informàtics Universitat d'Alacant](https://github.com/transducens/PILAR)
- [Cenid, Digital Intelligence Center of the University of Alicante](https://cenid.es/)
- [Les Corts Valencianes](https://www.cortsvalencianes.es/)
### Licensing Information
This work is licensed under a [Creative Commons Attribution NonCommercial 4.0 International](https://creativecommons.org/licenses/by-nc/4.0/) licence.
### Citation Information
[N/A]
### Contributions
[N/A]
Provider:
BSC-LT



