
BSC-LT/Catalan-Aranese_Parallel_Corpus

Source: Hugging Face. Updated 2026-02-06. Indexed 2026-04-05.
Download link:
https://hf-mirror.com/datasets/BSC-LT/Catalan-Aranese_Parallel_Corpus
Resource description:
---
language:
- ca
- oc
multilinguality:
- multilingual
pretty_name: Catalan-Aranese Parallel Corpus
size_categories:
- 100K<n<1M
task_categories:
- translation
license: cc-by-4.0
---

# Dataset Card for Catalan-Aranese Parallel Corpus

## Dataset Description

- **Point of Contact:** langtech@bsc.es

### Dataset Summary

A bilingual parallel corpus for the low-resource language pair Catalan-Aranese. It was built by aggregating and filtering multiple public sources, together with data obtained through direct data sharing with external partners, and it provides sentence-level alignments for training Machine Translation systems. The dataset includes both authentic parallel data and synthetic Catalan translations generated from Aranese monolingual data using [SalamandraTA 7B Instruct](https://huggingface.co/BSC-LT/salamandra-7b-instruct).

### Supported Tasks and Leaderboards

The dataset is primarily designed for Machine Translation between Catalan and Aranese. Typical uses include supervised MT training, fine-tuning multilingual models, and data augmentation.

### Languages

The dataset contains one parallel language pair: Catalan–Aranese (ca-oc_arn), totaling 539,110 sentence pairs.

| Language pair | Codes | Size (sentences) |
|-------------------|-----------|------------------|
| Catalan-Aranese | ca-oc_arn | 539,110 |

### The Aranese language

Aranese is a variant of the Occitan language spoken in the Aran Valley, in the province of Lerida, Spain. The Occitan language belongs to the Romance or Neo-Latin language group and consists of six dialect groups: Vivaro-Alpine, Provençal, Limousin, Auvergnat, Languedocien and Gascon. Aranese is a variant of the Gascon dialect.

According to the 1978 Statute of Catalonia, Aranese is subject to teaching and protection. The Law on Administrative Autonomy of the Aran Valley establishes that Aranese is a co-official language in the Aran Valley, along with Catalan and Spanish.
In accordance with these regulations, Aranese is taught at all levels of compulsory education and is also used for communication among the public administrations and with the general public.

## Dataset Structure

### Data Instances

The dataset is provided in parquet format. Each row contains a parallel sentence pair with the following structure:

```json
{
  "l1_sentence": "Example sentence in first language",
  "l2_sentence": "Example sentence in second language",
  "l1": "ca",
  "l2": "oc_arn"
}
```

### Data Fields

- `l1_sentence`: The sentence in the first language (string)
- `l2_sentence`: The parallel sentence in the second language (string)
- `l1`: ISO 639-1 code for Catalan (string)
- `l2`: specific language code for Aranese (string)

### Data Splits

The dataset contains a single split: `train`.

## Dataset Creation

### Curation Rationale

As an extremely low-resource language, Aranese lacks official representation in the ISO 639 standard for language name codes, where only the generic code for Occitan (OC) is available. While some systems do provide specific codes for Aranese, such as Glottolog ("aran1260") and IETF ("oc-aranes"), in the NLP and digital AI resource landscape, the generic OC code is predominantly used. This creates a significant challenge: the vast majority of publicly available resources (datasets and language models) fail to distinguish between Occitan variants, resulting in data that mixes different varieties and consequently exhibits poor linguistic quality and specificity. Similarly, machine translation models often produce outputs that conflate various Occitan variants.

With this dataset and other resources we are releasing, we aim to promote deeper research into these linguistic variants and contribute to improving the quality of machine translation systems. By providing textual data resources specifically focused on the Aranese variant of Occitan, we seek to enable more precise and linguistically accurate NLP applications.
For this purpose, we have adopted a specific code ("oc_arn") to label our data, explicitly distinguishing Aranese from other Occitan varieties.

This dataset is therefore aimed at promoting the development of Machine Translation between Catalan and Aranese, supporting research in bilingual and multilingual NLP with proper linguistic granularity, and facilitating the development of translation systems that respect and preserve the unique characteristics of low-resource language varieties.

### Source Data

#### Initial Data Collection and Normalization

The corpus is a combination of authentic Catalan-Aranese parallel data and synthetic Catalan translations generated from Aranese monolingual data. Data was collected via direct data sharing agreements between the BSC and other parties, as well as from public web-based sources.

**Bilingual source datasets:**

- **DOGC**: Parallel text extracted from [Diari Oficial de la Generalitat de Catalunya](https://dogc.gencat.cat/ca/inici/)
- **JS Translations**: Aranese-Catalan translations produced by a professional translator and obtained through direct data sharing
- Parallel text collected from several public web-based sources

**Monolingual source datasets:**

- **PILAR**: Monolingual Aranese [Pan-Iberian Language Archival Resource corpus](https://github.com/transducens/PILAR/tree/main/aranese) from literary and crawled domains produced by the research group Transducens from the University of Alicante
- **IEA**: Monolingual Aranese collection of documents from the [Institute of Aranese Studies - Acadèmia Aranesa de la Léngua Occitan (IEA-AALO)](http://www.institutestudisaranesi.cat)
- **Escaletas TV3**: Monolingual Aranese text extracted from the plots for the Aranese daily broadcast from Televisió de Catalunya's TV3 channel
- Aranese texts collected from several public web-based sources

**Synthetic Data Generation:**

For monolingual Aranese data, synthetic Catalan parallel data was created by translating using [SalamandraTA 7B Instruct](https://huggingface.co/BSC-LT/salamandra-7b-instruct). The following monolingual datasets were translated to provide the Catalan side:

| Dataset | Size (sentences) |
|--------------------|------------------|
| PILAR & IEA | 392,300 |
| Public web sources | 48,536 |
| Escaletas TV3 | 2,944 |
| **Total** | **443,780** |

**Data Filtering and Normalization:**

The data underwent minimal filtering due to data scarcity:

- **Normalization**: Text was minimally normalized using [Bifixer](https://github.com/bitextor/bifixer) to ensure consistency and quality.
- **Deduplication**: The filtered datasets were deduplicated to remove redundant sentence pairs.

The filtered and normalized datasets were then concatenated to form the final corpus.

#### Who are the source language producers?

- Jordi Suïls Subirà - Universitat de Lleida (UdL)
- [Departament de Llenguatges i Sistemes Informàtics Universitat d’Alacant](https://github.com/transducens/PILAR)
- [Corporació Catalana de Mitjans Audiovisuals - 3Cat](https://www.3cat.cat/corporatiu/ca/el-grup/)
- [Lo Congrès permanent de la lenga occitana](https://locongres.org/)
- [Acadèmia Aranesa de la Léngua Occitan (IEA-AALO)](http://www.institutestudisaranesi.cat)
- [Consorcio del proyecto Linguatec-IA](https://linguatec-ia.eu/consorcio/)

### Annotations

#### Annotation process

The dataset does not contain any manual annotations beyond the parallel alignments, which were either preserved from source datasets or validated through automated alignment scoring.

#### Who are the annotators?

[N/A]

### Personal and Sensitive Information

Given that this dataset is derived from pre-existing datasets that may contain crawled data, and that no specific anonymisation process has been applied, personal and sensitive information may be present in the data. This needs to be considered when using the data for training models.
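Returning to the deduplication step listed under Data Filtering and Normalization: the card does not say how duplicate pairs were detected, so the following is only a minimal sketch of one common approach, not the curators' pipeline and not Bifixer's algorithm. The helper names `pair_key` and `deduplicate`, and the choice to compare pairs after Unicode NFC normalization, whitespace collapsing and case folding, are assumptions made for illustration.

```python
import unicodedata


def pair_key(src, tgt):
    """Build a comparison key for a sentence pair (illustrative normalization)."""
    def norm(text):
        # NFC-normalize, collapse runs of whitespace, and ignore case.
        return " ".join(unicodedata.normalize("NFC", text).split()).casefold()
    return (norm(src), norm(tgt))


def deduplicate(rows):
    """Keep the first occurrence of each sentence pair, preserving order."""
    seen = set()
    kept = []
    for row in rows:
        key = pair_key(row["l1_sentence"], row["l2_sentence"])
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


# Placeholder rows using the field names from the card; the text is not real data.
rows = [
    {"l1_sentence": "placeholder ca 1", "l2_sentence": "placeholder arn 1", "l1": "ca", "l2": "oc_arn"},
    {"l1_sentence": "Placeholder  ca 1 ", "l2_sentence": "placeholder arn 1", "l1": "ca", "l2": "oc_arn"},
    {"l1_sentence": "placeholder ca 2", "l2_sentence": "placeholder arn 2", "l1": "ca", "l2": "oc_arn"},
]
unique_rows = deduplicate(rows)
```

Keying on both sides of the pair keeps rows that share a source sentence but differ in translation; a stricter pipeline might key on one side only.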

## Considerations for Using the Data

### Social Impact of Dataset

By providing this resource specifically focused on the Aranese variant of Occitan, we aim to address a critical gap in NLP resources for extremely low-resource languages. The conflation of linguistic variants under generic language codes (such as using OC for all Occitan varieties) has historically resulted in lower-quality NLP tools that fail to respect the unique characteristics of individual language varieties. This has a direct impact on speaker communities, as translation systems and language technologies that mix variants can produce outputs that are linguistically inaccurate or culturally inappropriate.

Furthermore, by making high-quality Aranese data publicly available, we enable researchers and developers to create technologies that better serve minority language communities, respecting their linguistic identity and contributing to the vitality and continued use of Aranese in digital contexts.

### Discussion of Biases

No specific bias mitigation strategies were applied to this dataset beyond deduplication and minimal quality filtering. Inherent biases may exist within the data, reflecting the biases present in the source datasets, which include web-crawled content, subtitles, news articles, and other user-generated or institutionally produced text. Users should be aware that the dataset contains synthetically generated Catalan text, which may reflect biases present in the translation model used.

### Other Known Limitations

The dataset contains predominantly data from the administrative and legal domains, as well as news articles. Applying this dataset in other domains, such as biomedical, technical, or other specialized fields, would be of limited use. Additionally, the synthetic Catalan data may not achieve the same quality or naturalness as naturally parallel data.
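To put the synthetic-data limitation in proportion, the figures reported in the card can be combined as in the sketch below. It assumes that the sizes in the synthetic data table count pairs that are part of the final 539,110, which the card does not state explicitly, so the resulting share should be read as an estimate. The function name `synthetic_summary` is hypothetical.

```python
# Figures copied from the card's synthetic data table and language table.
SYNTHETIC_SOURCES = {
    "PILAR & IEA": 392_300,
    "Public web sources": 48_536,
    "Escaletas TV3": 2_944,
}
TOTAL_PAIRS = 539_110


def synthetic_summary(sources, total):
    """Sum the synthetic sources and relate them to the corpus total.

    Assumption: every synthetic pair counted in `sources` is included in `total`.
    """
    synthetic = sum(sources.values())
    return {
        "synthetic": synthetic,
        "remaining": total - synthetic,
        "synthetic_share": synthetic / total,
    }


summary = synthetic_summary(SYNTHETIC_SOURCES, TOTAL_PAIRS)
print(f"synthetic pairs: {summary['synthetic']:,}")
print(f"estimated share: {summary['synthetic_share']:.1%}")
```

Under that assumption the per-source sizes add up to the table's stated total of 443,780, and most of the Catalan side of the corpus is machine-generated.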

## Additional Information

### Dataset Curators

Language Technologies Unit at the Barcelona Supercomputing Center (langtech@bsc.es).

### Funding

This work has been promoted and financed by the Government of Catalonia through the [Aina Project](https://projecteaina.cat/).

This work is funded by the _Ministerio para la Transformación Digital y de la Función Pública_ - Funded by EU – NextGenerationEU within the framework of [ILENIA Project](https://proyectoilenia.es/) with reference 2022/TL22/00215337.

### Acknowledgements

We gratefully acknowledge the following individuals and organizations for their valuable contribution of data to this corpus:

- Jordi Suïls Subirà - Universitat de Lleida (UdL)
- [Departament de Llenguatges i Sistemes Informàtics Universitat d’Alacant](https://github.com/transducens/PILAR)
- [Corporació Catalana de Mitjans Audiovisuals - 3Cat](https://www.3cat.cat/corporatiu/ca/el-grup/)
- [Lo Congrès permanent de la lenga occitana](https://locongres.org/)
- [Acadèmia Aranesa de la Léngua Occitan (IEA-AALO)](http://www.institutestudisaranesi.cat)
- [Consorcio del proyecto Linguatec-IA](https://linguatec-ia.eu/consorcio/)

### Licensing Information

This work is licensed under a [Creative Commons Attribution 4.0 International](https://creativecommons.org/licenses/by/4.0/) licence.

### Citation Information

[N/A]

### Contributions

[N/A]
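
As a usage sketch for the row structure given under Data Instances, the snippet below turns rows into (source, target) tuples for MT training. The function name `to_translation_pairs`, the decision to skip malformed rows, and the handling of rows whose `l1`/`l2` are swapped are all assumptions for illustration; the card only documents rows with `l1` = `ca` and `l2` = `oc_arn`.

```python
REQUIRED_FIELDS = ("l1_sentence", "l2_sentence", "l1", "l2")


def to_translation_pairs(rows, src_lang="ca", tgt_lang="oc_arn"):
    """Return (source, target) sentence tuples in the requested direction.

    Rows with missing or non-string fields, or with other language codes,
    are skipped (an illustrative policy, not something the card prescribes).
    """
    pairs = []
    for row in rows:
        if any(not isinstance(row.get(field), str) for field in REQUIRED_FIELDS):
            continue
        codes = (row["l1"], row["l2"])
        if codes == (src_lang, tgt_lang):
            pairs.append((row["l1_sentence"], row["l2_sentence"]))
        elif codes == (tgt_lang, src_lang):
            # Requested direction is the reverse of how the row is stored.
            pairs.append((row["l2_sentence"], row["l1_sentence"]))
    return pairs


# Placeholder rows; the sentence text is not real corpus data.
sample = [
    {"l1_sentence": "ca text", "l2_sentence": "arn text", "l1": "ca", "l2": "oc_arn"},
    {"l1_sentence": "ca only", "l1": "ca", "l2": "oc_arn"},  # missing field, skipped
]
ca_to_arn = to_translation_pairs(sample)
arn_to_ca = to_translation_pairs(sample, src_lang="oc_arn", tgt_lang="ca")
```

With the Hugging Face `datasets` library installed and the repository reachable, the rows could presumably be obtained with `load_dataset("BSC-LT/Catalan-Aranese_Parallel_Corpus", split="train")` and passed to the same function.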

Provider: BSC-LT