SINAI/ALIA-parallel-translation

Name: SINAI/ALIA-parallel-translation
Creator: SINAI
Published: 2026-04-27 07:04:19
License: Not specified

Source: Hugging Face. Updated: 2026-04-27. Indexed: 2025-12-20.

Download link:

https://hf-mirror.com/datasets/SINAI/ALIA-parallel-translation

Description:

The ALIA Parallel Translation Corpus is an extensive collection of Spanish-English parallel texts spanning three specialized domains: Legal-Administrative, Biomedical, and Heritage. It contains 35,753,765 parallel segments totaling approximately 69.7 GB and was developed as part of the ALIA project's machine translation activity to improve Spanish-English translation quality through continual pre-training and domain adaptation of language models. The corpus prioritizes document-level and multi-paragraph translation contexts, moving beyond traditional sentence-level approaches. Each segment ID carries a prefix that identifies its domain: Biomedical (00-XX-XXXXXX), Heritage (01-XX-XXXXXX), and Legal-Administrative (02-XX-XXXXXX). The corpus was curated by the SINAI Research Group (Intelligent Systems for Information Access) at the Universidad de Jaén through the Center for Advanced Studies in Information and Communication Technologies (CEATIC), and funded by the Ministerio para la Transformación Digital y de la Función Pública under the EU NextGenerationEU framework.
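
The domain prefix scheme described above lends itself to a small lookup. The following is a minimal sketch, not part of the dataset's own tooling: the function name `domain_of`, the regular expression, and the sample IDs are all hypothetical, and the code assumes that "XX" stands for two alphanumeric characters and "XXXXXX" for six, which the description does not confirm.

```python
import re
from collections import Counter

# Prefix-to-domain mapping as stated in the corpus description.
DOMAIN_BY_PREFIX = {
    "00": "Biomedical",
    "01": "Heritage",
    "02": "Legal-Administrative",
}

# Assumption: "XX" is two alphanumeric characters and "XXXXXX" is six.
# The description gives only the placeholder pattern, not the exact format.
SEGMENT_ID = re.compile(r"(\d{2})-([0-9A-Za-z]{2})-([0-9A-Za-z]{6})")


def domain_of(segment_id: str) -> str:
    """Return the domain name for a segment ID, or raise ValueError."""
    match = SEGMENT_ID.fullmatch(segment_id)
    if match is None:
        raise ValueError(f"unrecognized segment ID format: {segment_id!r}")
    prefix = match.group(1)
    if prefix not in DOMAIN_BY_PREFIX:
        raise ValueError(f"unknown domain prefix: {prefix!r}")
    return DOMAIN_BY_PREFIX[prefix]


# Hypothetical IDs, made up for illustration only.
sample_ids = ["00-01-000042", "01-03-000007", "02-12-123456", "02-12-123457"]
domain_counts = Counter(domain_of(s) for s in sample_ids)
print(domain_counts)
```

A lookup of this kind could be used to split the corpus by domain before domain-specific fine-tuning, provided the actual ID format matches the assumed pattern.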

Provider:

SINAI
