BSC-LT/ALIA_mixed_authentic_synthetic_MT

Name: BSC-LT/ALIA_mixed_authentic_synthetic_MT
Creator: BSC-LT
Published: 2025-12-17 12:36:55
License: No description available

Source: Hugging Face. Updated: 2025-12-17. Indexed: 2025-12-20

Download link:

https://hf-mirror.com/datasets/BSC-LT/ALIA_mixed_authentic_synthetic_MT

Description:

The ALIA Mixed Authentic and Synthetic Multilingual Parallel Dataset is a large-scale multilingual parallel corpus covering English and Spanish paired with Arabic, Hindi, Chinese, Japanese, and Korean. Built by aggregating and carefully filtering multiple public sources, it provides sentence-level alignments for training Machine Translation systems. The Spanish–Hindi and Spanish–Chinese portions of the dataset include synthetic Spanish translations generated from English using SalamandraTA 7B Instruct. The dataset contains over 453 million sentences and has undergone rigorous filtering and normalization, including alignment filtering, language identification, text normalization, and deduplication. The dataset is aimed at promoting the development of Machine Translation between English/Spanish and multiple target languages, supporting research in multilingual NLP, and facilitating the development of translation systems for diverse language pairs.
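
Two of the cleaning steps named above, text normalization and deduplication, can be illustrated with a minimal sketch. This is not the pipeline used to build the dataset: the function names `normalize` and `clean_pairs` are hypothetical, the choice of NFKC normalization and exact-match deduplication is an assumption, and alignment filtering and language identification are left out because they typically depend on trained models.

```python
import unicodedata


def normalize(text: str) -> str:
    """Apply NFKC normalization and collapse runs of whitespace.

    Hypothetical helper: the dataset's actual normalization rules
    are not described in the listing.
    """
    return " ".join(unicodedata.normalize("NFKC", text).split())


def clean_pairs(pairs):
    """Normalize (source, target) sentence pairs and drop exact duplicates.

    Pairs with an empty side after normalization are discarded.
    The order of first occurrence is preserved.
    """
    seen = set()
    cleaned = []
    for src, tgt in pairs:
        key = (normalize(src), normalize(tgt))
        if not key[0] or not key[1]:
            continue  # skip pairs where one side is empty
        if key in seen:
            continue  # skip pairs already emitted
        seen.add(key)
        cleaned.append(key)
    return cleaned


if __name__ == "__main__":
    sample = [
        ("Hello  world", "Hola mundo"),
        ("Hello world", "Hola  mundo"),  # duplicate once whitespace is collapsed
        ("", "Texto sin origen"),        # empty source side
    ]
    for src, tgt in clean_pairs(sample):
        print(f"{src}\t{tgt}")
```

Deduplicating after normalization, rather than before, lets pairs that differ only in spacing or Unicode form collapse into a single entry.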

Provider:

BSC-LT
