HiTZ/ALIA_syntethic_MT
Hugging Face | Updated: 2025-12-18 | Indexed: 2025-12-20
Download link:
https://hf-mirror.com/datasets/HiTZ/ALIA_syntethic_MT
Resource description:
ALIA Synthetic MT is a parallel corpus derived from Berria news articles, comprising content published in 2025 and archived material from 2023. It provides synthetic English and Spanish translations of the Basque source text, generated with two Large Language Models: Qwen3-32B and LatxaQ. The data is distributed in JSONL format, and each entry has four fields: id (unique identifier), eu (original source text in Basque), en (synthetic English translation), and es (synthetic Spanish translation). The corpus contains 68,863 documents, averaging 9.50 paragraphs per document for Qwen3-32B and 7.79 for LatxaQ. It is intended for training and evaluating machine translation models, for synthetic data experiments, and for multilingual NLP research. It is released under the Creative Commons CC-BY-SA license.
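The JSONL layout described above can be read with the Python standard library alone. The following is a minimal sketch, not the maintainers' loader: the two sample records are invented for illustration and are not taken from the dataset, and the helper names `parse_records` and `translation_pairs` are hypothetical. Only the four field names (id, eu, en, es) come from the description.

```python
import json

# The four fields named in the dataset description.
REQUIRED_FIELDS = ("id", "eu", "en", "es")


def parse_records(lines):
    """Yield one dict per non-empty JSONL line, checking the four fields."""
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        missing = [name for name in REQUIRED_FIELDS if name not in record]
        if missing:
            raise ValueError(f"line {line_number}: missing fields {missing}")
        yield record


def translation_pairs(records, target):
    """Return (Basque source, translation) pairs; target is 'en' or 'es'."""
    return [(record["eu"], record[target]) for record in records]


# Invented sample lines (an assumption for illustration, not real corpus data).
SAMPLE = [
    '{"id": "doc-0", "eu": "Kaixo, mundua.", "en": "Hello, world.", "es": "Hola, mundo."}',
    "",
    '{"id": "doc-1", "eu": "Eskerrik asko.", "en": "Thank you very much.", "es": "Muchas gracias."}',
]

pairs = translation_pairs(parse_records(SAMPLE), "en")
```

To read a downloaded file instead of the in-memory sample, pass an open file handle, since iterating over it yields lines: `with open(path, encoding="utf-8") as handle: records = list(parse_records(handle))`, where `path` is wherever the file was saved locally.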
Provider:
HiTZ



