pythainlp/scb_mt_enth_2020

Name: pythainlp/scb_mt_enth_2020
Creator: pythainlp
Published: 2025-08-14 16:20:40
License: not specified

Source: Hugging Face
Updated: 2025-08-14
Indexed: 2025-09-13

Download link:

https://hf-mirror.com/datasets/pythainlp/scb_mt_enth_2020

Description:

scb-mt-en-th-2020: A Large English-Thai Parallel Corpus. The primary objective of our work is to build a large-scale English-Thai dataset for machine translation. We construct an English-Thai machine translation dataset with over 1 million segment pairs, curated from various sources, namely news, Wikipedia articles, SMS messages, task-based dialogs, web-crawled data, and government documents. The methodology for gathering data, building parallel texts, and removing noisy sentence pairs is presented in a reproducible manner. We train machine translation models based on this dataset. Our models' performance is comparable to that of the Google Translation API (as of May 2020) for Thai-English, and the models outperform Google when the Open Parallel Corpus (OPUS) is included in the training data for both Thai-English and English-Thai translation. The dataset, pre-trained models, and source code to reproduce our work are available for public use.
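
The description mentions removing noisy sentence pairs but gives no details here. As a rough illustration of what such a step can look like, the sketch below applies two simple heuristics: a script check and a character-length ratio check. It is not the authors' pipeline. The function names (`thai_ratio`, `keep_pair`), the thresholds, and the sample pairs are all assumptions made up for this example.

```python
# Illustrative noise filter for English-Thai pairs.
# NOT the method used to build scb-mt-en-th-2020; names and thresholds are hypothetical.

def thai_ratio(text: str) -> float:
    """Fraction of non-whitespace characters in the Thai Unicode block (U+0E00-U+0E7F)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum("\u0e00" <= c <= "\u0e7f" for c in chars) / len(chars)


def keep_pair(en: str, th: str,
              min_thai: float = 0.5,        # assumed threshold
              max_en_thai: float = 0.1,     # assumed threshold
              max_len_ratio: float = 3.0    # assumed threshold
              ) -> bool:
    """Return True if the pair passes the script and length-ratio checks."""
    en, th = en.strip(), th.strip()
    if not en or not th:
        return False  # one side is empty
    if thai_ratio(th) < min_thai:
        return False  # "Thai" side is mostly non-Thai (e.g. untranslated text)
    if thai_ratio(en) > max_en_thai:
        return False  # "English" side contains too much Thai script
    longer, shorter = max(len(en), len(th)), min(len(en), len(th))
    return longer / shorter <= max_len_ratio  # reject badly mismatched lengths


# Made-up sample pairs, for demonstration only.
pairs = [
    ("Hello", "สวัสดี"),
    ("Thank you very much", "Thank you very much"),
    ("OK", "ขอบคุณมากสำหรับความช่วยเหลือของคุณในวันนี้"),
    ("", "สวัสดี"),
]
filtered = [p for p in pairs if keep_pair(*p)]
```

Only the first sample pair survives: the second is untranslated, the third has a large length mismatch, and the fourth has an empty English side. Real corpus-cleaning pipelines typically combine heuristics like these with stronger signals such as sentence-embedding similarity.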

Provided by:

pythainlp
