English-Thai Code-switched Medical Translation Dataset
Source: arXiv; indexed 2025-09-30
Download link:
https://github.com/preceptorai-org/NLLB_CS_EM_NLP2024
Description:
This dataset contains 64,000 English-Thai code-switched medical translation pairs for training and 1,100 pairs reserved for testing. It is designed to preserve medical terminology: a masking method uses GPT-4 to identify medical keywords, and data augmentation and filtering are applied to improve data quality. The target task is machine translation.
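The description does not spell out how the masking step works. As a rough illustration only, the sketch below shows one common way such a step could look: medical keywords (assumed here to be supplied by an LLM such as GPT-4) are swapped for numbered placeholders before translation and restored afterwards, so they survive in English. The function names `mask_terms` and `unmask_terms`, the `<TERMn>` placeholder format, and the sample sentence are all hypothetical and are not taken from the dataset's repository.

```python
import re


def mask_terms(text, terms):
    """Replace each keyword with a numbered placeholder.

    `terms` stands in for the keyword list an LLM would return;
    this sketch does not call any model.
    """
    mapping = {}
    # Deduplicate, then handle longer terms first so a short term
    # cannot split a longer one that contains it.
    ordered = sorted(dict.fromkeys(terms), key=len, reverse=True)
    for i, term in enumerate(ordered):
        placeholder = f"<TERM{i}>"
        pattern = re.compile(re.escape(term), flags=re.IGNORECASE)
        if pattern.search(text):
            text = pattern.sub(placeholder, text)
            mapping[placeholder] = term
    return text, mapping


def unmask_terms(text, mapping):
    """Put the original keywords back in place of the placeholders."""
    for placeholder, term in mapping.items():
        text = text.replace(placeholder, term)
    return text


source = "The patient has hypertension and type 2 diabetes."
keywords = ["hypertension", "type 2 diabetes"]  # assumed LLM output

masked, mapping = mask_terms(source, keywords)

# Hand-written stand-in for a translation model's output on `masked`.
translated_masked = "ผู้ป่วยมี <TERM1> และ <TERM0>"
code_switched = unmask_terms(translated_masked, mapping)
```

In this sketch the restored keyword uses the spelling from the keyword list rather than the casing found in the source sentence; a real pipeline would need to decide which form to keep.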
Provided by:
In-house LLM-based application



