nhuvo/MedEV

Name: nhuvo/MedEV
Creator: nhuvo
Published: 2024-03-29 04:56:53
License: No description provided

Hugging Face · Updated 2024-03-29 · Indexed 2024-04-19

Download link:

https://hf-mirror.com/datasets/nhuvo/MedEV
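The link above points at a mirror of the Hugging Face Hub. A minimal sketch of loading the data programmatically follows. It assumes the `datasets` library is installed, that the repository loads with its default configuration, and that access has been granted if the repository is gated. The helper name `load_medev` and its `endpoint` parameter are hypothetical and not part of the dataset card.

```python
import os
from typing import Optional


def load_medev(repo_id: str = "nhuvo/MedEV", endpoint: Optional[str] = None):
    """Load MedEV with the Hugging Face `datasets` library (illustrative helper).

    Assumptions: `datasets` is installed, the repository loads with its
    default configuration, and the caller has access if it is gated.
    """
    if endpoint is not None:
        # HF_ENDPOINT redirects Hub traffic to a mirror; it has to be set
        # before huggingface_hub is imported for the first time.
        os.environ["HF_ENDPOINT"] = endpoint

    # Imported lazily so that the endpoint override above takes effect.
    from datasets import load_dataset

    return load_dataset(repo_id)
```

Calling `load_medev(endpoint="https://hf-mirror.com")` would request the data through the mirror, while `load_medev()` would use the default Hub endpoint. Printing the returned object shows whichever splits and column names the repository actually exposes; they are not assumed here.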


Description:

---
task_categories:
- translation
language:
- en
- vi
---

# MedEV Dataset

## Introduction

The MedEV dataset marks a notable advancement in machine translation, focusing on the Vietnamese-English pair in the medical field. Its purpose is to address the lack of high-quality Vietnamese-English parallel data by offering around 360K sentence pairs. The dataset is designed to support the advancement of machine translation in the medical domain, serving as a valuable tool to improve the precision and trustworthiness of medical translations between Vietnamese and English.

## Dataset Overview

- **Domain:** Medical
- **Language Pair:** Vietnamese-English
- **Size:** ~360,000 sentence pairs
- **Objective:** To support the development of machine translation models specifically tuned for the medical domain.
- **Training set:** 340,897 sentence pairs
- **Validation set:** 8,982 sentence pairs
- **Test set:** 9,006 sentence pairs

## Accessing the Dataset

The MedEV dataset is available for research purposes. [Please refer to our paper](https://arxiv.org/abs/2403.19161) for detailed information on the dataset construction, experimental setup, and analysis of results.

## Ethical Statement

Data are collected from publicly available websites, such as journals and universities, and also from www.msd.com. The content extracted from these sources cannot be used for public or commercial purposes. The content also contains no private data about patients.

## Citing MedEV

If you find the MedEV dataset useful in your research, please consider citing our paper:

```
@inproceedings{medev,
    title     = {{Improving Vietnamese-English Medical Machine Translation}},
    author    = {Nhu Vo and Dat Quoc Nguyen and Dung D. Le and Massimo Piccardi and Wray Buntine},
    booktitle = {Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING)},
    year      = {2024}
}
```

Provided by:

nhuvo

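Summary of original information

MedEV Dataset Overview

Basic dataset information

Task category: Translation
Languages: English (en), Vietnamese (vi)

Dataset details

Domain: Medical
Language pair: Vietnamese-English
Size: approximately 360,000 sentence pairs
Objective: Support the development of machine translation models for the medical domain
Training set: 340,897 sentence pairs
Validation set: 8,982 sentence pairs
Test set: 9,006 sentence pairs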
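
The three split sizes can be checked against the rounded "~360,000" figure with a few lines of arithmetic. The sketch below uses only the counts stated in the dataset card; the keys `train`, `validation`, and `test` are labels chosen here for illustration and are not confirmed names of the splits in the repository.

```python
# Split sizes as stated in the dataset card.
SPLITS = {"train": 340_897, "validation": 8_982, "test": 9_006}

# Exact total, which the card rounds to about 360,000.
total = sum(SPLITS.values())

# Fraction of all sentence pairs that falls into each split.
shares = {name: size / total for name, size in SPLITS.items()}

print(f"total sentence pairs: {total}")  # 358885
for name, size in SPLITS.items():
    print(f"{name:<10} {size:>7}  {shares[name]:.2%}")
```

The exact total is 358,885 sentence pairs, with roughly 95% of them in the training set and the remainder divided almost evenly between validation and test.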

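Dataset purpose

By providing high-quality Vietnamese-English parallel data, the dataset aims to improve the precision and trustworthiness of machine translation in the medical domain.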
