nhuvo/MedEV
Source: Hugging Face. Updated 2024-03-29; indexed 2024-04-19.
Download link:
https://hf-mirror.com/datasets/nhuvo/MedEV
Resource description:
---
task_categories:
- translation
language:
- en
- vi
---
# MedEV Dataset
## Introduction
The MedEV dataset marks a notable advancement in machine translation, focusing on the Vietnamese-English pair in the medical field. It addresses the lack of high-quality Vietnamese-English parallel data by offering around 360K sentence pairs. The dataset is designed to support the advancement of machine translation in the medical domain and to help improve the accuracy and reliability of medical translations between Vietnamese and English.
## Dataset Overview
- **Domain:** Medical
- **Language Pair:** Vietnamese-English
- **Size:** ~360,000 sentence pairs
- **Objective:** To support the development of machine translation models specifically tuned for the medical domain.
- **Training set:** 340,897 sentence pairs
- **Validation set:** 8,982 sentence pairs
- **Test set:** 9,006 sentence pairs
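
The split sizes listed above sum to 358,885 sentence pairs, which is where the rounded "~360K" figure comes from. As a quick sanity check, the following minimal sketch recomputes the total and each split's share (roughly 95% / 2.5% / 2.5%). The names `SPLITS` and `split_summary` are illustrative and not part of the dataset.

```python
# Split sizes as stated in the dataset overview.
SPLITS = {"train": 340_897, "validation": 8_982, "test": 9_006}


def split_summary(splits):
    """Return the total pair count and each split's share of that total."""
    total = sum(splits.values())
    shares = {name: count / total for name, count in splits.items()}
    return total, shares


total, shares = split_summary(SPLITS)
print(f"total: {total:,} sentence pairs")
for name, share in shares.items():
    print(f"{name}: {share:.2%}")
```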
## Accessing the Dataset
The MedEV dataset is available for research purposes. [Please refer to our paper](https://arxiv.org/abs/2403.19161) for detailed information on the dataset construction, experimental setup, and analysis of results.
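
One possible way to fetch the data is through the Hugging Face `datasets` library, sketched below. This is an assumption rather than a documented procedure: it presumes the repository's files are in a layout that `load_dataset` can detect automatically, and it makes no claim about the split or column names, which should be inspected after loading. The helper name `load_medev` is hypothetical.

```python
REPO_ID = "nhuvo/MedEV"  # repository ID taken from the dataset URL


def load_medev(repo_id=REPO_ID):
    """Load MedEV from the Hugging Face Hub.

    Assumption: the repository layout is one that `load_dataset`
    can auto-detect; this is not confirmed by the dataset card.
    """
    # Imported lazily so the module can be imported without the
    # dependency installed (requires `pip install datasets`).
    from datasets import load_dataset

    return load_dataset(repo_id)
```

Calling `load_medev()` requires network access; printing the returned object shows which splits and columns are actually available.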
## Ethical Statement
Data are collected from publicly available websites, such as those of journals and universities, as well as from www.msd.com. The content extracted from these sources cannot be used for public or commercial purposes. The content contains no private data about patients.
## Citing MedEV
If you find the MedEV dataset useful in your research, please consider citing our paper:
```
@inproceedings{medev,
title = {{Improving Vietnamese-English Medical Machine Translation}},
author = {Nhu Vo and Dat Quoc Nguyen and Dung D. Le and Massimo Piccardi and Wray Buntine},
booktitle = {Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING)},
year = {2024}
}
```
Provider:
nhuvo
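Summary of original information
MedEV Dataset overview
Basic dataset information
- Task category: translation
- Languages: English (en), Vietnamese (vi)
Dataset details
- Domain: medical
- Language pair: Vietnamese-English
- Size: about 360,000 sentence pairs
- Objective: support the development of machine translation models for the medical domain
- Training set: 340,897 sentence pairs
- Validation set: 8,982 sentence pairs
- Test set: 9,006 sentence pairs
Dataset purpose
The dataset aims to improve the accuracy and reliability of machine translation in the medical domain by providing high-quality Vietnamese-English parallel data.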



