SirEthanK/en-ru-health-only-dataset
Source: Hugging Face. Updated 2025-12-10; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/SirEthanK/en-ru-health-only-dataset
Description:
---
task_categories:
- translation
language:
- en
- ru
tags:
- medical
- health
license: cc-by-nc-sa-4.0
size_categories:
- 1K<n<10K
---
# English–Russian Medical Parallel Dataset (Health-Only)
This dataset contains English–Russian sentence pairs drawn exclusively from
medical-domain corpora. It is intended for training, evaluating, and
fine-tuning neural machine translation (NMT) models on health-related
terminology and sentences.
The data includes cleaned and aligned English–Russian parallel sentences sourced
from WikiHealth and TICO-19, both widely used in medical-domain machine
translation research.
---
## Dataset Structure
The repository includes three CSV files:
- `train.csv`
- `val.csv`
- `test.csv`
## Split Sizes
| Split | Rows |
|-------|------|
| Train | 6,244 |
| Validation | 400 |
| Test | 400 |
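
As a quick sanity check on the table above, the short sketch below totals the rows and computes each split's share of the whole. The row counts are copied from the table; the variable names are illustrative only.

```python
# Row counts copied from the split table above.
split_rows = {"train": 6244, "validation": 400, "test": 400}

total_rows = sum(split_rows.values())
split_share = {name: rows / total_rows for name, rows in split_rows.items()}

for name, rows in split_rows.items():
    print(f"{name:<10} {rows:>5} rows  {split_share[name]:.1%}")
print(f"{'total':<10} {total_rows:>5} rows")
```

The total falls inside the `1K<n<10K` size category declared in the metadata, and the validation and test splits are the same size.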
## Data Sources
This health-only dataset is composed of samples from:
### **1. WikiHealth (Health Instruction Dataset)**
A multilingual biomedical dataset consisting of health-related instructions and
explanations aligned across languages.
> Huang, J., et al. **“WikiHowToImprove: Instructional Text for Healthcare Applications.”**
> *NeurIPS 2021 Datasets and Benchmarks Track.*
### **2. TICO-19 (Translation Initiative for Covid-19)**
A multilingual parallel corpus developed during the COVID-19 pandemic to support
machine translation for public health and crisis communication.
> Anastasopoulos, A., et al. **“TICO-19: The Translation Initiative for Covid-19.”**
> *arXiv preprint arXiv:2007.01788, 2020.*
Both corpora were cleaned, stripped of markup, and normalized prior to
construction of the final train–val–test splits.
## Loading the Dataset
You can load the dataset using the Hugging Face `datasets` library:
```python
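from datasets import load_dataset

# Download all three CSV splits from the Hugging Face Hub.
ds = load_dataset("SirEthanK/en-ru-health-only-dataset")

# val.csv is exposed under the split name "validation".
train_ds = ds["train"]
val_ds = ds["validation"]
test_ds = ds["test"]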
```
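
If you have downloaded the CSV files and would rather read them without the `datasets` library, the following is a minimal sketch using only the Python standard library. The column names are not documented in this card, so the hypothetical helper `read_rows` returns the header it finds instead of assuming particular names.

```python
import csv


def read_rows(path):
    """Return (column_names, rows) for one split file such as train.csv."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        rows = list(reader)  # each row is a dict keyed by column name
        return reader.fieldnames, rows
```

Calling `read_rows("train.csv")` on a local copy lets you inspect the header before deciding which column holds the English text and which the Russian text.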
## Preprocessing
The data preprocessing procedure consisted of:
- dropping NaNs, stripping whitespace, and collapsing internal spaces
- normalizing Unicode data (NFC format)
- converting text to lowercase
- standardizing punctuation
- filtering by script
- de-duplication
- filtering by length and length ratio
- stripping HTML and dropping code-like lines
- heuristically removing garbage or corrupted rows
- manually correcting translations that were nearly correct
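
The steps above can be sketched roughly as follows. This is an illustrative approximation, not the script used to build the dataset: the function names, the regular expressions, and the thresholds (`min_chars`, `max_chars`, `max_ratio`) are all assumptions. `None` values stand in for NaNs, and punctuation standardization, code-like line removal, and the manual fixes are omitted.

```python
import re
import unicodedata

CYRILLIC = re.compile(r"[\u0400-\u04FF]")
LATIN = re.compile(r"[A-Za-z]")
HTML_TAG = re.compile(r"<[^>]+>")


def clean_text(text):
    """NFC-normalize, strip HTML tags, lowercase, and collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = HTML_TAG.sub(" ", text)
    text = text.lower()
    return " ".join(text.split())


def keep_pair(en, ru, min_chars=3, max_chars=500, max_ratio=3.0):
    """Apply length, length-ratio, and script filters (thresholds are assumed)."""
    if not en or not ru:
        return False
    if not (min_chars <= len(en) <= max_chars and min_chars <= len(ru) <= max_chars):
        return False
    if max(len(en), len(ru)) / min(len(en), len(ru)) > max_ratio:
        return False
    # Script filter: the English side needs Latin letters, the Russian side Cyrillic.
    return bool(LATIN.search(en)) and bool(CYRILLIC.search(ru))


def preprocess(pairs):
    """Clean, filter, and de-duplicate (en, ru) pairs, preserving input order."""
    seen = set()
    result = []
    for en, ru in pairs:
        if en is None or ru is None:
            continue  # stand-in for dropping NaNs
        en, ru = clean_text(en), clean_text(ru)
        if (en, ru) in seen or not keep_pair(en, ru):
            continue
        seen.add((en, ru))
        result.append((en, ru))
    return result
```

De-duplication runs after cleaning so that pairs differing only in case or spacing collapse into one entry.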
## License
This dataset is released under the **CC BY-NC-SA 4.0** license.
It contains processed and cleaned subsets of:
- **WikiHealth**, derived from WikiHow data licensed under CC BY-NC-SA 3.0
- **TICO-19**, licensed under CC BY 4.0
Because the dataset incorporates WikiHow-derived material, the resulting dataset
must be distributed under a compatible non-commercial, share-alike license.
Provider:
SirEthanK



