SirEthanK/en-ru-health-only-dataset

Name: SirEthanK/en-ru-health-only-dataset
Creator: SirEthanK
Published: 2025-12-10 03:01:47
License: CC BY-NC-SA 4.0 (per the dataset card below)

Source: Hugging Face. Updated 2025-12-10; indexed 2025-12-20.

Download link:

https://hf-mirror.com/datasets/SirEthanK/en-ru-health-only-dataset

Description:

---
task_categories:
- translation
language:
- en
- ru
tags:
- medical
- health
license: cc-by-nc-sa-4.0
size_categories:
- 1K<n<10K
---

# English–Russian Medical Parallel Dataset (Health-Only)

This dataset contains English–Russian sentence pairs drawn exclusively from medical-domain corpora. It is intended for training, evaluating, and fine-tuning neural machine translation (NMT) models on health-related terminology and sentences.

The data includes cleaned and aligned English–Russian parallel sentences sourced from WikiHealth and TICO-19, both widely used in medical-domain machine translation research.

---

## Dataset Structure

The repository includes three CSV files:

- `train.csv`
- `val.csv`
- `test.csv`

## Split Sizes

| Split | Rows |
|-------|------|
| Train | 6,244 |
| Validation | 400 |
| Test | 400 |

## Data Sources

This health-only dataset is composed of samples from:

### **1. WikiHealth (Health Instruction Dataset)**

A multilingual biomedical dataset consisting of health-related instructions and explanations aligned across languages.

> Huang, J., et al. **“WikiHowToImprove: Instructional Text for Healthcare Applications.”**
> *NeurIPS 2021 Datasets and Benchmarks Track.*

### **2. TICO-19 (Translation Initiative for Covid-19)**

A multilingual parallel corpus developed during the COVID-19 pandemic to support machine translation for public health and crisis communication.

> Anastasopoulos, A., et al. **“TICO-19: The Translation Initiative for Covid-19.”**
> *arXiv preprint arXiv:2007.01788, 2020.*

Both corpora were cleaned, stripped of markup, and normalized prior to construction of the final train–val–test splits.

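
As a quick sanity check on the split sizes listed above, the following minimal sketch computes each split's share of the total and compares a set of observed row counts against the documented ones. The names `EXPECTED_ROWS`, `split_fractions`, and `find_mismatches` are hypothetical helpers introduced for illustration; they are not part of the dataset or of any library.

```python
# Row counts copied from the Split Sizes table in the dataset card.
EXPECTED_ROWS = {"train": 6244, "validation": 400, "test": 400}


def split_fractions(rows):
    """Return each split's share of the total number of rows."""
    total = sum(rows.values())
    return {name: count / total for name, count in rows.items()}


def find_mismatches(actual_rows, expected_rows=EXPECTED_ROWS):
    """Return {split: (expected, actual)} for every split whose count differs."""
    return {
        name: (expected, actual_rows.get(name))
        for name, expected in expected_rows.items()
        if actual_rows.get(name) != expected
    }


if __name__ == "__main__":
    print("total rows:", sum(EXPECTED_ROWS.values()))
    for name, fraction in split_fractions(EXPECTED_ROWS).items():
        print(f"{name}: {fraction:.1%}")
```

The documented counts sum to 7,044 rows, which is consistent with the `1K<n<10K` size category in the card metadata. To use `find_mismatches`, pass it a dictionary of row counts measured after loading, keyed by the same split names.
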
## Loading the Dataset

You can load the dataset using the Hugging Face `datasets` library:

```python
from datasets import load_dataset

# Downloads the repository and returns a DatasetDict with one entry per split.
ds = load_dataset("SirEthanK/en-ru-health-only-dataset")

train_ds = ds["train"]
val_ds = ds["validation"]
test_ds = ds["test"]
```

## Preprocessing

The data preprocessing procedure consisted of:

- dropping NaNs, stripping whitespace, and collapsing internal spaces
- normalizing Unicode data (NFC format)
- converting text to lowercase
- standardizing punctuation
- filtering by script
- de-duplication
- filtering by length and length ratio
- stripping HTML and dropping code-like lines
- applying heuristic filters to remove garbage or corrupted rows
- manually correcting translations that were almost correct

## License

This dataset is released under the **CC BY-NC-SA 4.0** license.

It contains processed and cleaned subsets of:

- **WikiHealth**, derived from WikiHow data licensed under CC BY-NC-SA 3.0
- **TICO-19**, licensed under CC BY 4.0

Because the dataset incorporates WikiHow-derived material, the resulting dataset must be distributed under a compatible non-commercial, share-alike license.
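
The Preprocessing section lists the cleaning steps but not how they were implemented. The sketch below shows one way a few of those steps could look in plain Python: dropping missing values, collapsing whitespace, NFC normalization, lowercasing, script filtering, length and length-ratio filtering, and de-duplication. It is not the author's pipeline. The function names and every threshold (`min_chars`, `max_chars`, `max_ratio`) are assumptions chosen for illustration, and punctuation standardization, HTML stripping, and the manual fixes are left out.

```python
import re
import unicodedata

# Simple script detectors: any Cyrillic letter, any basic Latin letter.
CYRILLIC = re.compile(r"[\u0400-\u04FF]")
LATIN = re.compile(r"[A-Za-z]")


def normalize_text(text: str) -> str:
    """NFC-normalize, strip and collapse whitespace, then lowercase."""
    text = unicodedata.normalize("NFC", text)
    text = re.sub(r"\s+", " ", text.strip())
    return text.lower()


def keep_pair(en: str, ru: str, min_chars: int = 3, max_chars: int = 500,
              max_ratio: float = 3.0) -> bool:
    """Apply length, length-ratio, and script filters (thresholds are assumed)."""
    shorter, longer = sorted((len(en), len(ru)))
    if shorter == 0 or shorter < min_chars or longer > max_chars:
        return False
    if longer / shorter > max_ratio:
        return False
    # English side: Latin letters and no Cyrillic. Russian side: some Cyrillic.
    if CYRILLIC.search(en) or not LATIN.search(en):
        return False
    return bool(CYRILLIC.search(ru))


def clean_pairs(pairs):
    """Return normalized, filtered, de-duplicated (en, ru) pairs in input order."""
    seen = set()
    cleaned = []
    for en, ru in pairs:
        # Skip missing values (None, or float NaN as produced by pandas).
        if not isinstance(en, str) or not isinstance(ru, str):
            continue
        pair = (normalize_text(en), normalize_text(ru))
        if pair in seen or not keep_pair(*pair):
            continue
        seen.add(pair)
        cleaned.append(pair)
    return cleaned
```

Because de-duplication runs after normalization, pairs that differ only in case or spacing collapse into one entry. A stricter script filter could also reject Latin characters on the Russian side, but medical text often keeps Latin drug names and abbreviations, so this sketch only requires that some Cyrillic is present.
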

Provider:

SirEthanK
