Senju2/context-aware-arabic-to-english-model-with-register
Source: Hugging Face. Updated 2026-04-07; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/Senju2/context-aware-arabic-to-english-model-with-register
Resource description:
---
license: mit
language:
- ar
- en
tags:
- translation
- arabic-dialects
- nlp
pretty_name: Context-Aware Arabic Dialect Translation Dataset
---
# Context-Aware Arabic Dialect Translation Dataset
This repository contains the dataset and code for the paper **"Context-Aware Dialectal Arabic Machine Translation with Interactive Region and Register Selection"** (Anonymous Submission).
## Contents
- **`context_aware_en_ar_v2.ipynb`**: The main Google Colab notebook used for training and evaluation.
- **`balanced_dataset_ready.csv`**: The full augmented dataset (57,600 sentence pairs) produced by our RBDA pipeline.
- **`train_dataset.csv`**: The strict training split (95%).
- **`test_dataset.csv`**: The unseen test split (5%) used for the results reported in the paper.
- **`code/`**: Directory containing the training scripts and augmentation logic used to reproduce our results.
  - `train_model_optimized.py`: The main training loop for fine-tuning mT5.
  - `build_dataset.py`: The RBDA pipeline code.
  - `requirements.txt`: Python dependencies.
## Dataset Structure
The columns in the CSV files are:
- `input`: The source English text with control tags (e.g., `[Egyptian] [Medical] I have a headache`).
- `target`: The target Arabic translation in the specific dialect.
- `region`: The dialect label (Egyptian, Levantine, Gulf, etc.).
- `context`: The social context (Medical, Travel, etc.).
- `style`: The register (Formal/Informal).
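
To illustrate the control-tag format of the `input` column, here is a small sketch that splits leading `[Tag]` markers from the sentence and rebuilds the string. The functions `split_control_tags` and `build_input` are hypothetical helpers, not code from the repository, and the sketch assumes tags appear only as a bracketed prefix, as in the example `[Egyptian] [Medical] I have a headache`. Whether the `style` value is also encoded as a tag is not stated here, so the parser accepts any number of leading tags.

```python
import re

# Matches one leading "[Tag]" followed by optional whitespace.
_TAG = re.compile(r"\[([^\]]+)\]\s*")


def split_control_tags(text):
    """Return (tags, sentence) for a string such as '[Egyptian] [Medical] ...'."""
    tags = []
    pos = 0
    while True:
        match = _TAG.match(text, pos)
        if match is None:
            break
        tags.append(match.group(1))
        pos = match.end()
    return tags, text[pos:]


def build_input(sentence, *tags):
    """Prefix a sentence with bracketed control tags, separated by spaces."""
    return " ".join([*(f"[{tag}]" for tag in tags), sentence])


tags, sentence = split_control_tags("[Egyptian] [Medical] I have a headache")
print(tags)      # ['Egyptian', 'Medical']
print(sentence)  # I have a headache
```

Parsing the prefix generically, rather than expecting exactly two tags, keeps the sketch usable if some rows carry an additional register tag.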
Provider:
Senju2



