Senju2/context-aware-arabic-to-english-model-with-register

Name: Senju2/context-aware-arabic-to-english-model-with-register
Creator: Senju2
Published: 2026-04-07 15:37:40
License: MIT

Hugging Face | Updated 2026-04-07 | Indexed 2026-04-12

Download link:

https://hf-mirror.com/datasets/Senju2/context-aware-arabic-to-english-model-with-register

Resource description:

---
license: mit
language:
- ar
- en
tags:
- translation
- arabic-dialects
- nlp
pretty_name: Context-Aware Arabic Dialect Translation Dataset
---

# Context-Aware Arabic Dialect Translation Dataset

This repository contains the dataset and code for the paper **"Context-Aware Dialectal Arabic Machine Translation with Interactive Region and Register Selection"** (Anonymous Submission).

## Contents

- **`context_aware_en_ar_v2.ipynb`**: The main Google Colab notebook used for training and evaluation.
- **`balanced_dataset_ready.csv`**: The full augmented dataset (57,600 sentence pairs) produced by our RBDA pipeline.
- **`train_dataset.csv`**: The strict training split (95%).
- **`test_dataset.csv`**: The unseen test split (5%) used for the results reported in the paper.
- **`code/`**: Directory containing the training scripts and augmentation logic used to reproduce our results.
  - `train_model_optimized.py`: The main training loop for fine-tuning mT5.
  - `build_dataset.py`: The RBDA pipeline code.
  - `requirements.txt`: Python dependencies.

## Dataset Structure

The columns in the CSV files are:

- `input`: The source English text with control tags (e.g., `[Egyptian] [Medical] I have a headache`).
- `target`: The target Arabic translation in the specific dialect.
- `region`: The dialect label (Egyptian, Levantine, Gulf, etc.).
- `context`: The social context (Medical, Travel, etc.).
- `style`: The register (Formal/Informal).
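
As an illustration of the column layout above, the following minimal sketch separates the control tags from the English sentence in an `input` value and reads a CSV row by row. It is not taken from the repository's `code/` directory. The helper names `split_control_tags` and `iter_rows` are hypothetical, and the sketch assumes that tags always appear as leading bracketed tokens (as in the `[Egyptian] [Medical] I have a headache` example) and that the CSV files are UTF-8 encoded with a header row.

```python
import csv
import re
from typing import Iterator, List, Tuple

# Matches one leading control tag such as "[Egyptian]" plus surrounding whitespace.
# Assumption: tags never contain nested brackets.
_TAG = re.compile(r"\s*\[([^\[\]]+)\]\s*")


def split_control_tags(text: str) -> Tuple[List[str], str]:
    """Split leading [Tag] markers from the sentence that follows them."""
    tags: List[str] = []
    pos = 0
    while True:
        match = _TAG.match(text, pos)
        if match is None:
            break
        tags.append(match.group(1).strip())
        pos = match.end()
    return tags, text[pos:].strip()


def iter_rows(path: str) -> Iterator[dict]:
    """Yield each CSV row with the parsed tags and bare sentence added.

    Assumption: the file has a header row containing an `input` column.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            tags, sentence = split_control_tags(row["input"])
            yield {**row, "tags": tags, "sentence": sentence}


example_tags, example_sentence = split_control_tags(
    "[Egyptian] [Medical] I have a headache"
)
print(example_tags, example_sentence)
```

Calling `iter_rows("train_dataset.csv")` would then yield dictionaries that keep the original columns and add `tags` and `sentence`. Whether the `style` value also appears as a tag inside `input` is not stated in the description, so the parser collects however many leading tags it finds rather than expecting a fixed count.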

Provider:

Senju2
