
Senju2/context-aware-arabic-to-english-model-with-register

Source: Hugging Face. Updated 2026-04-07. Indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/Senju2/context-aware-arabic-to-english-model-with-register
Resource description:
---
license: mit
language:
- ar
- en
tags:
- translation
- arabic-dialects
- nlp
pretty_name: Context-Aware Arabic Dialect Translation Dataset
---

# Context-Aware Arabic Dialect Translation Dataset

This repository contains the dataset and code for the paper **"Context-Aware Dialectal Arabic Machine Translation with Interactive Region and Register Selection"** (Anonymous Submission).

## Contents

- **`context_aware_en_ar_v2.ipynb`**: The main Google Colab notebook used for training and evaluation.
- **`balanced_dataset_ready.csv`**: The full augmented dataset (57,600 sentence pairs) produced by our RBDA pipeline.
- **`train_dataset.csv`**: The strict training split (95%).
- **`test_dataset.csv`**: The unseen test split (5%) used for the results reported in the paper.
- **`code/`**: Directory containing the training scripts and augmentation logic used to reproduce our results.
  - `train_model_optimized.py`: The main training loop for fine-tuning mT5.
  - `build_dataset.py`: The RBDA pipeline code.
  - `requirements.txt`: Python dependencies.

## Dataset Structure

The columns in the CSV files are:

- `input`: The source English text with control tags (e.g., `[Egyptian] [Medical] I have a headache`).
- `target`: The target Arabic translation in the specific dialect.
- `region`: The dialect label (Egyptian, Levantine, Gulf, etc.).
- `context`: The social context (Medical, Travel, etc.).
- `style`: The register (Formal/Informal).
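
The control-tag format of the `input` column can be illustrated with a minimal sketch. It assumes the tags are a run of bracketed groups at the start of the string, as in the example `[Egyptian] [Medical] I have a headache`; whether a third tag for `style` is also present, and in what order, is not stated in the README. The helper names `split_control_tags` and `add_control_tags` are hypothetical and not part of the repository's code.

```python
import re

# Assumption: control tags are a run of "[...]" groups at the start of `input`,
# as in the README example "[Egyptian] [Medical] I have a headache".
_TAG_PREFIX = re.compile(r"^\s*((?:\[[^\[\]]+\]\s*)+)")
_TAG = re.compile(r"\[([^\[\]]+)\]")


def split_control_tags(text):
    """Return (tags, sentence) for a tagged `input` string."""
    match = _TAG_PREFIX.match(text)
    if match is None:
        # No leading tags: treat the whole string as the sentence.
        return [], text.strip()
    tags = [t.strip() for t in _TAG.findall(match.group(1))]
    return tags, text[match.end():].strip()


def add_control_tags(sentence, *tags):
    """Hypothetical inverse: prepend tags in the same bracketed form."""
    return " ".join([f"[{t}]" for t in tags] + [sentence])


tags, sentence = split_control_tags("[Egyptian] [Medical] I have a headache")
print(tags)      # ['Egyptian', 'Medical']
print(sentence)  # I have a headache
```

Keeping the parser independent of the number of tags means it should still work if the actual rows carry an additional register tag, though that would need to be checked against the CSV files themselves.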
Provider:
Senju2