arz-en-parallel-corpus
Egyptian Arabic-English Parallel Corpus 🇪🇬✨🇬🇧
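Dataset Description
- Name: Egyptian Arabic-English Parallel Corpus
- Languages: Egyptian Arabic (arz), English (en)
- Use cases: NLP tasks such as machine translation and speech recognition
- Size: ~27,000 parallel sentence pairs
- License: MIT License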
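Dataset Structure
- Features:
  - arz: Egyptian Arabic sentence
  - en: English sentence
- Splits:

| Split | Examples | Bytes |
|-------|----------|-------|
| Train | 25,000 | 3,686,265 |
| Test | 1,851 | 275,240 |
| Total | 26,851 | 3,961,505 |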
Data Sources
- ArzEn-MultiGenre Dataset by Hesham Haroon
- Egyptian_English_parallel Dataset by Hesham Haroon
- ArzEn_MultiGenre_subtitles Dataset by arbml
- ArzEn-ST Corpus from ArzEn Corpus Resources
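Preprocessing Steps
- Removed lines containing only digits or meaningless tokens
- Removed special tags (e.g. [HES], [LAUGHTER]) and parenthesized content
- Filtered out missing or empty translations
- Removed duplicate examples
- Randomly shuffled the dataset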
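The steps above could be sketched roughly as follows. This is a minimal illustration, not the author's actual pipeline: the regular expressions, the helper names (`clean_text`, `is_meaningless`, `preprocess`), the fixed shuffle seed, and the order in which the steps run are all assumptions, and "meaningless tokens" is approximated here as strings made up only of digits, whitespace, and basic punctuation.

```python
import random
import re

# Assumed patterns; the exact rules used to build the corpus are not documented.
TAG_RE = re.compile(r"\[[^\]]*\]")               # bracketed tags such as [HES], [LAUGHTER]
PAREN_RE = re.compile(r"\([^)]*\)")              # parenthesized content
DIGITS_ONLY_RE = re.compile(r"[\d\s.,:\-]+")     # digits and basic punctuation only


def clean_text(text):
    """Strip bracketed tags and parenthesized content, then collapse whitespace."""
    text = TAG_RE.sub(" ", text)
    text = PAREN_RE.sub(" ", text)
    return " ".join(text.split())


def is_meaningless(text):
    """True for empty strings and strings that contain only digits/punctuation."""
    return not text or DIGITS_ONLY_RE.fullmatch(text) is not None


def preprocess(pairs, seed=0):
    """Clean, filter, deduplicate, and shuffle a list of {'arz', 'en'} dicts."""
    seen = set()
    cleaned = []
    for pair in pairs:
        # Missing fields or None values are treated as empty strings.
        arz = clean_text(pair.get("arz") or "")
        en = clean_text(pair.get("en") or "")
        if is_meaningless(arz) or is_meaningless(en):
            continue  # drop empty, missing, or digit-only rows
        key = (arz, en)
        if key in seen:
            continue  # drop exact duplicates (after cleaning)
        seen.add(key)
        cleaned.append({"arz": arz, "en": en})
    random.Random(seed).shuffle(cleaned)  # seeded shuffle for reproducibility
    return cleaned
```

Note that this sketch deduplicates after cleaning, so two rows that differ only by a removed tag collapse into one; whether the original pipeline behaves the same way is not stated.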
Usage Example
```python
from datasets import load_dataset

dataset = load_dataset("IbrahimAmin/arz-en-parallel-corpus")

# Split names are string keys; print the first training example
print(dataset["train"][0])
```
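For machine translation work, each record's two fields can be turned into (source, target) pairs. The helper below is a small sketch: `to_translation_pairs` is a hypothetical name, and the sample records are made-up stand-ins with the same `arz` and `en` fields, not rows taken from the corpus.

```python
def to_translation_pairs(records, source="arz", target="en"):
    """Yield (source_text, target_text) tuples from records with 'arz' and 'en' fields."""
    for record in records:
        yield record[source], record[target]


# Made-up stand-in records; with the real dataset, pass dataset["train"] instead.
sample = [
    {"arz": "إزيك؟", "en": "How are you?"},
    {"arz": "شكرا", "en": "Thanks"},
]

for src, tgt in to_translation_pairs(sample):
    print(f"{src}\t{tgt}")
```

Swapping the `source` and `target` arguments gives English-to-Egyptian-Arabic pairs from the same records.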
Citation
```bibtex
@inproceedings{hamed-etal-2022-arzen,
  title     = {ArzEn-ST: A Three-way Speech Translation Corpus for Code-Switched Egyptian Arabic-English},
  author    = {Hamed, Injy and Habash, Nizar and Abdennadher, Slim and Vu, Ngoc Thang},
  booktitle = {Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP)},
  pages     = {119--130},
  year      = {2022},
  publisher = {Association for Computational Linguistics}
}
```
```bibtex
@article{al-sabbagh-2024-arzen-multigenre,
  title     = {ArzEn-MultiGenre: An aligned parallel dataset of Egyptian Arabic song lyrics, novels, and subtitles, with English translations},
  author    = {Al-Sabbagh, Rania},
  journal   = {Data in Brief},
  volume    = {54},
  pages     = {110271},
  year      = {2024},
  publisher = {Elsevier}
}
```
```bibtex
@misc{amin2025arzenparallel,
  author = {Amin, Ibrahim},
  title  = {Egyptian Arabic - English Parallel Corpus},
  year   = {2025},
  url    = {https://huggingface.co/datasets/IbrahimAmin/arz-en-parallel-corpus},
  note   = {MIT License. Curated and cleaned from multiple public datasets.}
}
```




