Tarjama-25

ModelScope community · Updated 2025-12-05 · Indexed 2025-07-12
Download link:
https://modelscope.cn/datasets/Misraj/Tarjama-25
Description:
# Tarjama-25

**Tarjama-25** is a high-quality, *bidirectional* Arabic ↔ English machine-translation benchmark built to stress-test modern MT models. Unlike most English-centric evaluation sets, Tarjama-25 contains long, domain-balanced sentences originally written **half in Arabic and half in English**, then professionally translated and verified in both directions.

| # sentence pairs | Avg. tokens / sentence | Domains |
|------------------|------------------------|---------|
| **5000** | 50–100 (≈ 75) | Scientific · Technical · Healthcare · Cultural · General |

> The full pipeline (collection → MT pre-translation → human correction → expert validation) ensures that every example is **clean, contextually correct, and free from web-scale pre-training contamination**.

## Why another benchmark?

Modern LLM-based MT systems can handle 4K-token contexts, yet most public test sets still top out at a few dozen words and rarely flip the language direction. Tarjama-25 fills these gaps:

* **Bidirectional**: equal coverage of Arabic→English *and* English→Arabic.
* **Long contexts**: 50–100-word sentences push models beyond “tweet-length” translation.
* **Domain diversity**: covers general, news, Islamic, medical, and other domains.
* **Human-validated**: professional translators and subject-matter experts reviewed every test sentence twice.
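The coverage and length claims above can be checked directly on the data. The following is a minimal sketch, not part of the official tooling: it assumes each row is a dict with the `Arabic`, `English`, and `source` fields of the dataset card, and it approximates length with whitespace-separated words, which will not match the reported token counts exactly. The helper name `summarize` and the toy rows are hypothetical.

```python
from collections import Counter
from statistics import mean


def summarize(rows):
    """Return per-direction row counts and mean whitespace-word lengths.

    `rows` is any iterable of dicts with `Arabic`, `English`, and `source`
    keys (assumed layout; whitespace words only approximate tokens).
    """
    rows = list(rows)
    directions = Counter(row["source"] for row in rows)
    return {
        "pairs": len(rows),
        "per_direction": dict(directions),
        "mean_words_ar": mean(len(row["Arabic"].split()) for row in rows),
        "mean_words_en": mean(len(row["English"].split()) for row in rows),
    }


# Toy rows for illustration only; they are not taken from Tarjama-25.
toy_rows = [
    {"Arabic": "مرحبا بالعالم", "English": "Hello world", "source": "ar-to-en"},
    {"Arabic": "هذا اختبار قصير", "English": "This is a short test", "source": "en-to-ar"},
]

stats = summarize(toy_rows)
print(stats)
```

On the real benchmark, the same function could be applied to the rows of whichever split the dataset exposes; a balanced benchmark should show roughly equal counts under `per_direction`.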
## Dataset structure

| Field | Type | Description |
|------------|--------|------------------------------------------|
| `Arabic` | string | Arabic sentence |
| `English` | string | English sentence |
| `category` | string | General, News, Islamic, Medical, Chemistry, or Physics |
| `source` | string | `"en-to-ar"` or `"ar-to-en"` |

## Usage

```python
from datasets import load_dataset

# Load the benchmark by its repository ID
ds = load_dataset("Misraj/Tarjama-25")
```

## Evaluation

### Benchmark results on Tarjama-25

| | | **Arabic → English** | | | **English → Arabic** | | |
|-------|------|-------|--------|------|-------|--------|------|
| Model | Size | COMET | ChrF++ | BLEU | COMET | ChrF++ | BLEU |
| **Mutarjim** | 1.5 B | 82.63 | 74.66 | **55.28** | **83.41** | **68.67** | **43.71** |
| NLLB | 3.3 B | 67.06 | 40.50 | 24.38 | 81.27 | 59.69 | 30.32 |
| c4ai | 7 B | 80.93 | 67.24 | 43.34 | 79.10 | 55.96 | 25.18 |
| Yehia | 7 B | 73.31 | 56.77 | 32.14 | 74.97 | 50.32 | 20.67 |
| ALLam | 7 B | 72.90 | 56.88 | 31.01 | 75.41 | 51.24 | 20.54 |
| Cohere | 8 B | 81.20 | 67.16 | 42.72 | 82.50 | 58.46 | 26.26 |
| AceGPT | 8 B | 80.71 | 65.63 | 38.67 | 78.39 | 50.67 | 20.02 |
| LLaMAX3 | 8 B | 77.72 | 54.95 | 27.86 | 56.76 | 33.25 | 7.63 |
| SILMA | 9 B | 64.36 | 37.84 | 15.67 | 58.01 | 27.71 | 5.62 |
| GemmaX | 9 B | 69.63 | 43.42 | 19.96 | 66.94 | 37.66 | 9.98 |
| XALMA | 13 B | 73.37 | 46.96 | 21.57 | 66.36 | 29.88 | 6.64 |
| Gemma-2 | 27 B | 80.81 | 70.42 | 42.78 | 42.20 | 3.52 | 3.08 |
| Cohere | 32 B | 82.44 | 73.10 | 51.16 | 82.09 | 63.29 | 32.25 |
| GPT-4o mini | – | **83.67** | **76.08** | 54.24 | 83.36 | 66.36 | 38.52 |

**Key takeaways**

**Mutarjim** outperforms all other models on every metric for **English → Arabic**, and secures the top **BLEU** score for **Arabic → English** despite being vastly smaller (1.5 B vs. 7 B–32 B parameters).
GPT-4o mini edges out Mutarjim on COMET and ChrF++ for Arabic → English, illustrating how well-balanced Tarjama-25 is across directions and metrics.

We recommend using the open-source Mutarjim-evaluation toolkit, which already supports Tarjama-25:

```bash
git clone https://github.com/misraj-ai/Mutarjim-evaluation
```

## Citation

If you use Tarjama-25 in your research, please cite:

```bibtex
@misc{hennara2025mutarjimadvancingbidirectionalarabicenglish,
  title={Mutarjim: Advancing Bidirectional Arabic-English Translation with a Small Language Model},
  author={Khalil Hennara and Muhammad Hreden and Mohamed Motaism Hamed and Zeina Aldallal and Sara Chrouf and Safwan AlModhayan},
  year={2025},
  eprint={2505.17894},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2505.17894},
}
```
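For readers who want to score a model without the toolkit, evaluation inputs have to be built separately for each direction. The sketch below shows one way to do that, under the assumption that `source` marks the direction to evaluate (`"ar-to-en"` meaning Arabic input with an English reference). The dataset card does not state this explicitly, so it should be verified against the data. The names `DIRECTION_FIELDS` and `build_eval_pairs` and the toy rows are hypothetical.

```python
from collections import defaultdict

# Assumed mapping from the `source` value to (input field, reference field).
DIRECTION_FIELDS = {
    "ar-to-en": ("Arabic", "English"),
    "en-to-ar": ("English", "Arabic"),
}


def build_eval_pairs(rows):
    """Group rows into {direction: [(input_text, reference_text), ...]}."""
    pairs = defaultdict(list)
    for row in rows:
        input_field, reference_field = DIRECTION_FIELDS[row["source"]]
        pairs[row["source"]].append((row[input_field], row[reference_field]))
    return dict(pairs)


# Toy rows for illustration only; they are not taken from Tarjama-25.
toy_rows = [
    {"Arabic": "مرحبا بالعالم", "English": "Hello world", "source": "ar-to-en"},
    {"Arabic": "هذا اختبار قصير", "English": "This is a short test", "source": "en-to-ar"},
]

pairs = build_eval_pairs(toy_rows)
for direction, items in pairs.items():
    print(direction, len(items))
```

The inputs of each direction can then be translated by the model under test and the outputs compared with the references using COMET, ChrF++, and BLEU, the metrics reported in the results table.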

Provider:
maas
Created:
2025-07-07