mteb/NTREX

Name: mteb/NTREX
Creator: mteb
Published: 2025-05-04 16:08:57
License: CC-BY-SA-4.0

Source: Hugging Face. Updated: 2025-05-04. Indexed: 2024-06-12.

Download link:

https://hf-mirror.com/datasets/mteb/NTREX


Resource summary:

NTREXBitextMining is a News Test References dataset for Machine Translation Evaluation, covering translation from English into 128 languages.

Provider:

mteb

Original information summary

Dataset overview

Name: NTREX -- News Test References for MT Evaluation

Languages: 128 languages are supported, including but not limited to Afrikaans, Amharic, Arabic, Azerbaijani, Bashkir, Belarusian, Bengali, Tibetan, Bosnian, Bulgarian, Catalan, Czech, Sorani Kurdish, Welsh, Danish, German, Dhivehi, Dzongkha, Greek, English, Estonian, Basque, Ewe, Faroese, Persian, Fijian, Filipino, Finnish, French, Irish, Galician, Gujarati, Hausa, Hebrew, Hindi, Hmong, Croatian, Hungarian, Armenian, Igbo, Indonesian, Icelandic, Italian, Japanese, Kannada, Georgian, Kazakh, Khmer, Kinyarwanda, Kyrgyz, Northern Kurdish, Korean, Lao, Latvian, Lithuanian, Luxembourgish, Malayalam, Marathi, Hassaniya Arabic, Macedonian, Malagasy, Maltese, Mongolian, Maori, Malay, Burmese, Ndebele, Nepali, Dutch, Norwegian Nynorsk, Norwegian Bokmål, Northern Sotho, Chichewa, Oromo, Punjabi (Gurmukhi), Polish, Portuguese, Dari, Pashto, Romanian, Russian, Tachelhit, Sinhala, Slovak, Slovenian, Samoan, Shona, Sindhi, Somali, Spanish, Albanian, Serbian, Swati, Swahili, Swedish, Tahitian, Tamil, Tatar, Telugu, Tajik, Thai, Tigrinya, Tongan, Tswana, Turkmen, Turkish, Uighur, Ukrainian, Urdu, Uzbek, Venda, Vietnamese, Wolof, Xhosa, Yoruba, Cantonese, Chinese (Simplified), Chinese (Traditional), Zulu.

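License: CC-BY-SA-4.0

Multilinguality: supports translation tasks

Task category: translation

Size: 1997 rows

Configurations:

Default configuration:
- Data files:
  - Test split: test.parquet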
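
The default configuration above maps a single test split to `test.parquet`. A minimal sketch of loading it with the Hugging Face `datasets` library follows. It assumes `datasets` is installed and the Hub is reachable. The helper name `load_ntrex_test` is hypothetical, and the column names are not listed on this page, so the sketch does not rely on them.

```python
# Sketch only: mirrors the "default" configuration listed above.
REPO_ID = "mteb/NTREX"                  # repository name from this page
DATA_FILES = {"test": "test.parquet"}   # split -> file, as listed above


def load_ntrex_test():
    """Load the test split (hypothetical helper; needs network access)."""
    # Imported lazily so that defining this helper does not require
    # the `datasets` package to be installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, data_files=DATA_FILES, split="test")


# Example call (not executed here because it downloads the data):
#     ds = load_ntrex_test()
#     print(ds.num_rows, ds.column_names)
```

Passing `data_files` explicitly keeps the sketch independent of how the repository card declares its configurations. If the card's default configuration is intact, `load_dataset("mteb/NTREX", split="test")` should behave the same way.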

Citation

If you cite this dataset, please use the following citation:

@inproceedings{federmann-etal-2022-ntrex,
    title = "{NTREX}-128 {--} News Test References for {MT} Evaluation of 128 Languages",
    author = "Federmann, Christian and Kocmi, Tom and Xin, Ying",
    booktitle = "Proceedings of the First Workshop on Scaling Up Multilingual Evaluation",
    month = "nov",
    year = "2022",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.sumeval-1.4",
    pages = "21--24",
}

Please also cite the WMT 2019 paper, which provided the English source data:

@inproceedings{barrault-etal-2019-findings,
    title = "Findings of the 2019 Conference on Machine Translation ({WMT}19)",
    author = {Barrault, Lo{\"i}c and Bojar, Ond{\v{r}}ej and Costa-juss{\`a}, Marta R. and Federmann, Christian and Fishel, Mark and Graham, Yvette and Haddow, Barry and Huck, Matthias and Koehn, Philipp and Malmasi, Shervin and Monz, Christof and M{\"u}ller, Mathias and Pal, Santanu and Post, Matt and Zampieri, Marcos},
    editor = "Bojar, Ond{\v{r}}ej and Chatterjee, Rajen and Federmann, Christian and Fishel, Mark and Graham, Yvette and Haddow, Barry and Huck, Matthias and Yepes, Antonio Jimeno and Koehn, Philipp and Martins, Andr{\'e} and Monz, Christof and Negri, Matteo and N{\'e}v{\'e}ol, Aur{\'e}lie and Neves, Mariana and Post, Matt and Turchi, Marco and Verspoor, Karin",
    booktitle = "Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)",
    month = aug,
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/W19-5301",
    doi = "10.18653/v1/W19-5301",
    pages = "1--61",
}

Collected summary

Dataset introduction

Background and challenges

Background overview

The NTREX dataset is a multilingual benchmark for machine translation evaluation that covers 128 languages and focuses on news-domain text. It contains 1,997 rows of data (16.1 MB in total), is structured for text embedding and translation tasks, and is supported by the MTEB framework.
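
To illustrate what a bitext mining task asks of a text embedding model, here is a minimal sketch, not MTEB's actual scoring code. Each source sentence is matched to the target sentence whose embedding has the highest cosine similarity, and a match counts as correct when it lands on the aligned row. The toy vectors are invented for illustration and are not taken from NTREX. The function names `cosine`, `mine_bitext`, and `accuracy` are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mine_bitext(src_vecs: List[Sequence[float]],
                tgt_vecs: List[Sequence[float]]) -> List[int]:
    """For each source vector, return the index of the most similar target."""
    return [
        max(range(len(tgt_vecs)), key=lambda j: cosine(u, tgt_vecs[j]))
        for u in src_vecs
    ]


def accuracy(matches: List[int]) -> float:
    """Fraction of sources matched to their aligned row (source i <-> target i)."""
    return sum(1 for i, j in enumerate(matches) if i == j) / len(matches)


# Toy embeddings (assumption: stand-ins for real sentence embeddings).
src = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
tgt = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.7, 0.0, 0.3]]

matches = mine_bitext(src, tgt)
print(matches, accuracy(matches))
```

In a real evaluation the vectors would come from an embedding model applied to the English sentences and their translations, and the benchmark may report different metrics than plain accuracy.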

The content above was collected and summarized by 遇见数据集.
