
wmt24pp

ModelScope community. Updated 2026-05-16; indexed 2025-04-26.
Download link:
https://modelscope.cn/datasets/google/wmt24pp
Resource description:
# WMT24++

This repository contains the human translation and post-edit data for the 55 en->xx language pairs released in the publication [WMT24++: Expanding the Language Coverage of WMT24 to 55 Languages & Dialects](https://arxiv.org/abs/2502.12404).

If you are interested in the MT/LLM system outputs and automatic metric scores, please see [MTME](https://github.com/google-research/mt-metrics-eval/tree/main?tab=readme-ov-file#wmt24-data).

If you are interested in the images of the source URLs for each document, please see [here](https://huggingface.co/datasets/google/wmt24pp-images).

## Schema

Each language pair is stored in its own jsonl file. Each row is a serialized JSON object with the following fields:

- `lp`: The language pair (e.g., `"en-de_DE"`).
- `domain`: The domain of the source, either `"canary"`, `"news"`, `"social"`, `"speech"`, or `"literary"`.
- `document_id`: The unique ID that identifies the document the source came from.
- `segment_id`: The globally unique ID that identifies the segment.
- `is_bad_source`: A Boolean that indicates whether this source is low quality (e.g., HTML, URLs, emojis). In the paper, the segments marked as true were removed from the evaluation, and we recommend doing the same.
- `source`: The English source text.
- `target`: The post-edit of `original_target`. We recommend using the post-edit as the default reference.
- `original_target`: The original reference translation.
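The schema above suggests a simple reading loop: parse each jsonl line, drop rows where `is_bad_source` is true, and use `target` as the reference. The following is a minimal sketch under stated assumptions, not an official loader: the sample rows, their ID values and types, and the helper names `read_segments` and `references` are all invented for illustration.

```python
import io
import json

# Hypothetical rows that follow the documented schema. The texts and the ID
# values (including their types) are invented, not taken from the dataset.
SAMPLE_ROWS = [
    {"lp": "en-de_DE", "domain": "news", "document_id": "doc-1",
     "segment_id": 0, "is_bad_source": False, "source": "Hello world.",
     "target": "Hallo Welt.", "original_target": "Hallo, Welt."},
    {"lp": "en-de_DE", "domain": "social", "document_id": "doc-2",
     "segment_id": 1, "is_bad_source": True, "source": "https://example.com",
     "target": "https://example.com", "original_target": "https://example.com"},
    {"lp": "en-de_DE", "domain": "speech", "document_id": "doc-3",
     "segment_id": 2, "is_bad_source": False, "source": "Thank you.",
     "target": "Danke.", "original_target": "Danke schön."},
]
SAMPLE_JSONL = "\n".join(json.dumps(row) for row in SAMPLE_ROWS)


def read_segments(lines, keep_bad_sources=False):
    """Yield one dict per non-empty jsonl line, skipping bad sources by default."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        row = json.loads(line)
        if row["is_bad_source"] and not keep_bad_sources:
            continue
        yield row


def references(rows, use_post_edit=True):
    """Return (source, reference) pairs; the post-edit is the default reference."""
    key = "target" if use_post_edit else "original_target"
    return [(row["source"], row[key]) for row in rows]


# io.StringIO stands in for an opened jsonl file of one language pair.
rows = list(read_segments(io.StringIO(SAMPLE_JSONL)))
pairs = references(rows)
```

With a real file, the `io.StringIO` object would be replaced by the handle returned from `open(path, encoding="utf-8")`, where `path` points to the jsonl file of the language pair of interest.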
## Citation

If you use any of the data released in our work, please cite the following paper:

```
@misc{deutsch2025wmt24expandinglanguagecoverage,
      title={{WMT24++: Expanding the Language Coverage of WMT24 to 55 Languages & Dialects}},
      author={Daniel Deutsch and Eleftheria Briakou and Isaac Caswell and Mara Finkelstein and Rebecca Galor and Juraj Juraska and Geza Kovacs and Alison Lui and Ricardo Rei and Jason Riesa and Shruti Rijhwani and Parker Riley and Elizabeth Salesky and Firas Trabelsi and Stephanie Winkler and Biao Zhang and Markus Freitag},
      year={2025},
      eprint={2502.12404},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.12404},
}
```

## Extensions of this benchmark

- https://huggingface.co/datasets/ZurichNLP/wmt24pp-rm (German → 6 varieties of Romansh)

## Helpful Python Constants

```python
LANGUAGE_PAIRS = (
    "en-ar_EG",
    "en-ar_SA",
    "en-bg_BG",
    "en-bn_IN",
    "en-ca_ES",
    "en-cs_CZ",
    "en-da_DK",
    "en-de_DE",
    "en-el_GR",
    "en-es_MX",
    "en-et_EE",
    "en-fa_IR",
    "en-fi_FI",
    "en-fil_PH",
    "en-fr_CA",
    "en-fr_FR",
    "en-gu_IN",
    "en-he_IL",
    "en-hi_IN",
    "en-hr_HR",
    "en-hu_HU",
    "en-id_ID",
    "en-is_IS",
    "en-it_IT",
    "en-ja_JP",
    "en-kn_IN",
    "en-ko_KR",
    "en-lt_LT",
    "en-lv_LV",
    "en-ml_IN",
    "en-mr_IN",
    "en-nl_NL",
    "en-no_NO",
    "en-pa_IN",
    "en-pl_PL",
    "en-pt_BR",
    "en-pt_PT",
    "en-ro_RO",
    "en-ru_RU",
    "en-sk_SK",
    "en-sl_SI",
    "en-sr_RS",
    "en-sv_SE",
    "en-sw_KE",
    "en-sw_TZ",
    "en-ta_IN",
    "en-te_IN",
    "en-th_TH",
    "en-tr_TR",
    "en-uk_UA",
    "en-ur_PK",
    "en-vi_VN",
    "en-zh_CN",
    "en-zh_TW",
    "en-zu_ZA",
)

LANGUAGE_BY_CODE = {
    "ar_EG": "Arabic",
    "ar_SA": "Arabic",
    "bg_BG": "Bulgarian",
    "bn_BD": "Bengali",
    "bn_IN": "Bengali",
    "ca_ES": "Catalan",
    "cs_CZ": "Czech",
    "da_DK": "Danish",
    "de_DE": "German",
    "el_GR": "Greek",
    "es_MX": "Spanish",
    "et_EE": "Estonian",
    "fa_IR": "Farsi",
    "fi_FI": "Finnish",
    "fil_PH": "Filipino",
    "fr_CA": "French",
    "fr_FR": "French",
    "gu_IN": "Gujarati",
    "he_IL": "Hebrew",
    "hi_IN": "Hindi",
    "hr_HR": "Croatian",
    "hu_HU": "Hungarian",
    "id_ID": "Indonesian",
    "is_IS": "Icelandic",
    "it_IT": "Italian",
    "ja_JP": "Japanese",
    "kn_IN": "Kannada",
    "ko_KR": "Korean",
    "lt_LT": "Lithuanian",
    "lv_LV": "Latvian",
    "ml_IN": "Malayalam",
    "mr_IN": "Marathi",
    "nl_NL": "Dutch",
    "no_NO": "Norwegian",
    "pa_IN": "Punjabi",
    "pl_PL": "Polish",
    "pt_BR": "Portuguese",
    "pt_PT": "Portuguese",
    "ro_RO": "Romanian",
    "ru_RU": "Russian",
    "sk_SK": "Slovak",
    "sl_SI": "Slovenian",
    "sr_RS": "Serbian",
    "sv_SE": "Swedish",
    "sw_KE": "Swahili",
    "sw_TZ": "Swahili",
    "ta_IN": "Tamil",
    "te_IN": "Telugu",
    "th_TH": "Thai",
    "tr_TR": "Turkish",
    "uk_UA": "Ukrainian",
    "ur_PK": "Urdu",
    "vi_VN": "Vietnamese",
    "zh_CN": "Mandarin",
    "zh_TW": "Mandarin",
    "zu_ZA": "Zulu",
}

REGION_BY_CODE = {
    "ar_EG": "Egypt",
    "ar_SA": "Saudi Arabia",
    "bg_BG": "Bulgaria",
    "bn_BD": "Bangladesh",
    "bn_IN": "India",
    "ca_ES": "Spain",
    "cs_CZ": "Czechia",
    "da_DK": "Denmark",
    "de_DE": "Germany",
    "el_GR": "Greece",
    "es_MX": "Mexico",
    "et_EE": "Estonia",
    "fa_IR": "Iran",
    "fi_FI": "Finland",
    "fil_PH": "Philippines",
    "fr_CA": "Canada",
    "fr_FR": "France",
    "gu_IN": "India",
    "he_IL": "Israel",
    "hi_IN": "India",
    "hr_HR": "Croatia",
    "hu_HU": "Hungary",
    "id_ID": "Indonesia",
    "is_IS": "Iceland",
    "it_IT": "Italy",
    "ja_JP": "Japan",
    "kn_IN": "India",
    "ko_KR": "South Korea",
    "lt_LT": "Lithuania",
    "lv_LV": "Latvia",
    "ml_IN": "India",
    "mr_IN": "India",
    "nl_NL": "Netherlands",
    "no_NO": "Norway",
    "pa_IN": "India",
    "pl_PL": "Poland",
    "pt_BR": "Brazil",
    "pt_PT": "Portugal",
    "ro_RO": "Romania",
    "ru_RU": "Russia",
    "sk_SK": "Slovakia",
    "sl_SI": "Slovenia",
    "sr_RS": "Serbia",
    "sv_SE": "Sweden",
    "sw_KE": "Kenya",
    "sw_TZ": "Tanzania",
    "ta_IN": "India",
    "te_IN": "India",
    "th_TH": "Thailand",
    "tr_TR": "Turkey",
    "uk_UA": "Ukraine",
    "ur_PK": "Pakistan",
    "vi_VN": "Vietnam",
    "zh_CN": "China",
    "zh_TW": "Taiwan",
    "zu_ZA": "South Africa",
}
```

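One possible use of the constants above is to turn an `lp` value such as `"en-de_DE"` into a readable label, for example when building a translation prompt or a results table. The sketch below is only an illustration: the helper name `describe_language_pair` is hypothetical, and the two dictionaries are small excerpts repeated so that the snippet runs on its own.

```python
# Excerpts of LANGUAGE_BY_CODE and REGION_BY_CODE from the constants above,
# repeated here only so that this sketch is self-contained.
LANGUAGE_BY_CODE = {
    "de_DE": "German",
    "pt_BR": "Portuguese",
    "fil_PH": "Filipino",
}
REGION_BY_CODE = {
    "de_DE": "Germany",
    "pt_BR": "Brazil",
    "fil_PH": "Philippines",
}


def describe_language_pair(lp):
    """Return a label such as 'English -> German (Germany)' for 'en-de_DE'."""
    # Split on the first hyphen only; target codes use an underscore internally.
    source_code, _, target_code = lp.partition("-")
    if source_code != "en" or target_code not in LANGUAGE_BY_CODE:
        raise ValueError(f"Unrecognized language pair: {lp!r}")
    language = LANGUAGE_BY_CODE[target_code]
    region = REGION_BY_CODE[target_code]
    return f"English -> {language} ({region})"


label = describe_language_pair("en-pt_BR")
```

Keeping the region in the label matters for pairs that share a language name, such as `pt_BR` and `pt_PT`, which would otherwise be indistinguishable.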
Provider:
maas
Created:
2025-04-21
Collected summary
Dataset introduction
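Background and challenges
Background overview
The wmt24pp dataset contains human translation and post-edit data for 55 English-to-other-language pairs, intended mainly for research on machine translation and large language models. The data is stored in jsonl format; each entry includes the language pair, domain, document ID, segment ID, source-quality flag, source text, target text, and original target text.
The summary above was collected and generated by 遇见数据集.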