wmt24pp-images
Source: ModelScope community. Updated 2026-01-02; indexed 2025-04-26.
Download link:
https://modelscope.cn/datasets/google/wmt24pp-images
# WMT24++ Source URLs & Images
This repository contains the source URLs and full-page document screenshots for each document in the data from
[WMT24++: Expanding the Language Coverage of WMT24 to 55 Languages & Dialects](https://arxiv.org/abs/2502.12404).
These images preserve the original document structure of the translation segments with any embedded images, and may be used for multimodal translation or language understanding.
If you are interested in the human translations and post-edit data, please see [here](https://huggingface.co/datasets/google/wmt24pp).
If you are interested in the MT/LLM system outputs and automatic metric scores, please see [MTME](https://github.com/google-research/mt-metrics-eval).
## Schema
Each row is a serialized JSON object for a source document with the following fields:
- `image`: The full-page screenshot of the source document. Images have a fixed width (750px) and a variable height that depends on the content. Where the source was no longer available, the image is a black 750x750px placeholder.
- `document_id`: The unique ID that identifies the document the source came from.
- `original_url`: The original document URL.
- `mirror_url`: Where the original source is no longer available, a URL for an alternate archived mirror copy.
- `source_available`: A boolean (true / false) indicating whether the source is available, that is, whether `image` is a real screenshot or a placeholder.
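
The fields above can be used to skip placeholder images and to choose which URL to follow. The following is a minimal sketch, not part of the dataset's tooling: the helper names `preferred_url` and `rows_with_screenshots` are hypothetical, the example rows and URLs are invented, and it assumes that an absent `mirror_url` is stored as `None` or an empty string, which the schema does not state.

```python
from typing import Any, Iterable, Optional


def preferred_url(row: dict[str, Any]) -> Optional[str]:
    """Return the mirror URL when one is present, else the original URL."""
    # Assumption: a missing mirror is stored as None or an empty string.
    return row.get("mirror_url") or row.get("original_url")


def rows_with_screenshots(rows: Iterable[dict[str, Any]]) -> list[dict[str, Any]]:
    """Keep only rows whose `image` is a real screenshot, not a placeholder."""
    return [row for row in rows if row["source_available"]]


# Hypothetical rows that mimic the schema; the `image` field is omitted
# and the field combinations are invented for illustration.
example_rows = [
    {"document_id": "doc-a", "original_url": "https://example.com/a",
     "mirror_url": None, "source_available": True},
    {"document_id": "doc-b", "original_url": "https://example.com/b",
     "mirror_url": "https://mirror.example.org/b", "source_available": True},
    {"document_id": "doc-c", "original_url": "https://example.com/c",
     "mirror_url": None, "source_available": False},
]

usable = rows_with_screenshots(example_rows)
print([row["document_id"] for row in usable])
print(preferred_url(example_rows[1]))
```

Filtering on `source_available` rather than inspecting pixel data avoids decoding every image just to detect the black placeholder.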
## Citation
If you use any of the data released in our work, please cite the following paper:
```
@misc{deutsch2025wmt24expandinglanguagecoverage,
title={{WMT24++: Expanding the Language Coverage of WMT24 to 55 Languages & Dialects}},
author={Daniel Deutsch and Eleftheria Briakou and Isaac Caswell and Mara Finkelstein and Rebecca Galor and Juraj Juraska and Geza Kovacs and Alison Lui and Ricardo Rei and Jason Riesa and Shruti Rijhwani and Parker Riley and Elizabeth Salesky and Firas Trabelsi and Stephanie Winkler and Biao Zhang and Markus Freitag},
year={2025},
eprint={2502.12404},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.12404},
}
```
Provider:
maas
Created:
2025-04-21



