M3T - A New Benchmark Dataset for Multi-Modal Document-Level Machine Translation
Dataset Overview
Dataset Name
M3T: A New Benchmark Dataset for Multi-Modal Document-Level Machine Translation
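Dataset Purpose
The dataset is designed to evaluate how well neural machine translation (NMT) systems translate semi-structured documents, with particular attention to how visual cues such as document layout affect the translation task.
Dataset Features
- Targets document-level NMT systems and accounts for the importance of visual elements such as document layout.
- Aims to address the shortcomings of existing NMT systems when handling complex text layouts.
License
This dataset is licensed under CC-BY-4.0.
Citation
If you use this dataset, please consider citing the following paper: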
@misc{hsu2024m3t,
  title={M3T: A New Benchmark Dataset for Multi-Modal Document-Level Machine Translation},
  author={Benjamin Hsu and Xiaoyu Liu and Huayang Li and Yoshinari Fujinuma and Maria Nadejde and Xing Niu and Yair Kittenplon and Ron Litman and Raghavendra Pappagari},
  year={2024},
  eprint={2406.08255},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
Please also cite the original datasets:
@inproceedings{pfitzmann-et-al,
  author = {Pfitzmann, Birgit and Auer, Christoph and Dolfi, Michele and Nassar, Ahmed S. and Staar, Peter},
  title = {DocLayNet: A Large Human-Annotated Dataset for Document-Layout Segmentation},
  year = {2022},
  isbn = {9781450393850},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3534678.3539043},
  doi = {10.1145/3534678.3539043}
}
@inproceedings{harley-et-al,
  author={Harley, Adam W. and Ufkes, Alex and Derpanis, Konstantinos G.},
  booktitle={2015 13th International Conference on Document Analysis and Recognition (ICDAR)},
  title={Evaluation of deep convolutional nets for document image classification and retrieval},
  year={2015},
  pages={991-995},
  doi={10.1109/ICDAR.2015.7333910}
}




