localization-xml-mt
Source: arXiv | Updated: 2020-06-24 | Indexed: 2024-06-21
Download link:
https://github.com/salesforce/localization-xml-mt
Description:
localization-xml-mt is a dataset created by Salesforce to advance research on the localization of structured text. It consists of parallel text segments with XML structure, collected from the online documentation of an enterprise software platform. The web pages were translated from English into 16 languages by professional translators and are maintained by domain experts. Roughly 100,000 text segments are available for each language pair. The authors use the dataset to build and evaluate translation models from English into seven target languages, plus one non-English language pair, to show the potential of its well-defined 17×16 translation setting (17 languages, each paired with the 16 others). In their experiments, learning to translate with the XML tags in place improves translation accuracy, and beam search generates the XML structure accurately. The accompanying work also discusses the trade-offs of using a copy mechanism, focusing on the translation of numerical words and named entities, and provides a detailed human analysis of the gaps between model output and human translations, including suitability for post-editing in real-world applications.
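As a rough illustration of what "generating the XML structure accurately" can mean in practice, the sketch below checks whether a translated segment is well-formed and carries the same set of tags as its source. The helper names (`tag_counts`, `same_xml_structure`) and the sample segments are hypothetical and are not taken from the dataset or its repository; the dataset's actual file format and evaluation scripts may differ.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def tag_counts(segment: str) -> Counter:
    """Count element names in an XML-tagged text segment.

    The segment is wrapped in a dummy <seg> root so that a fragment with
    several top-level tags can be parsed. Raises ET.ParseError if the
    tags are not well-formed.
    """
    root = ET.fromstring(f"<seg>{segment}</seg>")
    return Counter(el.tag for el in root.iter() if el is not root)


def same_xml_structure(source: str, target: str) -> bool:
    """Return True if target is well-formed and uses the same tags as source."""
    try:
        return tag_counts(source) == tag_counts(target)
    except ET.ParseError:
        return False


# Made-up example segments (assumption: inline tags such as <b> and <i>).
src = "Click <b>Save</b> to apply <i>all</i> changes."
good = "Klicken Sie auf <b>Speichern</b>, um <i>alle</i> Änderungen zu übernehmen."
bad = "Klicken Sie auf <b>Speichern, um <i>alle</i> Änderungen zu übernehmen."

print(same_xml_structure(src, good))
print(same_xml_structure(src, bad))
```

This check compares only tag names and counts, not nesting order or attributes, so it is a coarse filter rather than a full structural comparison.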
Provider:
Salesforce
Created:
2020-06-24



