PHINC
arXiv | Updated 2020-04-21 | Indexed 2024-06-21
Download link:
https://doi.org/10.5281/zenodo.3605597
Resource overview:
PHINC is a large-scale parallel corpus of code-mixed English-Hindi social media text, created at the Indian Institute of Technology Gandhinagar. It contains 13,738 code-mixed English-Hindi sentences paired with English translations, which were produced manually by 54 student annotators. The dataset was built to advance research on code-mixed machine translation, particularly for the informal, multilingual text found on social media. It draws on platforms such as Twitter and Facebook, spans topics including sports, entertainment, and news, and targets the limitations of existing machine translation systems on code-mixed input.
Provided by:
Indian Institute of Technology Gandhinagar
Created:
2020-04-21



