Acronym Corpus
Source: arXiv (indexed 2025-09-30)
Download link:
https://github.com/rtotheich/acronym_corpus/tree/main
Resource description:
This dataset was created to evaluate how well machine translation systems handle acronyms. It contains 437 long-form/short-form (LF-SF) pairs drawn from a corpus of 13,500 abstracts scraped from the HAL database. It is intended to help improve acronym resolution in machine translation systems, and it is stated to contain no offensive content or personally identifiable information. The task focus is acronym disambiguation and translation.
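To make the notion of an LF-SF pair concrete, here is a minimal Python sketch that spots pairs of the shape "long form (SF)" in running text. The listing does not describe how the 437 pairs were actually extracted, so this is not the authors' method. The function name `find_lf_sf_pairs` and the initials-matching heuristic are assumptions made for illustration only.

```python
import re
from typing import List, Tuple


def find_lf_sf_pairs(text: str) -> List[Tuple[str, str]]:
    """Return (long form, short form) pairs written as 'long form (SF)'.

    Illustrative heuristic (an assumption, not the dataset's procedure):
    the short form is a run of uppercase letters in parentheses, and it must
    equal the initials of the words immediately preceding the parenthesis.
    """
    pairs = []
    for match in re.finditer(r"\(([A-Z]{2,})\)", text):
        short_form = match.group(1)
        # Words that appear before the opening parenthesis.
        words = re.findall(r"[A-Za-z]+", text[:match.start()])
        if len(words) < len(short_form):
            continue
        candidate = words[-len(short_form):]
        initials = "".join(word[0].upper() for word in candidate)
        if initials == short_form:
            pairs.append((" ".join(candidate), short_form))
    return pairs


sample = "We evaluate machine translation (MT) systems on abstracts."
print(find_lf_sf_pairs(sample))
```

A heuristic this simple misses short forms that skip function words or mix cases, which is one reason a curated set of pairs is useful for evaluation.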
Provided by:
HAL repository



