ACL word segmentation correction
DataCite Commons | Updated 2025-01-28 | Indexed 2025-04-17
Download link:
https://heidata.uni-heidelberg.de/citation?persistentId=doi:10.11588/DATA/VK99LU
Resource description:
This collection consists of two parallel directories: "raw", which contains the raw text of 18,850 articles from the ACL 2013/02 collection, and "re-segmented", which contains the word-resegmented versions of the same articles. The re-segmentation was produced with nematus, a seq2seq neural model used for machine translation. The work was motivated by the observation that spurious spaces appeared to be very common in the text, particularly in older papers whose text was obtained by OCR of scanned pages.
Provider:
heiDATA
Created:
2019-07-15
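
Because the resource description mentions two parallel directories, a reader may want to line up each raw article with its re-segmented counterpart. The following is a minimal sketch of one way to do that, not a documented interface of the dataset. It assumes that a file has the same relative path under "raw" and "re-segmented", and that files are UTF-8 text. Neither assumption is confirmed by the record. The function names `pair_articles` and `token_counts` are hypothetical.

```python
from pathlib import Path


def pair_articles(root, raw_dir="raw", reseg_dir="re-segmented"):
    """Return sorted (raw_path, resegmented_path) pairs.

    ASSUMPTION: an article has the same relative path in both
    directories. Files without a counterpart are skipped.
    """
    raw_root = Path(root) / raw_dir
    reseg_root = Path(root) / reseg_dir
    pairs = []
    for raw_path in sorted(raw_root.rglob("*")):
        if not raw_path.is_file():
            continue
        candidate = reseg_root / raw_path.relative_to(raw_root)
        if candidate.is_file():
            pairs.append((raw_path, candidate))
    return pairs


def token_counts(raw_path, reseg_path, encoding="utf-8"):
    """Return whitespace-token counts for the raw and re-segmented file.

    ASSUMPTION: files are UTF-8 text; undecodable bytes are replaced.
    """
    raw_text = Path(raw_path).read_text(encoding=encoding, errors="replace")
    reseg_text = Path(reseg_path).read_text(encoding=encoding, errors="replace")
    return len(raw_text.split()), len(reseg_text.split())
```

If spurious spaces were removed, the re-segmented token count for an article would typically be lower than the raw count, so comparing the two numbers from `token_counts` gives a rough per-article indication of how much the text changed.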



