TCNAEC
Source: ieee-dataport.org (indexed 2025-03-22)
Download link:
https://ieee-dataport.org/documents/tcnaec
Description:
In Natural Language Processing (NLP), improving the English writing fluency of non-native speakers, particularly in academic contexts, remains a significant challenge. Sentence-level Revision (SentRev) targets this problem, but its existing evaluation corpus, SMITH, does not support a robust and comprehensive assessment of the task. To bridge this gap, we propose a new scheme for generating evaluation corpora and use it to build the Ten-Country Non-native Academic English Corpus (TCNAEC). Our analysis shows that TCNAEC improves on SMITH along several dimensions, and our evaluation also uncovers linguistic phenomena that may be useful to other researchers. By comparison, the related Grammatical Error Correction (GEC) task has been explored more extensively and has a richer set of training and evaluation corpora, whereas the distinctive attributes of SentRev make it a harder task for NLP systems. TCNAEC represents ten countries and captures the English expression styles of non-native speakers worldwide, giving a more holistic view than the Japan-centric SMITH. In addition, while SMITH focuses mainly on computational linguistics, TCNAEC spans multiple disciplines. The construction strategy of TCNAEC ensures semantic consistency between Draft and Reference and emphasizes meaningful structural variations, which reflect the stylistic differences between non-academic and academic texts.
Provider:
IEEE Dataport



