
NICKLE: The Neungyule Interlanguage Corpus of Korean Learners of English

NIAID Data Ecosystem, indexed 2026-05-02
Download link:
https://zenodo.org/record/14909529
Resource description:
The corpus was constructed as a complementary resource for an English-Korean bilingual dictionary, the NeungYule-Longman English-Korean Dictionary. As a result, it may not be fully balanced.

Basic Information: The corpus contains approximately 1 million tokens, with written and spoken components in a ratio of approximately 9:1. The data is divided into several text types or registers according to topic and communicative context. The usable size may be smaller once duplicate and irrelevant texts are removed, depending on your research needs. Proficiency levels are not explicitly coded in the files, because the texts were collected from several universities across the country, each using different proficiency standards. Most of the texts range from basic through pre-intermediate to intermediate level, with some advanced-level texts included. You can identify advanced texts from the university names in the header or from text length. When using the corpus, I typically refer to the source information (i.e., the university that produced the text) for insight into proficiency level.

Annotation & Format: The corpus is not error-tagged or POS-tagged, due to practical constraints. Only a few files were error-tagged, to test an error-tagging scheme; automatic large-scale error tagging was not feasible for this corpus. If you need POS tagging, you can use any available NLP tool (e.g., TreeTagger, spaCy), or I can assist you in tagging the corpus. The corpus is stored in XML format, following TEI (Text Encoding Initiative) standards.
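Because proficiency has to be inferred from the header (source university) or from text length, a first pass over a file usually means pulling those two things out of the TEI markup. The following is a minimal sketch of that step, not the corpus author's tooling. It assumes each file has the standard TEI `teiHeader` and `text` elements; the sample document, the university name in it, and the helper names `local_name` and `read_tei` are hypothetical, and the real files may structure the header differently.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop an XML namespace prefix such as '{http://www.tei-c.org/ns/1.0}'."""
    return tag.rsplit("}", 1)[-1]


def read_tei(xml_string):
    """Return (header_text, body_text) for one TEI document given as a string.

    Only the first <teiHeader> and the first <text> element are used.
    Text nodes are joined with spaces and whitespace is normalized.
    """
    root = ET.fromstring(xml_string)
    header, body = None, None
    for elem in root.iter():
        name = local_name(elem.tag)
        if name == "teiHeader" and header is None:
            header = " ".join(" ".join(elem.itertext()).split())
        elif name == "text" and body is None:
            body = " ".join(" ".join(elem.itertext()).split())
    return header or "", body or ""


# Hypothetical sample document; real corpus files may differ (assumption).
SAMPLE = """<TEI>
  <teiHeader><fileDesc><titleStmt><title>Sample essay</title></titleStmt>
  <sourceDesc><p>Example University</p></sourceDesc></fileDesc></teiHeader>
  <text><body><p>I am student.</p><p>I like study English very much.</p></body></text>
</TEI>"""

header_text, body_text = read_tei(SAMPLE)
# Whitespace token count as a rough proxy for text length.
token_count = len(body_text.split())
print(header_text)
print(token_count)
```

The extracted body text can then be passed to a POS tagger such as spaCy or TreeTagger. Whitespace splitting is only a rough length measure and will not match a proper tokenizer's counts.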
Created:
2025-02-22