
Morphologically Annotated Korean Text

DataCite Commons · Updated 2021-07-01 · Indexed 2025-04-16
Download link:
https://catalog.ldc.upenn.edu/LDC2004T03
Resource description:
<h3>Introduction</h3><br> <p>Morphologically Annotated Korean Text was produced by the Linguistic Data Consortium (LDC), catalog number LDC2004T03, ISBN 1-58563-284-8.</p><br> <p>This is a collection of Korean text annotated with morphological analyses and part-of-speech tags. The source text was extracted from the <a href="../../../LDC2000T45">Korean Newswire</a> corpus, a collection of Korean Press Agency news articles from June 2, 1994 to March 20, 2000. The portion included in this release consists of a small number of hand-picked articles.</p><br> <p>The corpus is part of the Korean Treebank Phase 2. Between 2001 and 2002, the project was conducted under subcontract from Cogentex Inc., sponsor number Cogentex 5-33436. The text was tokenized and then automatically analyzed using Klex. Because a word can have multiple possible morphological analyses, the output was passed through a statistical ranking system to select the most likely analysis for each word in its context. The part-of-speech tagged result was then manually corrected by Seung-yun Yang and Na-Rae Han, graduate students in the University of Pennsylvania Linguistics Department.</p><br> <h3>Data</h3><br> <p>The data consists of a single file of approximately 880 KB uncompressed.</p><br> <p>The text contains 1,574 sentences with 41,024 words and 77,173 morphemes in total. The text file is in ksc-5601 encoding. Characters in Hangul (the Korean alphabet) can be displayed with Korean X-terminals such as hanterm, or by selecting Korean encoding in common web browsers such as Netscape or Internet Explorer.</p><br> <p>The data is formatted as follows: one head word per line, with the word and its morphological analysis separated by a tab. Each morpheme is followed by "/" and its part-of-speech tag; morphemes are separated by "+". 
^EOS is a special symbol denoting the end of a sentence.</p><br> <p>Morphologically analyzed and part-of-speech tagged data can be useful for training statistical morphological analyzers and part-of-speech taggers, and for evaluating existing ones.</p><br> <p>The morphologically tagged output is compatible with <a href="http://catalog.ldc.upenn.edu/LDC2004L01" rel="nofollow">Klex: Finite-State Lexical Transducer for Korean</a>. It also conforms to the <a href="ftp://ftp.cis.upenn.edu/pub/ircs/tr/01-09/" rel="nofollow">Korean Treebank POS annotation standards</a>.</p><br> <h3>Samples</h3><br> <p>Please view this <a href="desc/addenda/LDC2004T03.txt">sample</a>.</p><br> <h3>Updates</h3><br> <p>There are no updates available at this time.</p><br> <h3>Sponsorship</h3><br> <p>The Morphologically Annotated Korean Text corpus was funded in part through a 5-year grant (BCS-998009, KDI, SBE) from the National Science Foundation via <a href="http://www.talkbank.org" rel="nofollow">TalkBank</a>, an interdisciplinary project to foster research and development in communicative behavior by providing tools and standards for analysis and distribution of language data. Additional funding was provided by the Linguistic Data Consortium.</p><br> <h3>Note</h3><br> <p>The cost of the first 50 copies of this publication (not counting the copies distributed to LDC members) is covered by NSF Grant Number BCS-998009, so those copies are free of charge. After the first 50 copies are distributed, additional copies will be available for $300.</p><br> Portions © 1994-2000 Korean Press Agency, © 2004 Trustees of the University of Pennsylvania
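The file layout described under Data (one head word per line, a tab, then "morpheme/tag" units joined by "+", with ^EOS closing each sentence) could be read with a short script along the following lines. This is a minimal sketch, not an LDC tool. It assumes that ^EOS appears as the first field of its own line, that Python's "euc_kr" codec is a suitable decoder for the ksc-5601 file, and that no morpheme is itself a literal "+" (a naive split would mis-handle that case). The function names, the sample words, and the sample tags are invented for illustration and are not taken from the corpus.

```python
from typing import Iterable, List, Tuple

Morpheme = Tuple[str, str]          # (morpheme form, part-of-speech tag)
Word = Tuple[str, List[Morpheme]]   # (head word, its analysed morphemes)

EOS = "^EOS"  # end-of-sentence symbol named in the corpus description


def parse_analysis(analysis: str) -> List[Morpheme]:
    """Split 'form/TAG+form/TAG' into (form, tag) pairs."""
    morphemes: List[Morpheme] = []
    for chunk in analysis.split("+"):
        # Split on the LAST slash so a form containing '/' keeps it.
        form, sep, tag = chunk.rpartition("/")
        if not sep:                 # no slash at all: keep the text, empty tag
            form, tag = chunk, ""
        morphemes.append((form, tag))
    return morphemes


def read_sentences(lines: Iterable[str]) -> List[List[Word]]:
    """Group 'word<TAB>analysis' lines into sentences delimited by ^EOS."""
    sentences: List[List[Word]] = []
    current: List[Word] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue                # skip blank lines
        word, _, analysis = line.partition("\t")
        if word.strip() == EOS:     # assumed: ^EOS is the first field of its line
            if current:
                sentences.append(current)
            current = []
            continue
        current.append((word, parse_analysis(analysis)))
    if current:                     # tolerate a missing final ^EOS
        sentences.append(current)
    return sentences


def corpus_counts(sentences: List[List[Word]]) -> Tuple[int, int, int]:
    """Return (sentence count, word count, morpheme count)."""
    words = sum(len(s) for s in sentences)
    morphemes = sum(len(m) for s in sentences for _, m in s)
    return len(sentences), words, morphemes


# Invented two-word sample, encoded the way the corpus file is assumed to be.
SAMPLE_BYTES = (
    "학교에\t학교/NNC+에/PAD\n"
    "갔다\t가/VV+았/EPF+다/EFN\n"
    "^EOS\n"
).encode("euc_kr")

sample_sentences = read_sentences(SAMPLE_BYTES.decode("euc_kr").splitlines())
print(corpus_counts(sample_sentences))
```

To process the real file, one would open it with `open(path, encoding="euc_kr")` and pass the file object to `read_sentences`. If the assumptions hold, `corpus_counts` should then report totals close to the 1,574 sentences, 41,024 words, and 77,173 morphemes stated above, which makes it a convenient sanity check on the parse.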
Provider:
Linguistic Data Consortium
Created:
2020-11-30