Korean Propbank

DataCite Commons, updated 2021-07-01, indexed 2025-04-16
Download link:
https://catalog.ldc.upenn.edu/LDC2006T03
Resource description:
<h3>Introduction</h3>
<p>Korean Propbank was developed by the Computer and Information Sciences Department at the University of Pennsylvania. It comprises approximately 33,300 annotated predicates in about 186,300 words of Korean text. The text used in Propbank comes from <a href="../../../LDC2002T26">Korean English Treebank Annotations (LDC2002T26)</a> and <a href="../../../LDC2006T09">Korean Treebank Version 2.0 (LDC2006T09)</a>. Each verb and adjective occurring in the Treebank has been treated as a semantic predicate, and the surrounding text has been annotated for the arguments and adjuncts of that predicate. The verbs and adjectives have also been tagged with coarse-grained senses.</p>
<h3>Data</h3>
<p>The table below breaks the corpus down by source, giving the size in thousands of words (K-words) and the rounded number of annotated predicates:</p>
<table style="margin-top: 30px; margin-bottom: 30px;" border="1" width="25%">
<tbody>
<tr>
<td>Source</td>
<td>K-words</td>
<td>Predicates Annotated (rounded)</td>
</tr>
<tr>
<td>Virginia Corpus</td>
<td>54.5</td>
<td>9,590</td>
</tr>
<tr>
<td>Newswire Corpus</td>
<td>131.8</td>
<td>23,700</td>
</tr>
<tr>
<td>Total</td>
<td>186.3</td>
<td>33,300</td>
</tr>
</tbody>
</table>
<p>There are two basic components to Korean Propbank:</p>
<ul>
<li><strong>The Verb Lexicon:</strong> A frames file, consisting of one or more frame sets, has been created for each predicate occurring in the Treebank. These files serve as a reference for the annotators and for users of the data. There are 2,749 such files, totaling about 10 MB of uncompressed data. The frames files are in XML and use the KS C 5601 character set encoding.</li>
<li><strong>The Annotation:</strong> There are two annotation files. The virginia-verbs.pb file has 9,588 annotated predicate tokens. These tokens include all predicates occurring in the 54.5 K-words of the Korean English Treebank Annotations and total about 791 KB of uncompressed data. The newswire-verbs.pb file has 23,707 annotated predicate tokens. These tokens include all predicates occurring in the 131.8 K-words of the Korean Treebank Version 2.0 and total about 2,054 KB of uncompressed data.</li>
</ul>
<h3>Samples</h3>
<p>For an example of this corpus, please view this <a href="desc/addenda/LDC2006T03.txt">sample (TXT)</a>.</p>
<h3>Updates</h3>
<p>None at this time.</p>
Portions © 2001-2002 CoGenTex, Inc., © 1994-2000 Korean Press Agency, © 1998-2006 Trustees of the University of Pennsylvania
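Because the frames files are XML in the KS C 5601 character set, a reader on a UTF-8 system may need to decode them explicitly before parsing. The following is a minimal sketch, not part of the corpus distribution. It assumes the files are stored in the EUC-KR byte encoding of KS C 5601 (Python's `euc_kr` codec). The function name `parse_ksc5601_xml` and the element and attribute names in the sample are hypothetical, since the description does not specify the frames file schema.

```python
import re
import xml.etree.ElementTree as ET

# Matches a leading XML declaration such as <?xml version="1.0" encoding="..."?>
_XML_DECL = re.compile(r"^\s*<\?xml[^>]*\?>")


def parse_ksc5601_xml(raw: bytes) -> ET.Element:
    """Decode KS C 5601 (assumed EUC-KR) bytes and parse them as XML."""
    # Decode first: the result is an ordinary Python str.
    text = raw.decode("euc_kr")
    # Drop the declaration, whose encoding name no longer describes
    # the already-decoded string.
    text = _XML_DECL.sub("", text, count=1)
    return ET.fromstring(text)


# Hypothetical stand-in for a frames file; the real schema may differ.
sample = (
    '<?xml version="1.0" encoding="euc-kr"?>'
    '<frameset><predicate lemma="가다"/></frameset>'
).encode("euc_kr")

root = parse_ksc5601_xml(sample)
```

For a file on disk, the same function could be applied to the bytes returned by `pathlib.Path(path).read_bytes()`, where `path` points at one of the frames files.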
Provider:
Linguistic Data Consortium
Created:
2020-11-30