
Tagged Chinese Gigaword

Source: DataCite Commons. Updated 2021-07-01; indexed 2025-04-16.
Download link:
https://catalog.ldc.upenn.edu/LDC2007T03
Resource description:
<h3>Introduction</h3>
<p>Tagged Chinese Gigaword, created by scholars at Academia Sinica, Taipei, Taiwan, is a part-of-speech tagged version of the LDC's Chinese Gigaword Second Edition (LDC2005T14). It contains all of the data in Chinese Gigaword Second Edition, drawn from Central News Agency (Taiwan), Xinhua News Agency and Lianhe Zaobao, annotated with full part-of-speech tags.</p>
<p>To avoid problems or confusion arising from differences in character-set specifications in the source data, all text files in this corpus have been converted to UTF-8 character encoding. With some exceptions described in the readme file, all characters in the text are either single-byte ASCII or multi-byte Chinese.</p>
<p>Documents (DOCs) from all sources have been categorized into four distinct "types":</p>
<ul>
<li> <strong>story</strong>: This type of DOC represents a coherent report on a particular topic or event, consisting of paragraphs and full sentences.</li>
<li> <strong>multi</strong>: This type of DOC contains a series of unrelated "blurbs," each of which briefly describes a particular topic or event; examples include "summaries of today's news," "news briefs in ..." (some general area like finance or sports), and so on.</li>
<li> <strong>advis</strong>: These are DOCs which the news service addresses to news editors; they are not intended for publication to the "end users."</li>
<li> <strong>other</strong>: These DOCs clearly do not fall into any of the above types; they include items such as lists of sports scores, stock prices, temperatures around the world, and so on.</li>
</ul>
<h3>Data</h3>
<p>The table below lists, for each source, the number of files, their compressed and uncompressed sizes, the number of words and the number of documents. #Files = number of files. Rzip-MB = compressed size in megabytes. Totl-MB = uncompressed size in megabytes. K-wrds = number of words in thousands. #DOCs = number of documents.</p>
<table>
<tr> <td>Source</td> <td>#Files</td> <td>Rzip-MB</td> <td>Totl-MB</td> <td>K-wrds</td> <td>#DOCs</td> </tr>
<tr> <td>CNA_CMN</td> <td>168</td> <td>994</td> <td>7363</td> <td>792195</td> <td>1769953</td> </tr>
<tr> <td>XIN_CMN</td> <td>168</td> <td>615</td> <td>4535</td> <td>471110</td> <td>992261</td> </tr>
<tr> <td>ZBN_CMN</td> <td>10</td> <td>40</td> <td>223</td> <td>28066</td> <td>41418</td> </tr>
<tr> <td>TOTAL</td> <td>346</td> <td>1648</td> <td>12121</td> <td>1291371</td> <td>2803632</td> </tr>
</table>
<p>The following tables give the number of documents ("#DOCs") and the number of words in thousands ("K-wrds"), broken down by source and DOC type:</p>
<table>
<tr> <td></td> <td>#DOCs</td> <td>K-wrds</td> </tr>
<tr> <td colspan="2">type="advis":</td> </tr>
<tr> <td>CNA_CMN</td> <td>8160</td> <td>751</td> </tr>
<tr> <td>XIN_CMN</td> <td>6553</td> <td>711</td> </tr>
<tr> <td>ZBN_CMN</td> <td>0</td> <td>0</td> </tr>
<tr> <td>TOTAL</td> <td>14713</td> <td>1462</td> </tr>
</table>
<table>
<tr> <td colspan="2">type="multi":</td> </tr>
<tr> <td>CNA_CMN</td> <td>30552</td> <td>23429</td> </tr>
<tr> <td>XIN_CMN</td> <td>11329</td> <td>7516</td> </tr>
<tr> <td>ZBN_CMN</td> <td>55</td> <td>41</td> </tr>
<tr> <td>TOTAL</td> <td>41936</td> <td>30986</td> </tr>
</table>
<table>
<tr> <td colspan="2">type="other":</td> </tr>
<tr> <td>CNA_CMN</td> <td>100758</td> <td>40258</td> </tr>
<tr> <td>XIN_CMN</td> <td>31255</td> <td>9999</td> </tr>
<tr> <td>ZBN_CMN</td> <td>279</td> <td>130</td> </tr>
<tr> <td>TOTAL</td> <td>132292</td> <td>50387</td> </tr>
</table>
<table>
<tr> <td colspan="2">type="story":</td> </tr>
<tr> <td>CNA_CMN</td> <td>1630483</td> <td>727748</td> </tr>
<tr> <td>XIN_CMN</td> <td>943132</td> <td>452878</td> </tr>
<tr> <td>ZBN_CMN</td> <td>41084</td> <td>27898</td> </tr>
<tr> <td>TOTAL</td> <td>2614691</td> <td>1208524</td> </tr>
</table>
<p>The CKIP segmentation and POS tagging system was evaluated in Bakeoff 2005 and Bakeoff 2006.</p>
<p>The test results are as follows:</p>
<table>
<tr> <td></td> <td>Doc#</td> <td>RefWord#</td> <td>TestWord#</td> <td>MatchWord#</td> <td>Recall (%)</td> <td>Precision (%)</td> <td>F-Score (%)</td> </tr>
<tr> <td>Bakeoff 2005</td> <td>190</td> <td>116509</td> <td>116443</td> <td>112091</td> <td>96.2</td> <td>96.3</td> <td>96.2</td> </tr>
<tr> <td>Bakeoff 2006</td> <td>148</td> <td>90405</td> <td>90327</td> <td>87332</td> <td>96.6</td> <td>96.7</td> <td>96.6</td> </tr>
</table>
<p>Note:</p>
<p>Recall = MatchWord# / RefWord#</p>
<p>Precision = MatchWord# / TestWord#</p>
<p>F-Score = 2 * Recall * Precision / (Recall + Precision)</p>
<h3>Samples</h3>
<p>For an example of the data contained in this corpus, please view <a href="./desc/addenda/LDC2007T03.jpg" rel="nofollow">this screen capture (jpg)</a> of the annotated text.</p>
<p>Portions © 2005-2007 Academia Sinica, © 1991-2004 Central News Agency (Taiwan), © 2000-2003 SPH AsiaOne, Ltd., © 1990-2004 Xinhua News Agency, © 2005, 2007 Trustees of the University of Pennsylvania</p>
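As a worked check of the Recall, Precision and F-Score definitions given in the note above, the following minimal Python sketch recomputes the three scores from the RefWord#, TestWord# and MatchWord# counts in the Bakeoff table. The names `SegmentationCounts`, `BAKEOFF_RUNS` and `score_table` are hypothetical and introduced only for illustration; they are not part of the corpus or of the CKIP system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SegmentationCounts:
    """Word counts for one evaluation run (hypothetical container type)."""
    ref_words: int    # RefWord#: words in the reference segmentation
    test_words: int   # TestWord#: words produced by the system under test
    match_words: int  # MatchWord#: words on which reference and system agree


def recall(c: SegmentationCounts) -> float:
    # Recall = MatchWord# / RefWord#
    return c.match_words / c.ref_words


def precision(c: SegmentationCounts) -> float:
    # Precision = MatchWord# / TestWord#
    return c.match_words / c.test_words


def f_score(c: SegmentationCounts) -> float:
    # F-Score = 2 * Recall * Precision / (Recall + Precision)
    r, p = recall(c), precision(c)
    return 2 * r * p / (r + p)


# Counts copied from the Bakeoff table in the description.
BAKEOFF_RUNS = {
    "Bakeoff 2005": SegmentationCounts(116509, 116443, 112091),
    "Bakeoff 2006": SegmentationCounts(90405, 90327, 87332),
}


def score_table(runs):
    """Return {run name: (recall %, precision %, F-score %)}, one decimal place."""
    return {
        name: tuple(round(100 * fn(c), 1) for fn in (recall, precision, f_score))
        for name, c in runs.items()
    }


if __name__ == "__main__":
    for name, (r, p, f) in score_table(BAKEOFF_RUNS).items():
        print(f"{name}: recall={r}%  precision={p}%  f-score={f}%")
```

Rounded to one decimal place, the recomputed values agree with the percentages listed in the table.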
Provider:
Linguistic Data Consortium
Created:
2020-11-30