Chinese Gigaword Fourth Edition
Source: DataCite Commons | Updated: 2021-07-01 | Indexed: 2025-04-16
Download link:
https://catalog.ldc.upenn.edu/LDC2009T27
Description:
<h3>Introduction</h3> <p>Chinese Gigaword Fourth Edition, Linguistic Data Consortium (LDC) catalog number LDC2009T27 and ISBN 1-58563-527-8, is a comprehensive archive of newswire text data that the LDC has acquired over several years. This edition includes all of the content of <a href="http://catalog.ldc.upenn.edu/LDC2007T38" rel="nofollow">Chinese Gigaword Third Edition (LDC2007T38)</a> as well as newly collected data. In addition, four entirely new sources have been added in the fourth edition: Central News Service, Guangming Daily, People's Liberation Army Daily, and People's Daily.</p> <p>The eight distinct international sources of Chinese newswire included in this edition are the following:</p><ul> <li>Agence France Presse (afp_cmn)</li> <li>Central News Agency, Taiwan (cna_cmn)</li> <li>Central News Service (cns_cmn)</li> <li>Guangming Daily (gmw_cmn)</li> <li>People's Daily (pda_cmn)</li> <li>People's Liberation Army Daily (pla_cmn)</li> <li>Xinhua News Agency (xin_cmn)</li> <li>Zaobao Newspaper (zbn_cmn)</li> </ul><p>The seven-character codes in parentheses above are used in the directory names and data file names for each source, and are also used (in ALL_CAPS) as part of the unique DOC id string assigned to each news article.</p> <h3>Data</h3> <p>The original data received by the LDC from AFP, People's Liberation Army Daily, Xinhua, and Zaobao were encoded in GB-2312, those from CNA were in Big-5, and those from Guangming Daily, Central News Service, and People's Daily were in a combination of GB-2312 and GB-18030. To avoid the problems and confusion that could result from differences in character-set specifications, all text files in this corpus have been converted to UTF-8 character encoding.
</p> <h3>New in the Fourth Edition</h3> <ul> <li>Two years' worth of new articles (January 2007 through December 2008) have been added to the Xinhua, Agence France Presse, and CNA data sets.</li> <li>Four new data sources have been added: Guangming Daily, Central News Service, People's Daily, and People's Liberation Army Daily, covering a timespan from November 2006 through December 2008.</li> </ul><h3>Samples</h3> <p>Please view this <a href="./desc/addenda/LDC2009T27.jpg" rel="nofollow">sample</a>.</p> <h3>Sponsorship</h3> <p>This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.</p> <br>
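The encoding conversion and source codes described under Data can be illustrated with a minimal Python sketch. This is not the LDC's actual conversion pipeline, only an illustration under stated assumptions: the per-source encoding table is read off the description, `gb18030` is assumed for the three sources described as "a combination of GB-2312 and GB-18030" (GB 18030 is backward compatible with GB 2312), and the helper names `to_utf8` and `source_of` are hypothetical. The exact layout of the DOC id string is not given in the description, so `source_of` only checks whether the upper-cased code appears somewhere in the id.

```python
# Original encodings per source code, as stated in the corpus description.
# Assumption: gb18030 is used for cns_cmn, gmw_cmn and pda_cmn, whose data
# is described as "a combination of GB-2312 and GB-18030".
ORIGINAL_ENCODING = {
    "afp_cmn": "gb2312",
    "cna_cmn": "big5",
    "cns_cmn": "gb18030",
    "gmw_cmn": "gb18030",
    "pda_cmn": "gb18030",
    "pla_cmn": "gb2312",
    "xin_cmn": "gb2312",
    "zbn_cmn": "gb2312",
}


def to_utf8(raw: bytes, source: str) -> bytes:
    """Transcode raw bytes from a source's legacy encoding to UTF-8."""
    text = raw.decode(ORIGINAL_ENCODING[source])
    return text.encode("utf-8")


def source_of(doc_id: str):
    """Return the source code whose ALL_CAPS form appears in doc_id.

    Returns None when no known source code is found.
    """
    for code in ORIGINAL_ENCODING:
        if code.upper() in doc_id:
            return code
    return None
```

Since the released corpus is already UTF-8, a transcoding step like this would only matter when comparing against data obtained from the original providers in their native encodings.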
Portions © 2000-2008 Agence France Presse, © 1991-2008 Central News Agency (Taiwan), © 2006-2008 China Military Online, © 2006-2008 Chinanews.com, © 2006-2008 Guangming Daily, © 2006-2008 People's Daily, © 1998, 2000-2003 SPH AsiaOne, Ltd., © 1990-2008 Xinhua News Agency, © 2003, 2005, 2007, 2009 Trustees of the University of Pennsylvania
Provider:
Linguistic Data Consortium
Created:
2020-11-30



