English Gigaword Second Edition
Source: DataCite Commons. Updated 2021-07-01; indexed 2025-04-16.
Download link:
https://catalog.ldc.upenn.edu/LDC2005T12
Resource description:
<h3>Introduction</h3><br>
<p>English Gigaword Second Edition was produced by the Linguistic Data Consortium (LDC), catalog number LDC2005T12, ISBN 1-58563-350-X. The English Gigaword corpus is a comprehensive archive of English newswire text that the LDC has acquired over several years. This is the second edition of the corpus.</p><br>
<p>This edition includes all of the contents of the first edition of the English Gigaword corpus (LDC2003T05) as well as new data from July 2002 through December 2004. A new newswire source, the Central News Agency of Taiwan, English Service, has also been added in this edition.</p><br>
<p>The five distinct international sources of English newswire included in this release are the following:</p><br>
<table><br>
<tbody><br>
<tr><br>
<td colspan="60%">Agence France-Presse, English Service</td><br>
<td colspan="20%">(afp_eng)</td><br>
</tr><br>
<tr><br>
<td colspan="60%">Associated Press Worldstream, English Service</td><br>
<td colspan="20%">(apw_eng)</td><br>
</tr><br>
<tr><br>
<td colspan="60%">Central News Agency of Taiwan, English Service</td><br>
<td colspan="20%">(cna_eng)</td><br>
</tr><br>
<tr><br>
<td colspan="60%">The New York Times Newswire Service</td><br>
<td colspan="20%">(nyt_eng)</td><br>
</tr><br>
<tr><br>
<td colspan="60%">The Xinhua News Agency, English Service</td><br>
<td colspan="20%">(xin_eng)</td><br>
</tr><br>
</tbody><br>
</table><br>
<h3>What's New In The Second Edition</h3><br>
<ul><br>
<li>New newswire data from July 2002 through December 2004 has been added for all four newswire sources that were represented in the first edition.</li><br>
<li>A new source, the Central News Agency of Taiwan English Service (CNA_ENG), has been added.</li><br>
<li>We have adopted a new naming scheme for filenames and DOC IDs. The new scheme represents each source name as a three-letter code and the language name as a three-letter code.</li><br>
<li>Minor formatting improvements (mostly line-wrapping) have been made to some of the data originally published in the first edition.</li><br>
</ul><br>
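<p>The three-letter source and language codes described in the list above can be illustrated with a minimal Python sketch. It handles only the five source identifiers listed in this description (such as afp_eng); the underscore-separated pattern is inferred from those identifiers, and the full filename and DOC ID layouts are not specified here, so the sketch does not attempt to parse them. The names SOURCE_IDS, ID_PATTERN, and split_source_id are hypothetical helpers, not part of any LDC tooling.</p><br>

```python
import re

# Source identifiers exactly as listed in this description.
SOURCE_IDS = ["afp_eng", "apw_eng", "cna_eng", "nyt_eng", "xin_eng"]

# Assumed pattern: three letters for the source, an underscore,
# then three letters for the language.
ID_PATTERN = re.compile(r"^([a-z]{3})_([a-z]{3})$")


def split_source_id(source_id):
    """Return (source_code, language_code) for an identifier such as 'afp_eng'."""
    # Normalize case so that 'CNA_ENG' and 'cna_eng' are treated alike.
    match = ID_PATTERN.match(source_id.strip().lower())
    if match is None:
        raise ValueError(f"not a source/language identifier: {source_id!r}")
    return match.group(1), match.group(2)


if __name__ == "__main__":
    for sid in SOURCE_IDS:
        print(sid, split_source_id(sid))
```

<p>Lowercasing before matching is a design choice made here so that the upper-case form used in the list (CNA_ENG) and the lower-case form used in the table (cna_eng) resolve to the same pair of codes.</p><br>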
Portions © 1994-1997 and 2001-2004 Agence France-Presse, © 1994-2004 Associated Press, © 1997-2004 Central News Agency of Taiwan, © 1994-2004 New York Times, © 1995-2004 Xinhua News Agency, © 2005 Trustees of the University of Pennsylvania
Provider:
Linguistic Data Consortium
Created:
2020-11-30



