
Spanish Gigaword Second Edition

DataCite Commons, updated 2021-07-01, indexed 2025-04-16
Download link:
https://catalog.ldc.upenn.edu/LDC2009T21
Resource description:
<h3>Introduction</h3>
<p>Spanish Gigaword Second Edition is a comprehensive archive of newswire text data that has been acquired over several years by LDC. This second edition updates <a href="http://catalog.ldc.upenn.edu/LDC2006T12" rel="nofollow">Spanish Gigaword First Edition (LDC2006T12)</a> and adds data collected from January 1, 2006 through December 31, 2008.</p>
<p>The three distinct international sources of Spanish newswire in this edition, and the time spans of collection covered for each, are as follows:</p>
<ul>
<li>Agence France-Presse, Spanish Service (afp_spa) May 1994 - Dec 2008</li>
<li>Associated Press Worldstream, Spanish (apw_spa) Nov 1993 - Dec 2008</li>
<li>Xinhua News Agency, Spanish Service (xin_spa) Sep 2001 - Dec 2008</li>
</ul>
<p>The seven-character codes in the parentheses above consist of a three-character source name abbreviation and the three-character language code (spa), separated by an underscore (_) character. The three-letter language code conforms to LDC's internal convention based on the ISO 639-3 standard. These codes are used in the directory names where the data files are found and in the prefix that appears at the beginning of every data file name. They are also used (in all UPPER CASE) as the initial portion of the DOC id strings that uniquely identify each news story.</p>
<h3>Data</h3>
<p>The overall totals for each source are summarized below. The Totl-MB numbers show the amount of data obtained when the files are uncompressed (i.e., approximately 7 gigabytes in total). The Gzip-MB column shows totals for the compressed file sizes. The K-wrds numbers are simply the number, in thousands, of whitespace-separated tokens (of all types) after all SGML tags are eliminated.</p>
<table>
<tbody>
<tr><td>Source</td><td>#Files</td><td>Gzip-MB</td><td>Totl-MB</td><td>K-wrds</td><td>#DOCs</td></tr>
<tr><td>AFP_SPA</td><td>175</td><td>1182</td><td>3512</td><td>506562</td><td>1748787</td></tr>
<tr><td>APW_SPA</td><td>180</td><td>886</td><td>2721</td><td>402718</td><td>1244811</td></tr>
<tr><td>XIN_SPA</td><td>88</td><td>405</td><td>1238</td><td>182543</td><td>734356</td></tr>
<tr><td>TOTAL</td><td>443</td><td>2453</td><td>7471</td><td>1091823</td><td>3727954</td></tr>
</tbody>
</table>
<p>The following table presents Text-MB, K-wrds and #DOCs broken down by source and DOC type. Text-MB represents the total number of characters (including whitespace) after SGML tags are eliminated.</p>
<table>
<tbody>
<tr><td>Source</td><td>Text-MB</td><td>K-wrds</td><td>#DOCs</td></tr>
<tr><td colspan="4">type=advis:</td></tr>
<tr><td>AFP_SPA</td><td>144</td><td>20520</td><td>45446</td></tr>
<tr><td>APW_SPA</td><td>41</td><td>6173</td><td>11112</td></tr>
<tr><td>XIN_SPA</td><td>0</td><td>0</td><td>0</td></tr>
<tr><td>TOTAL</td><td>185</td><td>26693</td><td>56558</td></tr>
<tr><td colspan="4">type=multi:</td></tr>
<tr><td>AFP_SPA</td><td>84</td><td>12711</td><td>15346</td></tr>
<tr><td>APW_SPA</td><td>351</td><td>55758</td><td>107224</td></tr>
<tr><td>XIN_SPA</td><td>189</td><td>29970</td><td>56372</td></tr>
<tr><td>TOTAL</td><td>624</td><td>98439</td><td>178942</td></tr>
<tr><td colspan="4">type=other:</td></tr>
<tr><td>AFP_SPA</td><td>275</td><td>38665</td><td>160815</td></tr>
<tr><td>APW_SPA</td><td>296</td><td>40517</td><td>162448</td></tr>
<tr><td>XIN_SPA</td><td>44</td><td>6376</td><td>50168</td></tr>
<tr><td>TOTAL</td><td>615</td><td>85558</td><td>373431</td></tr>
<tr><td colspan="4">type=story:</td></tr>
<tr><td>AFP_SPA</td><td>2771</td><td>434677</td><td>1527180</td></tr>
<tr><td>APW_SPA</td><td>1875</td><td>300274</td><td>964027</td></tr>
<tr><td>XIN_SPA</td><td>911</td><td>146199</td><td>627816</td></tr>
<tr><td>TOTAL</td><td>5557</td><td>881150</td><td>3119023</td></tr>
</tbody>
</table>
<h3>Samples</h3>
<p>Please view this <a href="desc/addenda/LDC2009T21_full.jpg" rel="nofollow">sample</a>.</p>
<p>Portions © 1994-2008 Agence France Presse, © 1993-2008 The Associated Press, © 2001-2008 Xinhua News Agency, © 2006, 2009 Trustees of the University of Pennsylvania</p>
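The description above defines the word counts as whitespace-separated tokens after SGML tags are removed, and the text sizes as character counts (including whitespace) after the same removal. The following is a minimal Python sketch of those two definitions, plus a helper that splits a source code such as `afp_spa` into its source and language parts. It is not LDC's actual tooling: the function names, the regular expression used to strip tags, and the sample fragment are all assumptions made for illustration.

```python
import re

# Assumption: anything between "<" and ">" is an SGML tag. Real corpus
# files may need a more careful parser (e.g. for entities such as &amp;).
TAG_RE = re.compile(r"<[^>]+>")


def strip_sgml(text: str) -> str:
    """Remove SGML tags, leaving text and whitespace untouched."""
    return TAG_RE.sub("", text)


def count_tokens(text: str) -> int:
    """Count whitespace-separated tokens after tags are removed."""
    return len(strip_sgml(text).split())


def count_chars(text: str) -> int:
    """Count characters (including whitespace) after tags are removed."""
    return len(strip_sgml(text))


def split_source_code(code: str):
    """Split a code like 'afp_spa' or 'AFP_SPA' into (source, language)."""
    source, language = code.lower().split("_")
    return source, language


# Hypothetical fragment for illustration only; it is not taken from the corpus.
sample = "<DOC>\n<TEXT>\n<P>\nHola mundo.\n</P>\n</TEXT>\n</DOC>\n"

if __name__ == "__main__":
    print(count_tokens(sample))
    print(count_chars(sample))
    print(split_source_code("AFP_SPA"))
```

The helpers return raw counts for a single string. Reproducing the per-source figures would additionally require iterating over the decompressed files and scaling the results to the units used in the tables.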
Provider:
Linguistic Data Consortium
Created:
2020-11-30