
English Gigaword

Source: Mendeley Data. Updated 2024-01-31; indexed 2024-06-27.
Download link:
https://catalog.ldc.upenn.edu/LDC2003T05
Resource description:
## Introduction

English Gigaword was produced by the Linguistic Data Consortium (LDC), catalog number LDC2003T05 and ISBN 1-58563-260-0, and is distributed on DVD. It is a comprehensive archive of English newswire text that the LDC acquired over several years. Four distinct international sources of English newswire are represented:

- Agence France-Presse English Service (afe)
- Associated Press Worldstream English Service (apw)
- The New York Times Newswire Service (nyt)
- The Xinhua News Agency English Service (xie)

## Data

Much of the content in this collection has been published previously by the LDC in a variety of other, older corpora, particularly the North American News text corpora (LDC95T21, LDC98T30), the various TDT corpora and the AQUAINT text corpus (LDC2002T31). A significant amount of material is nevertheless released here for the first time: all of the Agence France-Presse content, the 1995 and 2001 Xinhua content, and the portions of NYT and APW dating from February 2001 forward.

Each data file name consists of the three-letter source prefix, followed by a six-digit date (representing the year and month during which the file contents were delivered by the respective news source), followed by a ".gz" file extension, indicating that the file contents have been compressed using the GNU "gzip" compression utility (RFC 1952). Each file therefore contains all the usable data received by the LDC for the given month from the given news source.

All text data are presented in SGML form, using a very simple, minimal markup structure; all text consists of printable ASCII and whitespace. The corpus has been fully validated by a standard SGML parser utility (nsgmls), using a DTD file which is provided as part of this publication. A sample file is available via the LDC catalog page.
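The file-naming convention described above can be checked mechanically. The following is a minimal sketch, not part of the LDC release: it assumes the six-digit date is ordered as four-digit year then two-digit month, that prefixes are lowercase, and it uses a made-up file name (`nyt200102.gz`) for illustration. The function name `parse_gigaword_name` is hypothetical.

```python
import re

# Three-letter source prefixes listed in the corpus description.
SOURCES = {
    "afe": "Agence France-Presse English Service",
    "apw": "Associated Press Worldstream English Service",
    "nyt": "The New York Times Newswire Service",
    "xie": "The Xinhua News Agency English Service",
}

# Assumption: the six-digit date is YYYYMM (the description only says
# "year and month"), and the prefix is lowercase.
_NAME_RE = re.compile(r"^([a-z]{3})(\d{4})(\d{2})\.gz$")


def parse_gigaword_name(filename):
    """Split a data file name into (prefix, year, month).

    Raises ValueError if the name does not follow the assumed pattern.
    """
    match = _NAME_RE.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename!r}")
    prefix = match.group(1)
    year = int(match.group(2))
    month = int(match.group(3))
    if prefix not in SOURCES:
        raise ValueError(f"unknown source prefix: {prefix!r}")
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range: {month}")
    return prefix, year, month


if __name__ == "__main__":
    # Hypothetical file name, used only to exercise the function.
    print(parse_gigaword_name("nyt200102.gz"))
```

Rejecting unknown prefixes and out-of-range months early makes it easier to spot files that do not belong to this release before any decompression is attempted.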
The markup structure, common to all data files, can be summarized as follows:

- The headline element is optional: not all DOCs have one.
- The dateline element is optional: not all DOCs have one.
- Paragraph tags are used only if the "type" attribute of the DOC is "story".

Note that all data files use the UNIX-standard "\n" form of line termination, and text lines are generally wrapped to a width of 80 characters or less.

For this release, all sources have received uniform treatment in terms of quality control, and we have applied a rudimentary (and _approximate_) categorization of DOC units into four distinct "types." The classification is indicated by the `type="string"` attribute that is included in each opening DOC tag. The four types are: story, multi, advis and other.

Statistics regarding the quantities of data for each source are summarized below. The "Totl-MB" numbers show the amount of data you get when the files are not compressed (nearly 12 gigabytes in total); the "Gzip-MB" column shows totals for compressed file sizes as stored on the DVD-ROM; the "K-wrds" numbers are the number of whitespace-separated tokens (of all types), in thousands, after all SGML tags are eliminated.

| Source | #Files | Gzip-MB | Totl-MB | K-wrds | #DOCs |
| ------ | ------ | ------- | ------- | ------ | ------- |
| AFE | 44 | 417 | 1216 | 170969 | 656269 |
| APW | 91 | 1213 | 3647 | 539665 | 1477466 |
| NYT | 96 | 2104 | 5906 | 914159 | 1298498 |
| XIE | 83 | 320 | 940 | 131711 | 679007 |
| TOTAL | 314 | 4054 | 11709 | 1756504 | 4111240 |

## Updates

There are no updates available at this time.

Portions © 1994-1997 and 2001-2002 Agence France-Presse, © 1994-2002 Associated Press, © 1994-2002 New York Times, © 1995-2001 Xinhua News Agency, © 2002 Trustees of the University of Pennsylvania
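The DOC typing and the "K-wrds" counting rule (whitespace-separated tokens after all SGML tags are removed) can be illustrated with a short sketch. This is not the LDC's own tooling. The code relies only on the `DOC` element and its `type` attribute, which the description confirms; the sample text, the document ids, and the element names `HEADLINE`, `TEXT`, and `P` in the sample are assumptions made up for illustration, as are the function names.

```python
import re

# Matches one DOC unit; the attribute string is captured separately so the
# code does not depend on attribute order.
DOC_RE = re.compile(r"<DOC\b([^>]*)>(.*?)</DOC>", re.S)
TYPE_RE = re.compile(r'type="([^"]*)"')
TAG_RE = re.compile(r"<[^>]+>")


def iter_docs(sgml_text):
    """Yield (doc_type, body) for each DOC unit found in the text."""
    for match in DOC_RE.finditer(sgml_text):
        type_match = TYPE_RE.search(match.group(1))
        doc_type = type_match.group(1) if type_match else None
        yield doc_type, match.group(2)


def count_tokens(body):
    """Count whitespace-separated tokens after removing all SGML tags."""
    return len(TAG_RE.sub(" ", body).split())


# Made-up sample; ids, tag names, and text are illustrative assumptions.
SAMPLE = """<DOC id="XIE19950101.0001" type="story">
<HEADLINE>
Example headline
</HEADLINE>
<TEXT>
<P>
First paragraph of text.
</P>
</TEXT>
</DOC>
<DOC id="XIE19950101.0002" type="advis">
<TEXT>
Editors note here.
</TEXT>
</DOC>
"""

if __name__ == "__main__":
    for doc_type, body in iter_docs(SAMPLE):
        print(doc_type, count_tokens(body))
```

For a real file, the text could first be read with `gzip.open(path, "rt", encoding="ascii")`. Matching a whole file with one regular expression keeps the sketch short but holds the full uncompressed month in memory, so a line-by-line reader may be preferable for the larger sources.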

Created:
2024-01-31
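Background overview
English Gigaword is a corpus of roughly 1.8 billion words of English newswire text drawn from four international news agencies, suited to tasks such as information retrieval and natural language processing. The data is stored in SGML format and compressed with gzip.
The summary above was collected and generated by 遇见数据集.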
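The rounded word count in this summary can be cross-checked against the per-source statistics table in the resource description. The sketch below copies the figures from that table, confirms that the per-source columns add up to the TOTAL row, and converts the "K-wrds" total (thousands of tokens) into a plain token count. The variable names are illustrative only.

```python
# Columns: #Files, Gzip-MB, Totl-MB, K-wrds, #DOCs (copied from the table).
STATS = {
    "AFE": (44, 417, 1216, 170969, 656269),
    "APW": (91, 1213, 3647, 539665, 1477466),
    "NYT": (96, 2104, 5906, 914159, 1298498),
    "XIE": (83, 320, 940, 131711, 679007),
}
TOTAL = (314, 4054, 11709, 1756504, 4111240)

# Sum each column across the four sources.
column_sums = tuple(sum(column) for column in zip(*STATS.values()))

# K-wrds is expressed in thousands of whitespace-separated tokens.
total_tokens = column_sums[3] * 1000

if __name__ == "__main__":
    print("columns match TOTAL row:", column_sums == TOTAL)
    print("total tokens:", total_tokens)
    print("in billions:", round(total_tokens / 1e9, 2))
```

The token total comes to about 1.76 billion, which is consistent with the "roughly 1.8 billion words" quoted in the overview.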