
TDT Pilot Study Corpus

DataCite Commons | Updated 2021-07-01 | Indexed 2024-07-13
Download link:
https://catalog.ldc.upenn.edu/LDC98T25
Resource description:

Introduction

The TDT Pilot Study corpus was created to support an initiative in "topic detection and tracking." This initiative is directed toward computer processing of language data, both text and speech. The objective is to explore techniques for detecting the appearance of new and unexpected topics and for tracking their reappearance and evolution.

Data

The TDT corpus comprises a set of stories that includes both newswire (text) and broadcast news (speech). Each story is represented as a stream of text, taken either directly from the newswire (Reuters) or from a manual transcription of the broadcast news speech (CNN). The corpus spans the period from July 1, 1994 to June 30, 1995. It contains approximately 16,000 stories, about half from Reuters newswire and half from CNN broadcast news transcripts.

A key part of the corpus is its annotation in terms of the events discussed in the stories. Twenty-five events were defined that span a variety of event types and cover a subset of the events discussed in the corpus stories. Annotation data for these events are included in the corpus and provide a basis for training TDT systems.

Updates

There are no updates at this time.

Portions © 1994-1995 Cable News Network, LP, LLLP, © 1994-1995 Reuters America, Inc., © 1998 Trustees of the University of Pennsylvania
Provider:
Linguistic Data Consortium
Created:
2020-11-30
Collected summary
Dataset introduction
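Background and challenges
Background overview
The TDT Pilot Study Corpus is a dataset for research on topic detection and tracking. It contains approximately 16,000 news stories from Reuters and CNN covering 1994-1995, is annotated with 25 key events, and supports the training of TDT systems.
The summary above was collected and generated by 遇见数据集.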