English Web Treebank
Source: DataCite Commons. Updated 2021-07-01; indexed 2025-04-16.
Download link:
https://catalog.ldc.upenn.edu/LDC2012T13
Resource description:
<h3>Introduction</h3><br>
<p>English Web Treebank was developed by the Linguistic Data Consortium (LDC) with funding through a gift from Google Inc. It consists of over 250,000 words of English weblogs, newsgroups, email, reviews and question-answers manually annotated for syntactic structure and is designed to allow language technology researchers to develop and evaluate the robustness of parsing methods in those web domains.</p><br>
<h3>Data</h3><br>
<p>This release contains 254,830 word-level tokens and 16,624 sentence-level tokens of webtext in 1,174 files annotated for sentence- and word-level tokenization, part-of-speech, and syntactic structure. The data is roughly evenly divided across five genres: weblogs, newsgroups, email, reviews, and question-answers. The files were manually annotated following the sentence-level tokenization guidelines for web text and the word-level tokenization guidelines developed for English treebanks in the <a href="http://projects.ldc.upenn.edu/gale/index.html" rel="nofollow">DARPA GALE</a> project. Only text from the subject line and message body of posts, articles, messages and question-answers was collected and annotated.</p><br>
<p>Weblogs are interactive web sites that display content as discrete entries or posts and allow viewers to comment on entries and engage in discussions. They are typically managed by individuals and use informal or colloquial language. The weblog data in this release was collected by LDC and covers the period 2003-2006.</p><br>
<p>Newsgroups are repositories of online discussions pertaining to a topic or interest area. They consist of threads that in turn contain articles with comments and discussion from group users. The newsgroup data in this release was collected by LDC and covers the period 2003-2006.</p><br>
<p>Emails are messages sent to discrete individuals or well-defined groups via the Simple Mail Transfer Protocol (SMTP), which runs over TCP/IP. The email messages in this corpus are a subset of emails sent by Enron Corporation employees during the period 1999-2002. Specifically, those messages are contained in the <a href="http://verbs.colorado.edu/enronsent/" rel="nofollow">Enronsent Corpus</a>, a collection of 96,107 email messages from the sent folders of Enron email users, which were processed to remove any content not generated by human users.</p><br>
<p>The reviews in this corpus were written by individuals and gleaned from online reviews of businesses and services on various Google web sites. This information was provided to LDC by Google in 2011; the dates of individual reviews are not available.</p><br>
<p>Question-answers are posts from Yahoo!'s community-driven question-answering web site, <a href="http://answers.yahoo.com/" rel="nofollow">Yahoo! Answers</a>, where individuals submit and answer questions which may be on any topic. This data was collected in 2011; the dates of individual question-answers were not collected.</p><br>
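<p>As a rough sanity check on the figures quoted in this Data section, the following minimal Python sketch derives a few averages from the stated token, sentence, and file counts. The function name <code>corpus_averages</code> and the equal-share-per-genre assumption are illustrative only; none of the derived values are published by LDC.</p><br>

```python
# Figures quoted in the Data section of this description.
WORD_TOKENS = 254_830
SENTENCES = 16_624
FILES = 1_174
GENRES = ["weblogs", "newsgroups", "email", "reviews", "question-answers"]


def corpus_averages(words: int, sentences: int, files: int, n_genres: int) -> dict:
    """Return simple derived averages (hypothetical helper, estimates only)."""
    return {
        "words_per_sentence": words / sentences,
        "words_per_file": words / files,
        "sentences_per_file": sentences / files,
        # Assumption: "roughly evenly divided" is treated as an equal share per genre.
        "approx_words_per_genre": words / n_genres,
    }


stats = corpus_averages(WORD_TOKENS, SENTENCES, FILES, len(GENRES))
for name, value in stats.items():
    print(f"{name}: {value:.1f}")
```

<p>Under these assumptions the corpus averages roughly 15 words per sentence and about 51,000 words per genre; the actual per-genre counts may differ, since the description only says the split is roughly even.</p><br>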
<h3>Samples</h3><br>
Portions © 2012 Google Inc., © 2011 Yahoo! Inc., © 2012 Trustees of the University of Pennsylvania
Provider:
Linguistic Data Consortium
Created:
2020-11-30
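Collected summary
Dataset introduction

Background and challenges
Background overview
English Web Treebank is an English web-text treebank containing more than 250,000 words and 16,000 sentences across five types of web text: weblogs, newsgroups, email, reviews, and question-answers. Released by the Linguistic Data Consortium in 2012, it is manually annotated for syntactic and lexical structure. It is intended to support language technology research, in particular evaluating the robustness of parsing methods, and is suited to applications such as part-of-speech tagging, parsing, and question answering.
The above content was collected and summarized by 遇见数据集.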



