
Manually Annotated Sub-Corpus First Release

Source: DataCite Commons. Updated 2021-07-01. Indexed 2025-04-16.
Download link:
https://catalog.ldc.upenn.edu/LDC2010T22
Resource description:
<h3>Introduction</h3><br> <p>The Manually Annotated Sub-Corpus First Release (MASC I), Linguistic Data Consortium (LDC) catalog number LDC2010T22 and ISBN 1-58563-569-3, is the first of three releases of 500,000 words of MASC data developed as part of the <a href="http://www.americannationalcorpus.org/" rel="nofollow">American National Corpus</a> (ANC) project. MASC I consists of approximately 80,000 words of contemporary spoken and written American English annotated for a variety of linguistic phenomena. The <a href="http://www.americannationalcorpus.org/MASC/Home.html" rel="nofollow">MASC</a> project is sponsored by the National Science Foundation and was established to address, to the extent possible, many of the obstacles to the creation of large-scale, robust, multiply-annotated corpora of English covering a wide range of genres of written and spoken language data. Researchers from Vassar College, Columbia University and the International Computer Science Institute, University of California at Berkeley, are the principal participants; the <a href="http://wordnet.princeton.edu/" rel="nofollow">WordNet</a> project provides consulting.</p><br> <p>The source texts in MASC I are drawn from the open portion of the <a href="http://catalog.ldc.upenn.edu/LDC2005T35" rel="nofollow">American National Corpus (ANC) Second Release LDC2005T35</a>, which includes written texts and spoken transcripts of American English from a broad range of genres produced since 1990, and from the <a href="http://catalog.ldc.upenn.edu/LDC2009T10">Language Understanding Annotation Corpus LDC2009T10</a> (LU Corpus), a collection of various genres including broadcast, newswire, email and telephone speech annotated for committed belief, event and entity coreference, dialog acts and temporal relations.
All of the data in MASC I has validated annotations for tokens, parts of speech, sentence boundaries, noun chunks, verb chunks, named entities and <a href="http://www.cis.upenn.edu/~treebank/" rel="nofollow">Penn Treebank</a> syntax. Full-text <a href="http://framenet.icsi.berkeley.edu/" rel="nofollow">FrameNet</a> annotations are available for seventeen texts, and WordNet word sense annotations are available for 1,000 occurrences of each of fifty-three words. Annotations of all or portions of the sub-corpus for a wide variety of other linguistic phenomena have been contributed by other projects. Software and services available from the <a href="http://www.anc.org/MASC/Home.html" rel="nofollow">ANC project website</a> enable transduction of MASC into a wide variety of physical formats.</p><br> <h3>Data</h3><br> <p>The MASC directory contains two folders: masc-1.0.3 and masc_wordsense. masc-1.0.3 contains the actual MASC corpus and consists of two folders, spoken and written. The spoken folder contains data and annotations for spoken material, and the written folder contains the same for written texts. The files in each folder follow naming conventions that describe the contents of the file.</p><br> <p>masc_wordsense contains the MASC sentence samples with word sense annotations, using WordNet sense numbers as the annotation values.</p><br> <h3>Updates</h3><br> <p>Additional information, updates, and bug fixes may be available in the LDC catalog entry for this corpus at <a href="http://catalog.ldc.upenn.edu/LDC2010T22" rel="nofollow">LDC2010T22</a>.</p><br> <h3>Samples</h3><br> <p>Contact: <a rel="nofollow"> <strong>ldc@ldc.upenn.edu</strong> </a> &copy; 2010 <a href="http://www.ldc.upenn.edu" rel="nofollow"> <strong>Linguistic Data Consortium</strong> </a>, <a href="http://www.upenn.edu" rel="nofollow"> <strong>Trustees of the University of Pennsylvania</strong> </a>.
All Rights Reserved.</p><br> Portions © 2000 The Associated Press, © 1987-1989 Dow Jones &amp; Company, Inc., © 2000 New York Times, © 1997-2002, 2010 Trustees of the University of Pennsylvania
Provider:
Linguistic Data Consortium
Created:
2020-11-30