
Phrase Detectives Corpus Version 2

Source: DataCite Commons · Updated 2021-07-01 · Indexed 2025-04-16
Download link:
https://catalog.ldc.upenn.edu/LDC2019T10
Description:
<h3>Introduction</h3><br> <p>Phrase Detectives Corpus Version 2 was developed by the <a href="https://www.essex.ac.uk/csee/">School of Computer Science and Electronic Engineering at the University of Essex</a> and consists of approximately 407,000 tokens across 537 documents, annotated for anaphora through the <a href="https://anawiki.essex.ac.uk/phrasedetectives">Phrase Detectives Game</a>, an online interactive "game-with-a-purpose" (GWAP) designed to collect data about English anaphoric coreference. This release is a new version of the Phrase Detectives Corpus (<a href="../../../LDC2017T08">LDC2017T08</a>). It adds significantly more annotated tokens to the data set and supplies, for each markable, a substantial number of player judgments and a silver label derived with a probabilistic aggregation method for anaphoric information.</p><br> <p>The use of GWAPs for creating language resources is growing. In general, they employ non-monetary incentives, such as entertainment, to motivate participation and can be successful for large-scale persistent annotation efforts. 
Two projects that collect linguistic resources via Phrase Detectives and other similar language-oriented GWAPs are <a href="http://dali.eecs.qmul.ac.uk/">DALI</a> (Disagreements and Language Interpretation), led by Queen Mary University of London and the University of Essex, and the LDC <a href="https://www.ldc.upenn.edu/collaborations/current-projects/nieuw">NIEUW</a> (Novel Incentives and Workflows in Linguistic Data Annotation) project through its game site <a href="https://lingoboingo.org/">Lingo Boingo</a>, in collaboration with Queen Mary University of London, the University of Essex and other partners.</p><br> <h3>Data</h3><br> <p>The documents in the corpus are taken from <a href="https://www.wikipedia.org/">Wikipedia</a> articles and from narrative text in <a href="https://www.gutenberg.org/">Project Gutenberg</a>.</p><br> <p>The annotation is a simplified form of the coding scheme used in The ARRAU Corpus of Anaphoric Information (<a href="../../../LDC2013T22">LDC2013T22</a>). Players were asked to classify markables as <em>referring</em> or <em>non-referring</em>. Referring noun phrases could be classified as either <em>discourse-new</em> or <em>discourse-old</em> (referring to the same entity as a previous mention). Two types of non-referring expressions are identified: <em>expletives</em> and <em>predicative NPs</em> (called 'properties'). <em>Discourse-old</em> markables include so-called split antecedent plurals, as in <em>Mary met John. They had dinner together</em>.</p><br> <p>All player judgments are stored in MAS-XML format, averaging 20 judgments per markable, with up to 90 judgments in one case. A silver label extracted from those judgments using the MPA probabilistic annotation method (Paun et al., 2018) is also provided.</p><br> <p>Wikipedia articles are presented as HTML, and all other source files are presented as plain text. 
All text is encoded as UTF-8.</p><br> <p>Annotations are released in three formats: (1) MAS-XML (the format in the first release), (2) a CoNLL-style format based on the CoNLL <a href="http://conll.cemantix.org/2011/introduction.html">2011</a> and <a href="http://conll.cemantix.org/2012/introduction.html">2012</a> shared tasks on coreference and (3) <a href="http://anawiki.essex.ac.uk/dali/crac18/crac18_shared_task.html">CRAC 2018</a> format.</p><br> <h3>Samples</h3><br> <p>Please view the following samples:</p><br> <ul><br> <li><a href="desc/addenda/LDC2019T10.src.txt">Source</a></li><br> <li><a href="desc/addenda/LDC2019T10.conll.txt">CoNLL</a></li><br> <li><a href="desc/addenda/LDC2019T10.crac.txt">CRAC</a></li><br> <li><a href="desc/addenda/LDC2019T10.xml">MAS-XML</a></li><br> </ul><br> <h3>Updates</h3><br> <p>None at this time.</p><br> Portions © 2019 University of Essex, © 2019 Trustees of the University of Pennsylvania
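<p>To illustrate how several player judgments per markable can be reduced to a single label, here is a minimal sketch using plain majority voting. This is not the MPA method used for the corpus's silver labels; it is a much simpler baseline shown only for intuition. The label strings, the <code>majority_label</code> helper, the markable ids and the toy judgments are all hypothetical and do not reflect the MAS-XML schema or any real corpus data.</p>

```python
from collections import Counter

# The four categories named in the corpus description.
# These exact strings are an assumption made for this sketch.
LABELS = {"discourse-new", "discourse-old", "expletive", "property"}


def majority_label(judgments):
    """Return (label, share) for the most frequent judgment.

    Ties are broken alphabetically so the result is deterministic.
    This is simple majority voting, NOT the MPA aggregation method.
    """
    if not judgments:
        raise ValueError("at least one judgment is required")
    unknown = set(judgments) - LABELS
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    counts = Counter(judgments)
    top = max(counts.values())
    label = min(name for name, n in counts.items() if n == top)
    return label, top / len(judgments)


# Invented toy judgments keyed by hypothetical markable ids.
toy_judgments = {
    "m1": ["discourse-old", "discourse-old", "discourse-old", "discourse-new"],
    "m2": ["expletive", "expletive", "property"],
}

silver = {m: majority_label(j) for m, j in toy_judgments.items()}
for markable, (label, share) in silver.items():
    print(f"{markable}: {label} ({share:.0%} of judgments)")
```

<p>Unlike majority voting, a probabilistic aggregation method can weight players differently, which is one reason the two approaches may disagree on contested markables.</p>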
Provider:
Linguistic Data Consortium
Created:
2020-11-30