
BOLT English Discussion Forums

DataCite Commons | Updated 2021-07-01 | Indexed 2025-04-16
Download link:
https://catalog.ldc.upenn.edu/LDC2017T11
Description:
<h3>Introduction</h3><br>
<p>BOLT English Discussion Forums was developed by the Linguistic Data Consortium (LDC) and consists of 830,440 discussion forum threads in English harvested from the Internet using a combination of manual and automatic processes.</p><br>
<p>The DARPA <a href="https://www.ldc.upenn.edu/collaborations/current-projects/bolt">BOLT</a> (Broad Operational Language Translation) program developed machine translation and information retrieval for less formal genres, focusing particularly on user-generated content. LDC supported the BOLT program by collecting informal data sources -- discussion forums, text messaging and chat -- in Chinese, Egyptian Arabic and English. The collected data was translated and annotated for various tasks including word alignment, treebanking, propbanking and co-reference. The material in this release represents the unannotated English source data in the discussion forum genre.</p><br>
<h3>Data</h3><br>
<p>Collection was seeded based on the results of manual data scouting by native speaker annotators. Scouts were instructed to seek content in English that was original, interactive and informal. Upon locating an appropriate thread, scouts submitted the URL and some simple judgments about it to a database via a web browser plug-in. When multiple threads from a forum were submitted, the entire forum was automatically harvested and added to the collection. The scale of the collection precluded manual review of all data. Only a small portion of the threads in this release was manually reviewed, so the release may contain some offensive or otherwise undesired content, as well as some threads with a large amount of non-English content. Language identification was performed on all threads in this corpus (using <a href="https://github.com/CLD2Owners/cld2">CLD2</a>), and threads for which the results indicate a high probability of largely non-English content are listed in eng_suspect_LID.txt in the docs directory of this package.</p><br>
<p>The corpus consists of zipped HTML and XML files. Each HTML file is the raw HTML downloaded from a discussion thread. If a thread spanned multiple URLs, it was stored as a concatenation of the downloaded HTML files. The XML files were converted from the raw HTML.</p><br>
<h3>Acknowledgement</h3><br>
<p>This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR0011-11-C-0145. The content does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.</p><br>
<h3>Samples</h3><br>
<p>Please view this <a href="desc/addenda/LDC2017T11.html">html sample</a> and <a href="desc/addenda/LDC2017T11.xml">xml sample</a>.</p><br>
<h3>Updates</h3><br>
<p>None at this time.</p><br>
Portions © 2017 Trustees of the University of Pennsylvania
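Since the description notes that threads with likely non-English content are listed in eng_suspect_LID.txt, a consumer of the corpus may want to skip them when reading the zipped files. The following is only a minimal sketch under stated assumptions, not the LDC's tooling: the layout of eng_suspect_LID.txt is not described above, so the code assumes one entry per line with the thread identifier as the first whitespace-separated token, and it assumes a thread identifier equals the archive member's file name without the .xml suffix. The function names and the demo archive are hypothetical.

```python
import io
import zipfile
from pathlib import Path


def load_suspect_ids(lines):
    """Collect thread identifiers from the lines of a suspect list.

    Assumption (not confirmed by the corpus documentation): one entry per
    line, identifier first, optional extra columns after whitespace.
    Blank lines and lines starting with '#' are ignored.
    """
    ids = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        ids.add(line.split()[0])
    return ids


def iter_clean_threads(zip_source, suspect_ids, suffix=".xml"):
    """Yield (thread_id, raw_bytes) for members whose id is not suspect.

    zip_source may be a path or a binary file-like object.
    Assumption: thread id == member file name minus the suffix.
    """
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if not name.endswith(suffix):
                continue
            thread_id = Path(name).name[: -len(suffix)]
            if thread_id in suspect_ids:
                continue
            yield thread_id, archive.read(name)


def build_demo_archive():
    """Build a tiny in-memory zip with made-up members for illustration."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("data/thread_a.xml", "<doc>a</doc>")
        archive.writestr("data/thread_b.xml", "<doc>b</doc>")
        archive.writestr("data/thread_a.html", "<html></html>")
    buffer.seek(0)
    return buffer
```

With the real package, the suspect list would be read with something like `load_suspect_ids(open("docs/eng_suspect_LID.txt", encoding="utf-8"))`; the exact path and file layout should be checked against the documentation shipped with the corpus before relying on this.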
Provider:
Linguistic Data Consortium
Created:
2020-11-30