
TIPSTER Complete

DataCite Commons, updated 2021-07-01, indexed 2025-04-16
Download link:
https://catalog.ldc.upenn.edu/LDC93T3A
Description:
<p>LDC93T3A - Complete TIPSTER corpus</p>
<p><a href="http://catalog.ldc.upenn.edu/LDC93T3B" rel="nofollow">LDC93T3B</a> - Volume 1 of the TIPSTER corpus</p>
<p><a href="http://catalog.ldc.upenn.edu/LDC93T3C" rel="nofollow">LDC93T3C</a> - Volume 2 of the TIPSTER corpus</p>
<p><a href="http://catalog.ldc.upenn.edu/LDC93T3D" rel="nofollow">LDC93T3D</a> - Volume 3 of the TIPSTER corpus</p>
<p>TIPSTER is sometimes also called the Text Research Collection Volume or TREC.</p>
<p>The TIPSTER project was sponsored by the Software and Intelligent Systems Technology Office of the Advanced Research Projects Agency (ARPA/SISTO) in an effort to significantly advance the state of the art in effective document detection (information retrieval) and data extraction from large, real-world data collections.</p>
<p>The detection data consists of a test collection built at NIST for the TIPSTER project and the related TREC project. The TREC project has many other participating information retrieval research groups, which work on the same task as the TIPSTER groups but meet once a year in a workshop to compare results (similar to MUC). The test collection consists of three CD-ROMs of SGML-encoded documents distributed by LDC, plus queries and answers (relevant documents) distributed by <a href="http://trec.nist.gov/" rel="nofollow">NIST</a>.</p>
<table border="1">
<tbody>
<tr> <td>Source (vol)</td> <td>Year</td> <td>Approx. # Words (Millions)</td> </tr>
<tr> <td>Associated Press (1)</td> <td>1989</td> <td>40</td> </tr>
<tr> <td>Associated Press (2)</td> <td>1988</td> <td>37</td> </tr>
<tr> <td>Associated Press (3)</td> <td>1990</td> <td>37</td> </tr>
<tr> <td>Wall Street Journal (1)</td> <td>1987</td> <td>20</td> </tr>
<tr> <td>Wall Street Journal (1)</td> <td>1988</td> <td>17</td> </tr>
<tr> <td>Wall Street Journal (1)</td> <td>1989</td> <td>6</td> </tr>
<tr> <td>Wall Street Journal (2)</td> <td>1990</td> <td>11</td> </tr>
<tr> <td>Wall Street Journal (2)</td> <td>1991</td> <td>22</td> </tr>
<tr> <td>Wall Street Journal (2)</td> <td>1992</td> <td>5</td> </tr>
<tr> <td>Dept. of Energy (1)</td> <td>&nbsp;</td> <td>28</td> </tr>
<tr> <td>Federal Register (1)</td> <td>1989</td> <td>38</td> </tr>
<tr> <td>Federal Register (2)</td> <td>1988</td> <td>30</td> </tr>
<tr> <td>Ziff/Davis (1)</td> <td>&nbsp;</td> <td>36</td> </tr>
<tr> <td>Ziff/Davis (2)</td> <td>1989-90</td> <td>26</td> </tr>
<tr> <td>Ziff/Davis (3)</td> <td>1991-92</td> <td>50</td> </tr>
<tr> <td>San Jose Mercury News (3)</td> <td>1991</td> <td>45</td> </tr>
</tbody>
</table>
<p>The documents in the test collection vary in style, size, and subject domain. The first disk contains material from the <a href="desc/addenda/LDC93T3B_WSJsample" rel="nofollow">Wall Street Journal</a> (1986, 1987, 1988, 1989), the <a href="desc/addenda/LDC93T3B_APsample" rel="nofollow">AP Newswire</a> (1989), the <a href="desc/addenda/LDC93T3B_FRsample" rel="nofollow">Federal Register</a> (1989), information from <a href="desc/addenda/LDC93T3B_CSsample" rel="nofollow">Computer Select</a> disks (Ziff-Davis Publishing), and short abstracts from the <a href="desc/addenda/LDC93T3B_DOEsample" rel="nofollow">Department of Energy</a>. The second disk contains information from the same sources, but from different years. The third disk contains more information from the Computer Select disks, plus material from the <a href="desc/addenda/LDC93T3D_SJMercurysample" rel="nofollow">San Jose Mercury News</a> (1991), more AP newswire (1990), and about 250 megabytes of formatted <a href="desc/addenda/LDC93T3D_USPatent" rel="nofollow">U.S. Patents</a>. The format of all the documents is relatively clean and easy to use, with SGML-like tags separating documents and document fields. There is no part-of-speech tagging or breakdown into individual sentences or paragraphs, because the purpose of this collection is to test retrieval against real-world data.</p>
<p>The three TIPSTER discs have been reissued with updates and corrections, and all recipients of the earlier versions should have received the replacements free of charge. If you think you have the unrevised original, contact LDC for confirmation.</p>
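The description notes that SGML-like tags separate documents and document fields. As a rough illustration, the Python sketch below splits such text into per-document field dictionaries. The tag names `<DOC>`, `<DOCNO>`, and `<TEXT>`, the helper name `iter_docs`, and the sample records are assumptions made for this example and are not taken from the description above; actual field names differ by source and should be checked against the files on the discs.

```python
import re
from typing import Dict, Iterator

# Assumption: documents are wrapped in <DOC>...</DOC> and fields use
# upper-case tag names such as <DOCNO> and <TEXT>. Verify against the
# real files before relying on this.
DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
FIELD_RE = re.compile(r"<([A-Z][A-Z0-9]*)>(.*?)</\1>", re.DOTALL)


def iter_docs(raw: str) -> Iterator[Dict[str, str]]:
    """Yield one dict per <DOC> block, mapping field tag to stripped text."""
    for doc_match in DOC_RE.finditer(raw):
        fields: Dict[str, str] = {}
        for field_match in FIELD_RE.finditer(doc_match.group(1)):
            # Keep the first occurrence of each field; repeats are ignored.
            fields.setdefault(field_match.group(1), field_match.group(2).strip())
        yield fields


# Made-up sample records, not real corpus content.
sample = """<DOC>
<DOCNO> EX-0001 </DOCNO>
<TEXT>
First made-up document.
</TEXT>
</DOC>
<DOC>
<DOCNO> EX-0002 </DOCNO>
<TEXT>
Second made-up document.
</TEXT>
</DOC>"""

docs = list(iter_docs(sample))
for doc in docs:
    print(doc["DOCNO"], "|", doc["TEXT"])
```

A regular expression is used here rather than an XML parser because SGML-like files are generally not well-formed XML (for example, there is no single root element). For full-size files, reading and splitting incrementally would be preferable to loading everything into one string.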
Provider:
Linguistic Data Consortium
Created:
2020-11-30