
Prague Dependency Treebank 2.0

Source: DataCite Commons. Updated 2021-07-01; indexed 2025-04-16.
Download link:
https://catalog.ldc.upenn.edu/LDC2006T01
Description:
<h3>Introduction</h3><br> <p>The Prague Dependency Treebank 2.0 (PDT 2.0) was developed by Charles University and contains approximately 2 million words of Czech text with interlinked morphological, syntactic, and complex semantic annotation. In addition, certain properties of sentence information structure and coreference relations are annotated at the semantic level.</p><br> <p>PDT 2.0 follows <a href="../../../LDC2001T10">Prague Dependency Treebank 1.0 (LDC2001T10)</a> and is based on the long-standing Praguian linguistic tradition, adapted to the needs of current computational linguistics research. The corpus itself uses the latest annotation technology. Software tools for corpus search, annotation, and language analysis are included. Extensive documentation (in English) is provided as well.</p><br> <h3>Data</h3><br> <p>The data in this corpus come from four sources:</p><br> <ul><br> <li>Lidov&eacute; Noviny (daily newspapers), 1991, 1994, 1995</li><br> <li>Mlad&aacute; Fronta Dnes (daily newspapers), 1992</li><br> <li>Českomoravsk&yacute; Profit (business weekly), 1994</li><br> <li>Vesm&iacute;r (scientific journal), 1992, 1993</li><br> </ul><br> <p>The texts in electronic form were provided by the <a href="https://ucnk.ff.cuni.cz/en/">Institute of the Czech National Corpus</a>.</p><br> <p>The data in PDT 2.0 are annotated on three layers: the morphological layer (m-layer), the analytical layer (a-layer), and the tectogrammatical layer (t-layer). The following table shows the amount of data, in K-words (thousands of words), broken down by annotation layer and source. 
The layers are cumulative: everything annotated at the a-layer is also annotated at the m-layer, and everything annotated at the t-layer is also annotated at the other two layers.</p><br> <table style="margin-top: 30px; margin-bottom: 30px;" border="1" width="60%"><br> <tbody><br> <tr><br> <td>Layer</td><br> <td>Lidov&eacute; Noviny</td><br> <td>Mlad&aacute; Fronta Dnes</td><br> <td>Českomoravsk&yacute; Profit</td><br> <td>Vesm&iacute;r</td><br> <td>Total</td><br> </tr><br> <tr><br> <td>m-layer</td><br> <td>1,235</td><br> <td>373</td><br> <td>171</td><br> <td>178</td><br> <td>1,957</td><br> </tr><br> <tr><br> <td>a-layer</td><br> <td>920</td><br> <td>234</td><br> <td>171</td><br> <td>178</td><br> <td>1,504</td><br> </tr><br> <tr><br> <td>t-layer</td><br> <td>640</td><br> <td>119</td><br> <td>74</td><br> <td>0</td><br> <td>833</td><br> </tr><br> </tbody><br> </table><br> <p>The primary data format for PDT 2.0 is an XML-based format called PML. An SGML-based format called CSTS was the primary format of PDT 1.0. It is now used only as an intermediate format in older NLP tools (such as taggers and parsers).</p><br> <p>As usual, the data are divided into three groups: the training data, the development test data, and the evaluation test data. 
The training data cover approximately 80%, the development test data 10%, and the evaluation test data 10% of the whole data set (these proportions hold for all three layers of annotation).</p><br> <h3>Samples</h3><br> <p>For an example of the data in this corpus, please view these <a href="http://ufal.mff.cuni.cz/pdt2.0/visual-data/sample/index.htm" rel="nofollow">samples</a>.</p><br> <h3>Updates</h3><br> <p>None at this time.</p><br> Portions © 1991, 1994, 1995 Lidové noviny daily newspapers, © 1992 Mladá fronta Dnes daily newspapers, © 1994 Českomoravský Profit business weekly, © 1992-1993 Vesmír scientific magazine, Academia Publishers, © 1996-2005 Institute of Formal and Applied Linguistics and Center for Computational Linguistics, Faculty of Mathematics and Physics, Charles University, © 2006 Trustees of the University of Pennsylvania
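
The cumulative layer structure and the approximate 80/10/10 split described in the Data section can be checked with a few lines of Python. This is only an illustrative sketch: the counts are copied from the table, while the names `KWORDS`, `REPORTED_TOTALS`, `layers_are_nested`, and `approximate_split` are hypothetical and not part of the PDT 2.0 distribution. The split sizes it prints are rough estimates derived from the stated proportions, not official figures.

```python
# K-words (thousands of words) per annotation layer, copied from the table.
# Column order: Lidové Noviny, Mladá Fronta Dnes, Českomoravský Profit, Vesmír.
KWORDS = {
    "m-layer": [1235, 373, 171, 178],
    "a-layer": [920, 234, 171, 178],
    "t-layer": [640, 119, 74, 0],
}

# Totals as reported in the table's last column.
REPORTED_TOTALS = {"m-layer": 1957, "a-layer": 1504, "t-layer": 833}


def layers_are_nested(kwords):
    """Return True if, for every source, t-layer <= a-layer <= m-layer."""
    m, a, t = kwords["m-layer"], kwords["a-layer"], kwords["t-layer"]
    return all(ti <= ai <= mi for mi, ai, ti in zip(m, a, t))


def approximate_split(total_kwords, train=0.8, dev=0.1):
    """Estimate (train, dev, eval) sizes in K-words; eval takes the remainder."""
    train_k = round(total_kwords * train)
    dev_k = round(total_kwords * dev)
    return train_k, dev_k, total_kwords - train_k - dev_k


if __name__ == "__main__":
    print("layers nested:", layers_are_nested(KWORDS))
    for layer, counts in KWORDS.items():
        print(
            layer,
            "| sum of sources:", sum(counts),
            "| reported total:", REPORTED_TOTALS[layer],
            "| estimated split:", approximate_split(REPORTED_TOTALS[layer]),
        )
```

The a-layer sources sum to 1,503 K-words while the table reports 1,504; this is presumably a rounding effect of expressing each cell in thousands, so the sketch keeps the reported totals separate from the per-source counts.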
Provider:
Linguistic Data Consortium
Created:
2020-11-30