Chinese Treebank 2.0

Source: DataCite Commons | Updated: 2021-07-01 | Indexed: 2025-04-16
Download link:
https://catalog.ldc.upenn.edu/LDC2001T11
Resource description:
<p>The Chinese Treebank 2.0 was produced by:</p>
<p>Principal Investigators: Martha Palmer, Mitch Marcus, Tony Kroch</p>
<p>Consultants: Martha Palmer, Mitch Marcus, Tony Kroch, Shizhe Huang, Mary Ellen Okurowski, John Kovarik, Boyan A. Onyshkevyc</p>
<p>Project Managers and Guideline Designers: Fei Xia, Nianwen Xue</p>
<p>Annotators: Fu-Dong Chiou, Nianwen Xue</p>
<p>Programming support: Zhibiao Wu</p>
<h3>Introduction</h3>
<p>Published by the Linguistic Data Consortium (LDC), catalog number LDC2001T11 and ISBN 1-58563-204-X.</p>
<p>The Chinese Penn Treebank Project started in Summer 1998. Its goal is to create a 100,000-word corpus of Chinese with syntactic bracketing. More information is available at <a href="https://www.cs.brandeis.edu/~llc/page2/page2.html">The Chinese Treebank Project</a>. Chinese Treebank 2.0 supersedes and replaces the Chinese Penn Treebank Final Release (LDC2000T48, ISBN 1-58563-187-6).</p>
<h3>Data</h3>
<table>
<tbody>
<tr>
<td>Size:</td>
<td>About 100K words, 325 data files</td>
</tr>
<tr>
<td>Source:</td>
<td>325 articles from Xinhua newswire between 1994 and 1998</td>
</tr>
<tr>
<td>Coding:</td>
<td>GB code</td>
</tr>
<tr>
<td>Format:</td>
<td>Same as the UPenn English Treebank, except that some original file information, such as "SRCID" and "DATE", is retained in the data files.</td>
</tr>
<tr>
<td>Annotation:</td>
<td>All files are annotated at least twice: the first pass is done by one annotator, and the resulting files are checked by a second annotator (the second pass).</td>
</tr>
<tr>
<td>SGML:</td>
<td>All data files validate against <a href="../../../Catalog/desc/addenda/LDC2001T11_1.dtd" rel="nofollow">chtb.dtd</a> using nsmls.</td>
</tr>
</tbody>
</table>
<p>The files are located in the data subdirectory and are named sequentially as chtb_nnn.fid, where nnn is the sequential file number. A cross-reference in <a href="../../../Catalog/desc/addenda/LDC2001T11_2.tbl" rel="nofollow">file.tbl</a> provides some annotator and historical information.</p>
<p>Portions © 1994-1998 Xinhua News Agency, © 2001 Trustees of the University of Pennsylvania</p>
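As an illustration of the naming scheme and encoding described above, the following is a minimal Python sketch for building a data file name and decoding its contents. It is not part of the LDC release. The helper names `chtb_filename` and `read_chtb` are hypothetical, the zero-padded three-digit number is an assumption read from the "nnn" pattern, and "GB code" is assumed to mean GB2312 (Python's `gb2312` codec). If some characters fail to decode, `gb18030` may be worth trying instead.

```python
from pathlib import Path


def chtb_filename(n: int) -> str:
    """Return the file name for sequential file number n (assumed zero-padded to 3 digits)."""
    if not 1 <= n <= 999:
        raise ValueError("file number must fit in three digits")
    return f"chtb_{n:03d}.fid"


def read_chtb(data_dir: str, n: int, encoding: str = "gb2312") -> str:
    """Read one treebank file from data_dir and decode its GB-coded bytes to a str.

    The default encoding is an assumption; undecodable bytes are replaced
    rather than raising, so a wrong guess is visible but not fatal.
    """
    path = Path(data_dir) / chtb_filename(n)
    return path.read_bytes().decode(encoding, errors="replace")
```

For example, `read_chtb("data", 1)` would read `data/chtb_001.fid`, assuming the corpus has been unpacked with its `data` subdirectory in the current working directory.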
Provider:
Linguistic Data Consortium
Created:
2020-11-30