Chinese Treebank 2.0

Name: Chinese Treebank 2.0
Creator: Linguistic Data Consortium
Published: 2021-07-01 16:14:26
License: No description available

DataCite Commons; updated 2021-07-01; indexed 2025-04-16

Download link:

https://catalog.ldc.upenn.edu/LDC2001T11

Description:

The Chinese Treebank 2.0 was produced by:
Principal Investigators: Martha Palmer, Mitch Marcus, Tony Kroch
Consultants: Martha Palmer, Mitch Marcus, Tony Kroch, Shizhe Huang, Mary Ellen Okurowski, John Kovarik, Boyan A. Onyshkevych
Project Managers and Guideline Designers: Fei Xia, Nianwen Xue
Annotators: Fu-Dong Chiou, Nianwen Xue
Programming support: Zhibiao Wu

<h3>Introduction</h3>
Published by the Linguistic Data Consortium (LDC), catalog number LDC2001T11 and ISBN 1-58563-204-X. The Chinese Penn Treebank Project started in summer 1998. Its goal is to create a 100,000-word corpus of Chinese with syntactic bracketing. More information is available at <a href="https://www.cs.brandeis.edu/~llc/page2/page2.html">The Chinese Treebank Project</a>. Chinese Treebank 2.0 supersedes and replaces the Chinese Penn Treebank Final Release (LDC2000T48, ISBN 1-58563-187-6).

<h3>Data</h3>
<table> <tbody>
<tr> <td>Size:</td> <td>About 100K words, 325 data files</td> </tr>
<tr> <td>Source:</td> <td>325 articles from Xinhua newswire between 1994 and 1998</td> </tr>
<tr> <td>Coding:</td> <td>GB code</td> </tr>
<tr> <td>Format:</td> <td>Same as the UPenn English Treebank, except that some original file information, such as "SRCID" and "DATE", is retained in the data file.</td> </tr>
<tr> <td>Annotation:</td> <td>Every file is annotated at least twice: one annotator does the first pass, and a second annotator checks the resulting files (second pass).</td> </tr>
<tr> <td>SGML:</td> <td>All data files validate against <a href="../../../Catalog/desc/addenda/LDC2001T11_1.dtd" rel="nofollow">chtb.dtd</a> using nsgmls.</td> </tr>
</tbody> </table>
The files are located in the data subdirectory and are named sequentially as chtb_nnn.fid, where nnn is the sequential file number. A cross-reference in <a href="../../../Catalog/desc/addenda/LDC2001T11_2.tbl" rel="nofollow">file.tbl</a> provides some annotator and historical information.
Portions © 1994-1998 Xinhua News Agency, © 2001 Trustees of the University of Pennsylvania
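
The file naming, encoding, and bracketing format described above can be illustrated with a minimal Python sketch. It is not part of the LDC release, and several details are assumptions: that nnn is a zero-padded three-digit number, that "GB code" can be decoded with Python's gb2312 codec, and that the syntactic bracketing follows the usual Penn Treebank parenthesis style. The function names are hypothetical, and the parser expects a bare bracketed string with any SGML markup already removed.

```python
import re

# Matches an opening bracket, a closing bracket, or any run of other
# non-whitespace characters (a label or a word).
TOKEN = re.compile(r"\(|\)|[^\s()]+")


def chtb_filename(n: int) -> str:
    """Build a data file name from a sequential file number.

    Assumption: nnn is zero-padded to three digits (chtb_001.fid, ...).
    """
    if not 1 <= n <= 999:
        raise ValueError("file number must fit in three digits")
    return f"chtb_{n:03d}.fid"


def decode_gb(raw: bytes) -> str:
    """Decode raw file bytes.

    Assumption: "GB code" is treated here as GB2312.
    """
    return raw.decode("gb2312")


def parse_brackets(text: str) -> list:
    """Parse Penn-Treebank-style bracketing into nested lists."""
    stack = [[]]
    for tok in TOKEN.findall(text):
        if tok == "(":
            stack.append([])
        elif tok == ")":
            if len(stack) == 1:
                raise ValueError("unbalanced ')'")
            node = stack.pop()
            stack[-1].append(node)
        else:
            stack[-1].append(tok)
    if len(stack) != 1:
        raise ValueError("unbalanced '('")
    return stack[0]
```

For example, `chtb_filename(7)` yields `chtb_007.fid` under the padding assumption, and `parse_brackets` turns a bracketed sentence into one nested list per top-level tree. The sentence used to try it out would be an invented illustration, not a line from the corpus.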

Provider:

Linguistic Data Consortium

Created:

2020-11-30
