
Chinese Treebank 6.0

DataCite Commons | Updated 2021-07-01 | Indexed 2025-04-16
Download link:
https://catalog.ldc.upenn.edu/LDC2007T36
Resource description:
<h3>Introduction</h3><br> <p>This file contains documentation for Chinese Treebank 6.0, Linguistic Data Consortium (LDC) catalog number LDC2007T36 and ISBN 1-58563-450-6.</p><br> <p>The Chinese Treebank project began at the University of Pennsylvania in 1998 and continues at Penn and the University of Colorado. Chinese Treebank 6.0 is the latest version produced from this effort, consisting of 780,000 words (over 1.28 million Chinese characters) that are segmented, part-of-speech tagged and fully bracketed. The data sources include newswire from Xinhua News Agency, articles from Sinorama Magazine, news from the website of the Hong Kong Special Administrative Region and transcripts from various broadcast news programs.</p><br> <p>The LDC published Chinese Treebank 1.0 in 2000; it was later corrected and released in 2001 as <a href="http://catalog.ldc.upenn.edu/LDC2001T11" rel="nofollow">Chinese Treebank 2.0 (LDC2001T11)</a> and consisted of approximately 100,000 words. The LDC released <a href="http://catalog.ldc.upenn.edu/LDC2004T05" rel="nofollow">Chinese Treebank 4.0 (LDC2004T05)</a>, an updated version containing roughly 400,000 words, in 2004. A year later, the LDC published the 500,000-word <a href="http://catalog.ldc.upenn.edu/LDC2005T01" rel="nofollow">Chinese Treebank 5.0 (LDC2005T01)</a>.</p><br> <p>For information about Chinese Treebank methodology and guidelines, consult the attached documentation files and the <a href="https://www.cs.brandeis.edu/~llc/page2/page2.html">Chinese Treebank Project</a> website.</p><br> <p>This release encompasses 2,036 text files, containing 28,295 sentences, 781,351 words and 1,285,149 hanzi (Chinese characters). The data is provided in two encodings, GBK and UTF-8, and the annotation has Penn Treebank-style labeled brackets. Details of the annotation standard can be found in the enclosed segmentation, POS-tagging and bracketing guidelines.
The data is provided in four different formats: raw text, word segmented, word segmented and POS-tagged, and syntactically bracketed.</p><br> <h3>Samples</h3><br> <p>For an example of the data in this publication, please examine this <a href="desc/addenda/LDC2007T36.jpg" rel="nofollow">sample</a> of the bracketed data.</p><br> Portions © 2000-2001 China Broadcasting System, © 2000-2001 China Central TV, © 2000-2001 China National Radio, © 2000-2001 China Television System, © 1997 The Government of the Hong Kong Special Administrative Region, © 1996-2001 Sinorama Magazine, © 1994-1998 Xinhua News Agency, © 2001, 2004, 2005, 2007 Trustees of the University of Pennsylvania
Provider:
Linguistic Data Consortium
Created:
2020-11-30
Collected summary
Dataset introduction
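Background and challenges
Background overview
Chinese Treebank 6.0 is a Chinese treebank dataset released in 2007. It contains about 780,000 words (over 1.28 million Chinese characters) drawn from news text such as Xinhua News Agency newswire and Sinorama Magazine articles, all word-segmented, part-of-speech tagged and syntactically bracketed. The dataset is suited to tasks such as natural language processing, syntactic parsing and machine translation. It is an updated release in the Chinese Treebank series and is provided in multiple data formats and encodings.
The above content was collected and summarized by 遇见数据集.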
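The description above says the annotation uses Penn Treebank-style labeled brackets. As a rough illustration of what reading that format can involve, here is a minimal Python sketch. The function names `parse_brackets` and `leaves_with_tags` are hypothetical, and the sample tree is a made-up toy sentence, not text from the corpus; the exact file layout of the release is not shown here and should be checked against the enclosed bracketing guidelines.

```python
import re


def parse_brackets(text):
    """Parse one bracketed tree into nested (label, children) tuples.

    Children are either word strings or nested (label, children) tuples.
    """
    # Tokens are "(", ")", or any run of characters that is neither
    # whitespace nor a parenthesis (labels and words).
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        # Allow an unlabeled wrapper node such as "( (NP ...) )".
        if tokens[pos] in ("(", ")"):
            label = ""
        else:
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    tree = read_node()
    if pos != len(tokens):
        raise ValueError("trailing tokens after the tree")
    return tree


def leaves_with_tags(tree):
    """Return (word, tag) pairs for every preterminal node, left to right."""
    label, children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return [(children[0], label)]
    pairs = []
    for child in children:
        if isinstance(child, tuple):
            pairs.extend(leaves_with_tags(child))
    return pairs


# Made-up toy example for illustration only; not taken from the corpus.
SAMPLE = "(IP (NP (NR 张三)) (VP (VV 来) (AS 了)) (PU 。))"
tree = parse_brackets(SAMPLE)
pairs = leaves_with_tags(tree)
```

Because the release ships in both GBK and UTF-8, a file from the GBK copy would need `encoding="gbk"` passed to Python's built-in `open()` before its contents reach a parser like this one.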