Chinese Web 5-gram Version 1

Name: Chinese Web 5-gram Version 1
Creator: Linguistic Data Consortium
Published: 2025-01-01 08:55:19
License: No description available

DataCite Commons | Updated 2025-01-01 | Indexed 2025-04-16

Download link:

https://catalog.ldc.upenn.edu/LDC2010T06

Resource description:

<h3>Introduction</h3>
<p>Chinese Web 5-gram Version 1, Linguistic Data Consortium (LDC) catalog number LDC2010T06 and ISBN 1-58563-539-1, was created by researchers at Google Inc. It consists of Chinese word n-grams and their observed frequency counts generated from over 800 billion tokens of text. The length of the n-grams ranges from unigrams (single words) to 5-grams. This data should be useful for statistical language modeling (e.g., segmentation, machine translation) as well as for other uses.</p>
<p>Included with this publication is a simple segmenter written in Perl using the same algorithm used to generate the data.</p>
<h3>Data Collection</h3>
<p>N-gram counts were generated from approximately 883 billion word tokens of text from publicly accessible web pages. This data set contains only n-grams that appeared at least 40 times in the processed sentences; less frequent n-grams were discarded. While the aim was to identify and collect only Chinese-language pages, some text in other languages is incidentally included in the final data.</p>
<p>Data collection took place in March 2008; no text created on or after April 1, 2008 was used to develop this corpus.</p>
<h3>Preprocessing</h3>
<p>The input character encoding of documents was automatically detected, and all text was converted to UTF-8. The data was tokenized by an automatic tool, and all continuous Chinese character sequences were processed by the segmenter.</p>
<p>The following types of tokens are considered valid:</p>
<ul>
<li>A Chinese word containing only Chinese characters.</li>
<li>Numbers, e.g., 198, 2,200, 2.3, etc.</li>
<li>Single Latin tokens, such as Google, &ab, etc.</li>
</ul>
<h3>Extent of Data</h3>
<ul>
<li>File sizes: approx. 30 GB compressed (gzip'ed) text files</li>
<li>Number of tokens: 882,996,532,572</li>
<li>Number of sentences: 102,048,435,515</li>
<li>Number of unigrams: 1,616,150</li>
<li>Number of bigrams: 281,107,315</li>
<li>Number of trigrams: 1,024,642,142</li>
<li>Number of fourgrams: 1,348,990,533</li>
<li>Number of fivegrams: 1,256,043,325</li>
</ul>
<h3>Sample</h3>
<a href="./desc/addenda/LDC2010T06.jpg" rel="nofollow">Sample screen shot</a> <br> Portions © 2008 Google Inc., © 2010 Trustees of the University of Pennsylvania
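
The description above mentions three valid token types and frequency counts intended for statistical language modeling. As a rough illustration only, the Python sketch below classifies tokens into those three types and derives a maximum-likelihood conditional probability from n-gram counts. Everything in it is an assumption rather than part of the corpus: the line format (space-separated tokens, a tab, then the count), the regular expressions, the function names, and the sample counts are all hypothetical, so check the documentation shipped with the data before relying on any of them.

```python
import re

# Assumed line format (not confirmed by the description): tokens separated
# by single spaces, then a tab, then the integer count.
def parse_ngram_line(line):
    ngram, count = line.rstrip("\n").rsplit("\t", 1)
    return tuple(ngram.split(" ")), int(count)

# Illustrative approximations of the three valid token types; these are
# NOT the rules used by the corpus's own tokenizer or segmenter.
CHINESE_RE = re.compile(r"[\u4e00-\u9fff]+")
NUMBER_RE = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?")

def token_type(token):
    if CHINESE_RE.fullmatch(token):
        return "chinese"
    if NUMBER_RE.fullmatch(token):
        return "number"
    if token and token.isascii() and token.isprintable() and " " not in token:
        return "latin"
    return "other"

def conditional_probability(counts, context, word):
    """Maximum-likelihood estimate of P(word | context) from raw counts.

    A missing n-gram only means it occurred fewer than 40 times in the
    source text, so the 0.0 returned here is a floor, not a true zero.
    """
    context_count = counts.get(context, 0)
    if context_count == 0:
        return 0.0
    return counts.get(context + (word,), 0) / context_count

if __name__ == "__main__":
    # Made-up sample lines; the counts are invented for illustration.
    sample = ["中国\t1000", "中国 人民\t250"]
    counts = dict(parse_ngram_line(line) for line in sample)
    print(conditional_probability(counts, ("中国",), "人民"))
    print([token_type(t) for t in ["中国", "2,200", "Google", "&ab"]])
```

For real language modeling the raw ratio would normally be replaced by a smoothed estimate, since the 40-occurrence cutoff removes the low-frequency n-grams that smoothing methods usually depend on.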

Provider:

Linguistic Data Consortium

Created:

2020-11-30
