
Chinese Web 5-gram Version 1

Source: DataCite Commons. Updated 2025-01-01; indexed 2025-04-16.
Download link:
https://catalog.ldc.upenn.edu/LDC2010T06
Description:
<h3>Introduction</h3>
<p>Chinese Web 5-gram Version 1, Linguistic Data Consortium (LDC) catalog number LDC2010T06 and ISBN 1-58563-539-1, was created by researchers at Google Inc. It consists of Chinese word n-grams and their observed frequency counts generated from over 800 billion tokens of text. The length of the n-grams ranges from unigrams (single words) to 5-grams. This data should be useful for statistical language modeling (e.g., segmentation, machine translation) as well as for other uses.</p>
<p>Included with this publication is a simple segmenter written in Perl that uses the same algorithm used to generate the data.</p>
<h3>Data Collection</h3>
<p>N-gram counts were generated from approximately 883 billion word tokens of text from publicly accessible web pages. This data set contains only n-grams that appeared at least 40 times in the processed sentences; less frequent n-grams were discarded. While the aim was to identify and collect only Chinese-language pages, some text from other languages is incidentally included in the final data.</p>
<p>Data collection took place in March 2008; no text created on or after April 1, 2008 was used to develop this corpus.</p>
<h3>Preprocessing</h3>
<p>The input character encoding of documents was automatically detected, and all text was converted to UTF-8. The data was tokenized by an automatic tool, and all continuous Chinese character sequences were processed by the segmenter.</p>
<p>The following types of tokens are considered valid:</p>
<ul>
<li>A Chinese word containing only Chinese characters.</li>
<li>Numbers, e.g., 198, 2,200, 2.3, etc.</li>
<li>Single Latin tokens, such as Google, &amp;ab, etc.</li>
</ul>
<h3>Extent of Data</h3>
<ul>
<li>File sizes: approx. 30 GB compressed (gzip'ed) text files</li>
<li>Number of tokens: 882,996,532,572</li>
<li>Number of sentences: 102,048,435,515</li>
<li>Number of unigrams: 1,616,150</li>
<li>Number of bigrams: 281,107,315</li>
<li>Number of trigrams: 1,024,642,142</li>
<li>Number of fourgrams: 1,348,990,533</li>
<li>Number of fivegrams: 1,256,043,325</li>
</ul>
<h3>Sample</h3>
<a href="./desc/addenda/LDC2010T06.jpg" rel="nofollow">Sample screen shot</a> <br>
Portions © 2008 Google Inc., © 2010 Trustees of the University of Pennsylvania
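The token rules and the 40-occurrence cutoff described above can be illustrated with a short Python sketch. This is a hypothetical illustration under stated assumptions, not the tool used to build the corpus and not the bundled Perl segmenter. The function names (`classify_token`, `parse_ngram_line`, `frequent_ngrams`), the regular expressions, the approximation of "Chinese characters" by the CJK Unified Ideographs block (U+4E00 to U+9FFF), and the line layout of space-separated tokens followed by a tab and a count are all assumptions; check the corpus documentation for the actual file format.

```python
import re
from typing import Iterable, Iterator, Tuple

MIN_COUNT = 40  # cutoff stated in the description: n-grams seen fewer times were discarded

# Assumption: "Chinese characters" approximated by the basic CJK Unified Ideographs block.
CHINESE_WORD = re.compile(r"[\u4e00-\u9fff]+")
# Assumption: numbers shaped like 198, 2,200, or 2.3.
NUMBER = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?")
# Assumption: a Latin token is printable ASCII with at least one letter (e.g. Google, &ab).
LATIN = re.compile(r"[!-~]*[A-Za-z][!-~]*")


def classify_token(token: str) -> str:
    """Return 'chinese', 'number', 'latin', or 'invalid' for a single token."""
    if CHINESE_WORD.fullmatch(token):
        return "chinese"
    if NUMBER.fullmatch(token):
        return "number"
    if LATIN.fullmatch(token):
        return "latin"
    return "invalid"


def parse_ngram_line(line: str) -> Tuple[Tuple[str, ...], int]:
    """Split one line into (tokens, count).

    Assumed layout (not confirmed by the description): 'w1 w2 ... wn<TAB>count'.
    """
    ngram, count = line.rstrip("\n").rsplit("\t", 1)
    return tuple(ngram.split(" ")), int(count)


def frequent_ngrams(
    lines: Iterable[str], min_count: int = MIN_COUNT
) -> Iterator[Tuple[Tuple[str, ...], int]]:
    """Yield only the n-grams whose count reaches min_count."""
    for line in lines:
        tokens, count = parse_ngram_line(line)
        if count >= min_count:
            yield tokens, count


if __name__ == "__main__":
    # Made-up sample lines for illustration; these are not taken from the corpus.
    sample = ["中文 网页\t125\n", "Google 搜索\t39\n"]
    print(list(frequent_ngrams(sample)))
    print([classify_token(t) for t in ["中文", "2,200", "&ab"]])
```

To read the real files, the same generator could be fed from `gzip.open(path, "rt", encoding="utf-8")`, since the description states that the data ships as gzip-compressed UTF-8 text.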
Provider:
Linguistic Data Consortium
Created:
2020-11-30