Web 1T 5-gram Version 1
Source: DataCite Commons | Updated: 2022-11-30 | Indexed: 2024-07-13
Download link:
https://catalog.ldc.upenn.edu/LDC2006T13
Description:
<h3>Introduction</h3><br>
<p>Web 1T 5-gram Version 1 was contributed by Google Inc. and contains English word n-grams and their observed frequency counts for approximately 1 trillion tokens. The length of the n-grams ranges from unigrams (single words) to five-grams. This data is expected to be useful for statistical language modeling, e.g., for machine translation or speech recognition, as well as for other uses.</p><br>
<h3>Data</h3><br>
<p>The n-gram counts were generated from text taken from publicly accessible Web pages.</p><br>
<p>The input encoding of documents was automatically detected, and all text was converted to UTF-8. The data was tokenized in a manner similar to the tokenization of the Wall Street Journal portion of the Penn Treebank. Notable exceptions include the following:</p><br>
<ul><br>
<li>Hyphenated words are usually separated, and hyphenated numbers usually form one token.</li><br>
<li>Sequences of numbers separated by slashes (e.g., in dates) form one token.</li><br>
<li>Sequences that look like URLs or email addresses form one token.</li><br>
</ul><br>
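<p>The exceptions listed above can be illustrated with a small Python sketch. This is not the tokenizer that was used to build the corpus: the regular expressions, the name <code>tokenize</code>, and the choice to emit each hyphen as its own token are all assumptions made for illustration, and the sketch ignores the rest of the Penn Treebank conventions.</p><br>

```python
import re

# Hypothetical patterns approximating the three exceptions described above.
# Alternatives are tried left to right, so the more specific ones come first.
TOKEN_RE = re.compile(
    r"""
    (?:https?://|www\.)[^\s]+        # URL-like sequence: one token
  | [\w.+-]+@[\w-]+(?:\.[\w-]+)+     # email-like sequence: one token
  | \d+(?:/\d+)+                     # numbers separated by slashes: one token
  | \d+(?:-\d+)+                     # hyphenated numbers: one token
  | \w+                              # a run of word characters
  | [^\w\s]                          # any other single symbol, such as a hyphen
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Split text into tokens using the illustrative patterns above."""
    return TOKEN_RE.findall(text)


if __name__ == "__main__":
    print(tokenize("state-of-the-art results on 12/25/2006, call 555-0100"))
```

<p>In this sketch a hyphenated word such as "state-of-the-art" is split at every hyphen, while "12/25/2006" and "555-0100" each stay whole. Trailing punctuation attached to a URL is not stripped, which a production tokenizer would need to handle.</p><br>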
<p>The data consists of 24 GB of compressed (gzip'ed) text files containing the following:</p><br>
<table><br>
<tbody><br>
<tr><br>
<td>Tokens</td><br>
<td>1,024,908,267,229</td><br>
</tr><br>
<tr><br>
<td>Sentences</td><br>
<td>95,119,665,584</td><br>
</tr><br>
<tr><br>
<td>Unigrams</td><br>
<td>13,588,391</td><br>
</tr><br>
<tr><br>
<td>Bigrams</td><br>
<td>314,843,401</td><br>
</tr><br>
<tr><br>
<td>Trigrams</td><br>
<td>977,069,902</td><br>
</tr><br>
<tr><br>
<td>Fourgrams</td><br>
<td>1,313,818,354</td><br>
</tr><br>
<tr><br>
<td>Fivegrams</td><br>
<td>1,176,470,663</td><br>
</tr><br>
</tbody><br>
</table><br>
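<p>Two derived figures help put the table in perspective: the ratio of tokens to sentences and the total number of n-gram entries across all five orders. The Python sketch below computes both from the counts in the table; the dictionary keys and function names are hypothetical and chosen only for this example.</p><br>

```python
# Counts copied from the table above.
COUNTS = {
    "tokens": 1_024_908_267_229,
    "sentences": 95_119_665_584,
    "unigrams": 13_588_391,
    "bigrams": 314_843_401,
    "trigrams": 977_069_902,
    "fourgrams": 1_313_818_354,
    "fivegrams": 1_176_470_663,
}

NGRAM_KEYS = ("unigrams", "bigrams", "trigrams", "fourgrams", "fivegrams")


def tokens_per_sentence(counts):
    """Ratio of the token count to the sentence count."""
    return counts["tokens"] / counts["sentences"]


def total_ngram_entries(counts):
    """Sum of the entry counts over all n-gram orders (1 through 5)."""
    return sum(counts[key] for key in NGRAM_KEYS)


if __name__ == "__main__":
    print(f"tokens per sentence: {tokens_per_sentence(COUNTS):.2f}")
    print(f"total n-gram entries: {total_ngram_entries(COUNTS):,}")
```

<p>The ratio comes to a little under 11 tokens per sentence, and the five orders together hold roughly 3.8 billion entries.</p><br>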
<h3>Samples</h3><br>
<p>For an example of the 3-gram data in this corpus, please review this <a href="desc/addenda/LDC2006T13.3gm.txt">text sample (TXT)</a>.</p><br>
<p>For an example of the 4-gram data in this corpus, please review this <a href="desc/addenda/LDC2006T13.4gm.txt">text sample (TXT)</a>.</p><br>
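<p>A minimal Python sketch of how such n-gram lines might be read is given below. It assumes that each line holds the n-gram's tokens separated by single spaces, followed by a tab and the count; check this against the sample files before relying on it. The sample lines in the code are invented for illustration and are not taken from the corpus, and the function names are hypothetical.</p><br>

```python
from collections import defaultdict

# Invented lines for illustration only; they are NOT taken from the corpus.
# Assumed layout: tokens separated by spaces, then a tab, then the count.
SAMPLE_LINES = [
    "the quick brown\t120\n",
    "the quick red\t30\n",
    "a quick brown\t50\n",
]


def parse_ngram_line(line):
    """Return (tuple_of_tokens, count) for one line in the assumed layout."""
    text, count = line.rstrip("\r\n").rsplit("\t", 1)
    return tuple(text.split(" ")), int(count)


def continuation_frequencies(lines, context):
    """Relative frequency of each final word among the supplied n-grams
    whose leading words equal `context`."""
    context = tuple(context)
    totals = defaultdict(int)
    for line in lines:
        ngram, count = parse_ngram_line(line)
        if ngram[:-1] == context:
            totals[ngram[-1]] += count
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {word: count / grand_total for word, count in totals.items()}


if __name__ == "__main__":
    print(continuation_frequencies(SAMPLE_LINES, ("the", "quick")))
```

<p>The frequencies are normalized only over the lines passed in, so they are relative shares within that set rather than smoothed language-model probabilities.</p><br>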
<h3>Updates</h3><br>
<p>None at this time.</p><br>
Portions © 2006 Google Inc., © 2006 Trustees of the University of Pennsylvania
Provider:
Linguistic Data Consortium
Created:
2020-11-30



