Web 1T 5-gram Version 1

Name: Web 1T 5-gram Version 1
Creator: Linguistic Data Consortium
Published: 2022-11-30 08:55:41
License: No description available

DataCite Commons; updated 2022-11-30; indexed 2024-07-13

Download link:

https://catalog.ldc.upenn.edu/LDC2006T13

Resource description:

<h3>Introduction</h3>
Web 1T 5-gram Version 1 was contributed by Google Inc. and contains English word n-grams and their observed frequency counts, computed from approximately 1 trillion word tokens. The n-grams range in length from unigrams (single words) to five-grams. The data is expected to be useful for statistical language modeling, e.g., for machine translation or speech recognition, as well as for other uses.
<h3>Data</h3>
The n-gram counts were generated from text taken from publicly accessible Web pages. The input encoding of each document was detected automatically, and all text was converted to UTF-8. The data was tokenized in a manner similar to the tokenization of the Wall Street Journal portion of the Penn Treebank. Notable exceptions include the following:
<ul>
<li>Hyphenated words are usually separated, and hyphenated numbers usually form one token.</li>
<li>Sequences of numbers separated by slashes (e.g., in dates) form one token.</li>
<li>Sequences that look like URLs or email addresses form one token.</li>
</ul>
The corpus is distributed as compressed (gzip'ed) text files totaling 24 GB, with the following counts:
<table>
<tbody>
<tr> <td>Tokens</td> <td>1,024,908,267,229</td> </tr>
<tr> <td>Sentences</td> <td>95,119,665,584</td> </tr>
<tr> <td>Unigrams</td> <td>13,588,391</td> </tr>
<tr> <td>Bigrams</td> <td>314,843,401</td> </tr>
<tr> <td>Trigrams</td> <td>977,069,902</td> </tr>
<tr> <td>Fourgrams</td> <td>1,313,818,354</td> </tr>
<tr> <td>Fivegrams</td> <td>1,176,470,663</td> </tr>
</tbody>
</table>
<h3>Samples</h3>
For an example of the 3-gram data in this corpus, please review this <a href="desc/addenda/LDC2006T13.3gm.txt">text sample (TXT)</a>. For an example of the 4-gram data in this corpus, please review this <a href="desc/addenda/LDC2006T13.4gm.txt">text sample (TXT)</a>.
<h3>Updates</h3>
None at this time.
Portions © 2006 Google Inc., © 2006 Trustees of the University of Pennsylvania
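To make the language-modeling use mentioned in the description concrete, here is a minimal Python sketch that reads n-gram counts and turns trigram counts into a relative-frequency estimate of the next word. It rests on assumptions that the description above does not state: that each line holds a space-separated n-gram, a tab, and an integer count, and that the files are UTF-8 gzip text. The function names and the sample lines and counts are hypothetical and invented for illustration; they are not taken from the corpus.

```python
import gzip
from typing import Dict, Iterable, Iterator, Tuple

Ngram = Tuple[str, ...]


def parse_ngram_lines(lines: Iterable[str]) -> Iterator[Tuple[Ngram, int]]:
    """Yield (ngram, count) pairs from text lines.

    ASSUMPTION: each line is "<w1> <w2> ... <wn>\t<count>". This layout is
    not specified in the catalog description above.
    """
    for line in lines:
        line = line.rstrip("\n")
        # Split on the last tab so the count is always the final field.
        ngram_text, _, count_text = line.rpartition("\t")
        if not ngram_text:
            continue  # skip blank lines and lines without a tab
        yield tuple(ngram_text.split(" ")), int(count_text)


def read_ngram_file(path: str) -> Iterator[Tuple[Ngram, int]]:
    """Stream (ngram, count) pairs from one gzip-compressed file.

    ASSUMPTION: `path` is a hypothetical local file name; the files are
    read as UTF-8 text, matching the encoding named in the description.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        yield from parse_ngram_lines(handle)


def next_word_distribution(
    trigrams: Iterable[Tuple[Ngram, int]], context: Tuple[str, str]
) -> Dict[str, float]:
    """Relative-frequency estimate of P(w3 | w1, w2) from trigram counts."""
    counts: Dict[str, int] = {}
    for ngram, count in trigrams:
        if len(ngram) == 3 and ngram[:2] == context:
            counts[ngram[2]] = counts.get(ngram[2], 0) + count
    total = sum(counts.values())
    if total == 0:
        return {}  # context never observed
    return {word: count / total for word, count in counts.items()}


# Invented sample lines, for illustration only.
sample_lines = [
    "the quick brown\t30\n",
    "the quick red\t10\n",
    "a quick brown\t5\n",
]
sample_rows = list(parse_ngram_lines(sample_lines))
sample_dist = next_word_distribution(sample_rows, ("the", "quick"))
```

With the invented counts above, `sample_dist` assigns 30/40 to "brown" and 10/40 to "red". This is an unsmoothed estimate, so any continuation absent from the counts gets no probability mass; a practical language model would add smoothing or backoff to shorter n-grams.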

Provider:

Linguistic Data Consortium

Created:

2020-11-30
