
Web 1T 5-gram Version 1

Source: DataCite Commons. Updated 2022-11-30; indexed 2024-07-13.
Download link:
https://catalog.ldc.upenn.edu/LDC2006T13
Description:
<h3>Introduction</h3>
<p>Web 1T 5-gram Version 1 was contributed by Google Inc. and contains English word n-grams and their observed frequency counts, drawn from approximately 1 trillion tokens of text. The n-grams range in length from unigrams (single words) to five-grams. The data is expected to be useful for statistical language modeling, e.g., for machine translation or speech recognition, as well as for other uses.</p>
<h3>Data</h3>
<p>The n-gram counts were generated from text taken from publicly accessible Web pages.</p>
<p>The input encoding of documents was automatically detected, and all text was converted to UTF-8. The data was tokenized in a manner similar to the tokenization of the Wall Street Journal portion of the Penn Treebank. Notable exceptions include the following:</p>
<ul>
<li>Hyphenated words are usually separated, and hyphenated numbers usually form one token.</li>
<li>Sequences of numbers separated by slashes (e.g., in dates) form one token.</li>
<li>Sequences that look like URLs or email addresses form one token.</li>
</ul>
<p>The corpus consists of approximately 24 GB of compressed (gzip'ed) text files containing the following:</p>
<table>
<tbody>
<tr>
<td>Tokens</td>
<td>1,024,908,267,229</td>
</tr>
<tr>
<td>Sentences</td>
<td>95,119,665,584</td>
</tr>
<tr>
<td>Unigrams</td>
<td>13,588,391</td>
</tr>
<tr>
<td>Bigrams</td>
<td>314,843,401</td>
</tr>
<tr>
<td>Trigrams</td>
<td>977,069,902</td>
</tr>
<tr>
<td>Fourgrams</td>
<td>1,313,818,354</td>
</tr>
<tr>
<td>Fivegrams</td>
<td>1,176,470,663</td>
</tr>
</tbody>
</table>
<h3>Samples</h3>
<p>For an example of the 3-gram data in this corpus, please review this <a href="desc/addenda/LDC2006T13.3gm.txt">text sample (TXT)</a>.</p>
<p>For an example of the 4-gram data in this corpus, please review this <a href="desc/addenda/LDC2006T13.4gm.txt">text sample (TXT)</a>.</p>
<h3>Updates</h3>
<p>None at this time.</p>
Portions © 2006 Google Inc., © 2006 Trustees of the University of Pennsylvania
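To make the notion of "n-grams and their observed frequency counts" from the Introduction concrete, here is a minimal Python sketch that counts unigrams through five-grams over a toy input. It is not the procedure used to build the corpus: the function name `count_ngrams`, the plain whitespace tokenization, and the sample sentences are illustrative assumptions, and the sketch applies none of the tokenization exceptions listed above.

```python
from collections import Counter
from typing import Dict, Iterable


def count_ngrams(sentences: Iterable[str], max_order: int = 5) -> Dict[int, Counter]:
    """Count n-grams of order 1..max_order, one Counter per order.

    Hypothetical helper for illustration only; n-grams never cross
    sentence boundaries in this sketch.
    """
    counts: Dict[int, Counter] = {n: Counter() for n in range(1, max_order + 1)}
    for sentence in sentences:
        # Assumption: simple whitespace tokenization, not the
        # Penn Treebank-style tokenization described for the corpus.
        tokens = sentence.split()
        for n in range(1, max_order + 1):
            # Slide a window of width n across the token sequence.
            for i in range(len(tokens) - n + 1):
                counts[n][tuple(tokens[i:i + n])] += 1
    return counts


# Toy input (made up for this example).
toy_sentences = [
    "the cat sat on the mat",
    "the cat sat on the sofa",
]
counts = count_ngrams(toy_sentences)

print(counts[1][("the",)])               # frequency of the unigram "the"
print(counts[3][("the", "cat", "sat")])  # frequency of one trigram
print(len(counts[5]))                    # number of distinct five-grams
```

Each order gets its own table of distinct n-grams with a frequency, which corresponds to the per-order rows (Unigrams through Fivegrams) in the table above. Holding everything in memory, as done here, is only practical for small inputs.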
Provider:
Linguistic Data Consortium
Created:
2020-11-30