
Japanese Web N-gram Version 1

Source: DataCite Commons. Updated 2021-07-01. Indexed 2025-04-16.
Download link:
https://catalog.ldc.upenn.edu/LDC2009T08
Description:
<h3>Introduction</h3>
<p>Japanese Web N-gram Version 1, Linguistic Data Consortium (LDC) catalog number LDC2009T08 and ISBN 1-58563-510-3, was created by Google Inc. It consists of Japanese "word" n-grams and their observed frequency counts, generated from over 255 billion tokens of text. The n-grams range in length from unigrams to seven-grams.</p>
<p>The n-grams were extracted from publicly accessible web pages crawled by Google in July 2007. The data set contains only n-grams that appear at least 20 times in the processed sentences; less frequent n-grams were discarded. Web pages requiring user authentication, pages containing "noarchive" or "noindex" meta tags, and pages under other special restrictions were excluded from the final release. Although the aim was to process only Japanese pages, the corpus may contain some pages in other languages because of language detection errors. This data set will be useful for research in areas such as statistical machine translation, language modeling and speech recognition, among others.</p>
<h3>Data</h3>
<p>Before the n-grams were collected, the web pages were converted into UTF-8 encoding, normalized into Unicode Normalization Form KC (see below), and split into sentences. Ill-formed sentences were filtered out, and the remaining sentences were segmented into "words".</p>
<p>All strings were normalized into Unicode Normalization Form KC (NFKC), which is described in <a href="http://www.unicode.org/unicode/reports/tr15/" rel="nofollow">http://www.unicode.org/unicode/reports/tr15/</a>. Japanese strings were normalized according to the following rules:</p>
<ul>
<li>Full-width letters/digits were converted to ASCII letters/digits</li>
<li>Half-width katakana were converted to full-width katakana</li>
<li>Glyphs for Roman numerals were converted to ASCII characters</li>
<li>Certain Japanese-specific symbols were converted</li>
</ul>
<p>The vocabulary was restricted to "words" that appeared at least 50 times in the processed sentences.</p>
<p>Statistical information about the corpus is set forth in the following table:</p>
<table>
<tr> <th>Total compressed data size:</th> <td>About 26 GB</td> </tr>
<tr> <th>Number of tokens:</th> <td>255,198,240,937</td> </tr>
<tr> <th>Number of sentences:</th> <td>20,036,793,177</td> </tr>
<tr> <th>Number of unique unigrams:</th> <td>2,565,424</td> </tr>
<tr> <th>Number of unique bigrams:</th> <td>80,513,289</td> </tr>
<tr> <th>Number of unique trigrams:</th> <td>394,482,216</td> </tr>
<tr> <th>Number of unique 4-grams:</th> <td>707,787,333</td> </tr>
<tr> <th>Number of unique 5-grams:</th> <td>776,378,943</td> </tr>
<tr> <th>Number of unique 6-grams:</th> <td>688,782,933</td> </tr>
<tr> <th>Number of unique 7-grams:</th> <td>570,204,252</td> </tr>
</table>
<h3>Samples</h3>
<p><a href="./desc/addenda/LDC2009T08_Bigram.png" rel="nofollow">Japanese Bigram</a> <a href="./desc/addenda/LDC2009T08_Trigram.png" rel="nofollow">Japanese Trigram</a></p>
<br> Portions © 2007 Google Inc., © 2009 Trustees of the University of Pennsylvania
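The normalization rules listed in the description can be illustrated with standard NFKC normalization. The following is a minimal sketch using Python's built-in `unicodedata` module, not the pipeline Google actually used. The helper name `normalize_nfkc` and the sample strings are assumptions chosen for illustration, and the "certain Japanese-specific symbols" handled in the corpus may go beyond what plain NFKC covers.

```python
import unicodedata


def normalize_nfkc(text: str) -> str:
    """Apply standard Unicode NFKC normalization to a string.

    Illustrative helper (hypothetical name); the corpus description only
    states that NFKC was applied, not how it was implemented.
    """
    return unicodedata.normalize("NFKC", text)


# Hypothetical sample inputs, one per rule named in the description.
samples = {
    "full-width letters/digits": "ＡＢＣ１２３",
    "half-width katakana": "ｶﾞｲﾄﾞ",
    "Roman numeral glyph": "Ⅳ",
    "Japanese-specific symbol": "㈱",
}

for rule, raw in samples.items():
    # Show each raw string next to its NFKC-normalized form.
    print(f"{rule}: {raw!r} -> {normalize_nfkc(raw)!r}")
```

Applying the same normalization to query strings before looking them up in the n-gram files should help them match the stored forms, assuming no further corpus-specific conversions are involved.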
Provider:
Linguistic Data Consortium
Created:
2020-11-30