English Word Frequency

Name: English Word Frequency
Creator: Kaggle
Published: 2017-09-06 00:00:00
License: No description available

Source: www.kaggle.com; updated 2017-09-06; indexed 2025-01-21

Download link:

https://www.kaggle.com/rtatman/english-word-frequency


Description:

### Context

How frequently a word occurs in a language is an important piece of information for natural language processing and for linguists. In natural language processing, very frequent words tend to be less informative than less frequent ones and are often removed during preprocessing. Human language users are also sensitive to word frequency: how often a word is used affects language processing in humans. For example, [very frequent words are read and understood more quickly](http://econtent.hogrefe.com/doi/abs/10.1027/1618-3169/a000123?journalCode=zea) and can be [understood more easily in background noise](http://asa.scitation.org/doi/abs/10.1121/1.1918432).

### Content

This dataset contains the counts of the 333,333 most commonly used single words on the English-language web, as derived from the Google Web Trillion Word Corpus.

### Acknowledgements

Peter Norvig derived the data files from the Google Web Trillion Word Corpus (as [described](https://research.googleblog.com/2006/08/all-our-n-gram-are-belong-to-you.html) by Thorsten Brants and Alex Franz, and [distributed](https://catalog.ldc.upenn.edu/LDC2006T13) by the Linguistic Data Consortium). You can find more information on these files and the code used to generate them [here](http://norvig.com/ngrams/). The code used to generate this dataset is distributed under the [MIT License](https://en.wikipedia.org/wiki/MIT_License).

### Inspiration

* Can you tag the part of speech of these words? Which parts of speech are most frequent? Is this similar to other languages, like [Japanese](https://www.kaggle.com/rtatman/japanese-lemma-frequency)?
* What differences are there between the very frequent words in this dataset and the frequent words in other corpora, such as the [Brown Corpus](https://www.kaggle.com/nltkdata/brown-corpus) or the [TIMIT corpus](https://www.kaggle.com/nltkdata/timitcorpus)? What might these differences tell us about how language is used?
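The Context section notes that very frequent words are often removed during preprocessing. A minimal sketch of that idea follows, assuming the counts have already been loaded into a dictionary. The function names, the cutoff `k`, and the counts are hypothetical and chosen only for illustration; they are not values from the dataset.

```python
def top_k_words(counts, k):
    """Return the k words with the highest counts as a set."""
    ranked = sorted(counts, key=counts.get, reverse=True)
    return set(ranked[:k])


def drop_frequent(tokens, stop_words):
    """Keep only tokens not in the stop-word set (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stop_words]


# Made-up counts for illustration only; not taken from the dataset.
counts = {"the": 1000, "of": 600, "and": 400, "corpus": 5, "frequency": 3}

# Treat the 3 most frequent words as stop-word candidates (arbitrary cutoff).
stop_words = top_k_words(counts, k=3)
filtered = drop_frequent("The frequency of the corpus".split(), stop_words)
```

The cutoff is a judgment call: a larger `k` removes more function words but may also start dropping content words.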


Provider:

Kaggle


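Background overview

This dataset contains the 333,333 most commonly used English words and their occurrence counts on the web, derived from the Google Web Trillion Word Corpus. It has a two-column structure, with one column for the word and one for its count. It is suited to natural language processing tasks such as word frequency analysis and preprocessing, and to linguistic research on how words are distributed in real web text.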
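The two-column structure described above could be read and normalized roughly as follows. This is a sketch under assumptions: it supposes a CSV with a `word,count` header row, which this page does not confirm, and it uses a small in-memory sample with made-up counts so that it runs without the real file.

```python
import csv
import io

# Hypothetical sample in the assumed two-column layout; counts are made up.
SAMPLE_CSV = """word,count
the,1000
of,600
and,400
"""


def load_counts(handle):
    """Read word,count rows into a dict mapping word -> int count."""
    reader = csv.DictReader(handle)
    return {row["word"]: int(row["count"]) for row in reader}


def relative_frequencies(counts):
    """Convert raw counts to each word's share of the listed total."""
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


counts = load_counts(io.StringIO(SAMPLE_CSV))
freqs = relative_frequencies(counts)
```

To read a downloaded file instead, pass `open(path, newline="", encoding="utf-8")` to `load_counts`, where `path` is wherever the file was saved. Note that the shares are relative to the words listed in the file, not to the full corpus.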

The summary above was collected and generated by 遇见数据集.
