
nltk-data-hub/words

Source: Hugging Face. Updated 2026-04-28; indexed 2026-05-03.
Download link:
https://hf-mirror.com/datasets/nltk-data-hub/words
Resource description:
---
configs:
- config_name: en
  data_files:
  - split: words
    path: data/en/words.parquet
- config_name: en-basic
  data_files:
  - split: words
    path: data/en-basic/words.parquet
- config_name: ngsl
  data_files:
  - split: words
    path: data/ngsl/words.parquet
- config_name: toeic
  data_files:
  - split: words
    path: data/toeic/words.parquet
- config_name: nawl
  data_files:
  - split: words
    path: data/nawl/words.parquet
- config_name: bsl
  data_files:
  - split: words
    path: data/bsl/words.parquet
- config_name: opinion-positive
  data_files:
  - split: words
    path: data/opinion-positive/words.parquet
- config_name: opinion-negative
  data_files:
  - split: words
    path: data/opinion-negative/words.parquet
license: cc-by-4.0
task_categories:
- text-classification
- token-classification
pretty_name: NLTK Word Lists
---

# NLTK Word Lists

English word lists from [NLTK](https://www.nltk.org/), the [New General Service List Project](https://www.newgeneralservicelist.com/), and [Bing Liu's Opinion Lexicon](http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html).
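Every entry in the `configs` block of the card follows the same pattern: one split named `words`, stored at `data/<config_name>/words.parquet`. The sketch below restates that pattern in Python. The names `CONFIG_NAMES`, `parquet_path`, and `front_matter_entry` are hypothetical helpers introduced for illustration; they are not part of NLTK or the `datasets` library.

```python
# Hypothetical helpers that mirror the `configs` block of the dataset card.
# They only restate the path pattern; they do not touch the network.

CONFIG_NAMES = (
    "en",
    "en-basic",
    "ngsl",
    "toeic",
    "nawl",
    "bsl",
    "opinion-positive",
    "opinion-negative",
)


def parquet_path(config_name: str) -> str:
    """Return the repository-relative Parquet path for a config."""
    if config_name not in CONFIG_NAMES:
        raise ValueError(f"unknown config: {config_name!r}")
    return f"data/{config_name}/words.parquet"


def front_matter_entry(config_name: str) -> dict:
    """Build the dict form of one `configs` entry from the card."""
    return {
        "config_name": config_name,
        "data_files": [{"split": "words", "path": parquet_path(config_name)}],
    }
```

For example, `parquet_path("ngsl")` gives `data/ngsl/words.parquet`, matching the corresponding entry in the card, while a name such as `dolch` raises `ValueError` because it is not one of this dataset's eight configs.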

## Configs

| Config | Words | Schema | License | Source |
|---|---|---|---|---|
| `en` | 235,886 | `word` | NLTK (other) | NLTK words corpus |
| `en-basic` | 850 | `word` | Public domain | Ogden Basic English (1930) |
| `ngsl` | 2,809 | `word, rank, sfi, freq_per_million` | CC-BY-SA 4.0 | New General Service List 1.2 |
| `toeic` | 1,250 | `word, rank, sfi, freq_per_million` | CC-BY-SA 4.0 | TOEIC Service List 1.2 |
| `nawl` | 963 | `word, rank, band, sfi, freq_per_million` | CC-BY-SA 4.0 | New Academic Word List 1.2 |
| `bsl` | 1,744 | `word, rank, band, sfi, freq_per_million` | CC-BY-SA 4.0 | Business Service List 1.2 |
| `opinion-positive` | 2,006 | `word` | CC-BY 4.0 | Hu & Liu Opinion Lexicon |
| `opinion-negative` | 4,783 | `word` | CC-BY 4.0 | Hu & Liu Opinion Lexicon |

## See Also

These related word list datasets are also accessible via `nltk.corpus.words.words()`:

| Dataset | Contents | NLTK access |
|---|---|---|
| [nltk-data-hub/dolch](https://huggingface.co/datasets/nltk-data-hub/dolch) | 315 Dolch sight words, 8 POS configs | `words.words("dolch")`, `words.words("dolch-verbs")`, … |
| [nltk-data-hub/swadesh](https://huggingface.co/datasets/nltk-data-hub/swadesh) | 207 Swadesh concepts × 24 languages | `words.words("swadesh-en")`, `words.words("swadesh-de")`, … |

## Schemas

**`en`, `en-basic`, `opinion-positive`, `opinion-negative`**: word only

| Column | Type | Description |
|---|---|---|
| `word` | string | The word |

**`ngsl` and `toeic`**: frequency metadata, no band

| Column | Type | Description |
|---|---|---|
| `word` | string | Headword / lemma |
| `rank` | int | Frequency rank (1 = most frequent) |
| `sfi` | float | Standard Frequency Index |
| `freq_per_million` | float | Adjusted frequency per million words |

**`nawl` and `bsl`**: frequency metadata + pedagogical band

| Column | Type | Description |
|---|---|---|
| `word` | string | Headword / lemma |
| `rank` | int | Frequency rank within this list |
| `band` | int | Pedagogical band grouping (lower = more frequent) |
| `sfi` | float | Standard Frequency Index |
| `freq_per_million` | float | Adjusted frequency per million words |

## Usage

```python
from datasets import load_dataset

ds = load_dataset("nltk-data-hub/words", "ngsl")
ds = load_dataset("nltk-data-hub/words", "nawl")
ds = load_dataset("nltk-data-hub/words", "opinion-positive")
ds = load_dataset("nltk-data-hub/words", "opinion-negative")
```

## Via NLTK

```python
import nltk
nltk.download("words", hf=True)

nltk.corpus.words.words("ngsl")              # 2,809 words, frequency order
nltk.corpus.words.words("nawl")              # 963 academic words
nltk.corpus.words.words("bsl")               # 1,744 business words
nltk.corpus.words.words("toeic")             # 1,250 TOEIC words
nltk.corpus.words.words("opinion-positive")  # 2,006 positive opinion words
nltk.corpus.words.words("opinion-negative")  # 4,783 negative opinion words
nltk.corpus.words.words("en")                # 235,886 words
nltk.corpus.words.words("en-basic")          # Ogden 850

# Routed to nltk-data-hub/dolch:
nltk.corpus.words.words("dolch")             # 315 Dolch sight words
nltk.corpus.words.words("dolch-verbs")       # 92 Dolch verbs

# Routed to nltk-data-hub/swadesh:
nltk.corpus.words.words("swadesh-en")        # 207 English Swadesh words
nltk.corpus.words.words("swadesh-de")        # 207 German Swadesh words
```

## Licenses

- `en`, `en-basic`: distributed as part of the NLTK corpus data package.
- `ngsl`, `toeic`, `nawl`, `bsl`: © Browne, Culligan & Phillips, licensed under [CC-BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/).
- `opinion-positive`, `opinion-negative`: © Bing Liu, licensed under [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/).
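
The three column layouts described under Schemas can be captured in a small lookup table, which is handy for sanity-checking rows after loading a config. The sketch below is a minimal illustration that runs without network access. `EXPECTED_COLUMNS`, `has_expected_columns`, `words_by_rank`, and `SAMPLE_ROWS` are hypothetical names, and the sample rows are placeholders, not real entries from the dataset. It assumes rows arrive as plain dicts, which is how iterating over a `datasets` split yields them.

```python
from typing import Iterable, Mapping

# Column layouts copied from the Schemas section of the card.
WORD_ONLY = ("word",)
FREQ = ("word", "rank", "sfi", "freq_per_million")
FREQ_BAND = ("word", "rank", "band", "sfi", "freq_per_million")

EXPECTED_COLUMNS = {
    "en": WORD_ONLY,
    "en-basic": WORD_ONLY,
    "opinion-positive": WORD_ONLY,
    "opinion-negative": WORD_ONLY,
    "ngsl": FREQ,
    "toeic": FREQ,
    "nawl": FREQ_BAND,
    "bsl": FREQ_BAND,
}


def has_expected_columns(config_name: str, row: Mapping) -> bool:
    """True if the row has exactly the columns documented for the config."""
    return set(row) == set(EXPECTED_COLUMNS[config_name])


def words_by_rank(rows: Iterable[Mapping], max_band=None) -> list:
    """Return words ordered by rank, optionally keeping only band <= max_band."""
    kept = [r for r in rows if max_band is None or r["band"] <= max_band]
    return [r["word"] for r in sorted(kept, key=lambda r: r["rank"])]


# Placeholder rows in the `nawl`/`bsl` layout; all values are invented.
SAMPLE_ROWS = [
    {"word": "gamma", "rank": 3, "band": 2, "sfi": 0.0, "freq_per_million": 0.0},
    {"word": "alpha", "rank": 1, "band": 1, "sfi": 0.0, "freq_per_million": 0.0},
    {"word": "beta", "rank": 2, "band": 1, "sfi": 0.0, "freq_per_million": 0.0},
]
```

With real data, the rows would presumably come from something like `load_dataset("nltk-data-hub/words", "nawl")["words"]`, given that the card declares a single split named `words` for each config. Note that `max_band` only makes sense for `nawl` and `bsl`, since the other configs have no `band` column.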

## Citations

```bibtex
@book{nltk,
  author    = {Bird, Steven and Klein, Ewan and Loper, Edward},
  title     = {Natural Language Processing with Python},
  publisher = {O'Reilly Media},
  year      = {2009},
  url       = {https://www.nltk.org/}
}

@article{ngsl,
  author  = {Browne, Charles},
  title   = {A New General Service List: The Better Mousetrap We've Been Looking For?},
  journal = {Vocabulary Learning and Instruction},
  volume  = {3},
  number  = {2},
  pages   = {1--10},
  year    = {2014},
  doi     = {10.7820/vli.v03.2.browne}
}

@misc{nawl,
  author = {Browne, Charles and Culligan, Brent and Phillips, Joseph},
  title  = {New Academic Word List 1.2},
  year   = {2013},
  url    = {https://www.newgeneralservicelist.com/nawl-new-academic-word-list}
}

@misc{tsl,
  author = {Browne, Charles and Culligan, Brent},
  title  = {TOEIC Service List 1.2},
  year   = {2016},
  url    = {https://www.newgeneralservicelist.com/toeic-service-list}
}

@misc{bsl,
  author = {Browne, Charles and Culligan, Brent},
  title  = {Business Service List 1.2},
  year   = {2016},
  url    = {https://www.newgeneralservicelist.com/business-service-list}
}

@inproceedings{opinion_lexicon,
  author    = {Hu, Minqing and Liu, Bing},
  title     = {Mining and Summarizing Customer Reviews},
  booktitle = {Proceedings of KDD-2004},
  year      = {2004},
  url       = {http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html}
}
```
Provider:
nltk-data-hub