nltk-data-hub/words
Source: Hugging Face · Updated 2026-04-28 · Indexed 2026-05-03
Download link:
https://hf-mirror.com/datasets/nltk-data-hub/words
Description:
---
configs:
- config_name: en
  data_files:
  - split: words
    path: data/en/words.parquet
- config_name: en-basic
  data_files:
  - split: words
    path: data/en-basic/words.parquet
- config_name: ngsl
  data_files:
  - split: words
    path: data/ngsl/words.parquet
- config_name: toeic
  data_files:
  - split: words
    path: data/toeic/words.parquet
- config_name: nawl
  data_files:
  - split: words
    path: data/nawl/words.parquet
- config_name: bsl
  data_files:
  - split: words
    path: data/bsl/words.parquet
- config_name: opinion-positive
  data_files:
  - split: words
    path: data/opinion-positive/words.parquet
- config_name: opinion-negative
  data_files:
  - split: words
    path: data/opinion-negative/words.parquet
license: cc-by-4.0
task_categories:
- text-classification
- token-classification
pretty_name: NLTK Word Lists
---
# NLTK Word Lists
English word lists from [NLTK](https://www.nltk.org/),
the [New General Service List Project](https://www.newgeneralservicelist.com/),
and [Bing Liu's Opinion Lexicon](http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html).
## Configs
| Config | Words | Schema | License | Source |
|---|---|---|---|---|
| `en` | 235,886 | `word` | NLTK (other) | NLTK words corpus |
| `en-basic` | 850 | `word` | Public domain | Ogden Basic English (1930) |
| `ngsl` | 2,809 | `word, rank, sfi, freq_per_million` | CC-BY-SA 4.0 | New General Service List 1.2 |
| `toeic` | 1,250 | `word, rank, sfi, freq_per_million` | CC-BY-SA 4.0 | TOEIC Service List 1.2 |
| `nawl` | 963 | `word, rank, band, sfi, freq_per_million` | CC-BY-SA 4.0 | New Academic Word List 1.2 |
| `bsl` | 1,744 | `word, rank, band, sfi, freq_per_million` | CC-BY-SA 4.0 | Business Service List 1.2 |
| `opinion-positive` | 2,006 | `word` | CC-BY 4.0 | Hu & Liu Opinion Lexicon |
| `opinion-negative` | 4,783 | `word` | CC-BY 4.0 | Hu & Liu Opinion Lexicon |
## See Also
These related word list datasets are also accessible via `nltk.corpus.words.words()`:
| Dataset | Contents | NLTK access |
|---|---|---|
| [nltk-data-hub/dolch](https://huggingface.co/datasets/nltk-data-hub/dolch) | 315 Dolch sight words, 8 POS configs | `words.words("dolch")`, `words.words("dolch-verbs")`, … |
| [nltk-data-hub/swadesh](https://huggingface.co/datasets/nltk-data-hub/swadesh) | 207 Swadesh concepts × 24 languages | `words.words("swadesh-en")`, `words.words("swadesh-de")`, … |
## Schemas
**`en`, `en-basic`, `opinion-positive`, `opinion-negative`** — word only
| Column | Type | Description |
|---|---|---|
| `word` | string | The word |
**`ngsl` and `toeic`** — frequency metadata, no band
| Column | Type | Description |
|---|---|---|
| `word` | string | Headword / lemma |
| `rank` | int | Frequency rank (1 = most frequent) |
| `sfi` | float | Standard Frequency Index |
| `freq_per_million` | float | Adjusted frequency per million words |
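
The `sfi` and `freq_per_million` columns are usually related by a fixed formula. The sketch below assumes the conventional Carroll definition, SFI = 10 × (log10(U) + 4), where U is the frequency per million words; this card does not state the formula, so treat it as an assumption and check it against a few rows before relying on it. The function names are hypothetical.

```python
import math


def sfi_from_freq_per_million(u: float) -> float:
    """Assumed Carroll-style SFI: 10 * (log10(U) + 4), with U per million words."""
    if u <= 0:
        raise ValueError("frequency per million must be positive")
    return 10.0 * (math.log10(u) + 4.0)


def freq_per_million_from_sfi(sfi: float) -> float:
    """Inverse of the assumed formula: U = 10 ** (SFI / 10 - 4)."""
    return 10.0 ** (sfi / 10.0 - 4.0)
```

Under this definition, a word that occurs once per million words has an SFI of 40, and each additional 10 points corresponds to a tenfold increase in frequency.
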
**`nawl` and `bsl`** — frequency metadata + pedagogical band
| Column | Type | Description |
|---|---|---|
| `word` | string | Headword / lemma |
| `rank` | int | Frequency rank within this list |
| `band` | int | Pedagogical band grouping (lower = more frequent) |
| `sfi` | float | Standard Frequency Index |
| `freq_per_million` | float | Adjusted frequency per million words |
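
As an illustration of how the `rank` and `band` columns can be used, here is a minimal sketch that works on any iterable of row dictionaries with the columns described above. The helper names `top_by_rank` and `words_in_band` are hypothetical and not part of this dataset or of NLTK.

```python
from typing import Any, Iterable, Mapping

Row = Mapping[str, Any]


def top_by_rank(rows: Iterable[Row], n: int) -> list[str]:
    """Return the n headwords with the smallest rank (rank 1 first)."""
    ordered = sorted(rows, key=lambda r: r["rank"])
    return [r["word"] for r in ordered[:n]]


def words_in_band(rows: Iterable[Row], band: int) -> list[str]:
    """Return the headwords in one pedagogical band, ordered by rank.

    Only meaningful for configs that have a band column (nawl, bsl).
    """
    selected = sorted(
        (r for r in rows if r["band"] == band),
        key=lambda r: r["rank"],
    )
    return [r["word"] for r in selected]
```

Iterating over a split loaded with `load_dataset("nltk-data-hub/words", "nawl", split="words")` should yield rows of this shape, so the result can be passed to either helper directly.
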
## Usage
```python
from datasets import load_dataset

# Each config is loaded by name and exposes a single split called "words".
ngsl = load_dataset("nltk-data-hub/words", "ngsl")
nawl = load_dataset("nltk-data-hub/words", "nawl")
positive = load_dataset("nltk-data-hub/words", "opinion-positive")
negative = load_dataset("nltk-data-hub/words", "opinion-negative")
```
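
One common use of a loaded word list is estimating how much of a text it covers. The following is a rough sketch, not a method prescribed by this dataset: `coverage` is a hypothetical helper, tokens are matched case-insensitively, and no lemmatization is applied, so inflected forms of a headword will count as uncovered.

```python
from typing import Iterable


def coverage(tokens: Iterable[str], vocabulary: Iterable[str]) -> float:
    """Fraction of tokens that appear in the vocabulary (case-insensitive)."""
    vocab = {w.lower() for w in vocabulary}
    lowered = [t.lower() for t in tokens]
    if not lowered:
        return 0.0
    hits = sum(1 for t in lowered if t in vocab)
    return hits / len(lowered)
```

With a split loaded as `ds = load_dataset("nltk-data-hub/words", "ngsl", split="words")`, the vocabulary argument can be `ds["word"]` and the tokens can come from any tokenizer.
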
## Via NLTK
```python
import nltk
nltk.download("words", hf=True)
nltk.corpus.words.words("ngsl") # 2,809 words, frequency order
nltk.corpus.words.words("nawl") # 963 academic words
nltk.corpus.words.words("bsl") # 1,744 business words
nltk.corpus.words.words("toeic") # 1,250 TOEIC words
nltk.corpus.words.words("opinion-positive") # 2,006 positive opinion words
nltk.corpus.words.words("opinion-negative") # 4,783 negative opinion words
nltk.corpus.words.words("en") # 235,886 words
nltk.corpus.words.words("en-basic") # Ogden 850
# Routed to nltk-data-hub/dolch:
nltk.corpus.words.words("dolch") # 315 Dolch sight words
nltk.corpus.words.words("dolch-verbs") # 92 Dolch verbs
# Routed to nltk-data-hub/swadesh:
nltk.corpus.words.words("swadesh-en") # 207 English Swadesh words
nltk.corpus.words.words("swadesh-de") # 207 German Swadesh words
```
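
The two opinion lists lend themselves to a simple lexicon-count classifier. The sketch below is one naive way to use them and is not taken from this card: the function names are hypothetical, matching is case-insensitive, and negation and word sense are ignored.

```python
from typing import Iterable, Tuple


def polarity_counts(
    tokens: Iterable[str],
    positive: Iterable[str],
    negative: Iterable[str],
) -> Tuple[int, int]:
    """Count tokens found in the positive and negative lexicons."""
    pos = {w.lower() for w in positive}
    neg = {w.lower() for w in negative}
    n_pos = n_neg = 0
    for token in tokens:
        t = token.lower()
        if t in pos:
            n_pos += 1
        if t in neg:
            n_neg += 1
    return n_pos, n_neg


def polarity_label(
    tokens: Iterable[str],
    positive: Iterable[str],
    negative: Iterable[str],
) -> str:
    """Label text by whichever lexicon matches more tokens; ties are neutral."""
    n_pos, n_neg = polarity_counts(tokens, positive, negative)
    if n_pos > n_neg:
        return "positive"
    if n_neg > n_pos:
        return "negative"
    return "neutral"
```

The lexicon arguments can be the lists returned by `nltk.corpus.words.words("opinion-positive")` and `nltk.corpus.words.words("opinion-negative")`.
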
## Licenses
- `en`, `en-basic`: distributed as part of the NLTK corpus data package.
- `ngsl`, `toeic`, `nawl`, `bsl`: © Browne, Culligan & Phillips, licensed under
[CC-BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/).
- `opinion-positive`, `opinion-negative`: © Bing Liu, licensed under
[CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/).
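
Because the configs carry different licenses, it can help to check which terms apply before combining lists. The sketch below copies the license labels from the Configs table into a lookup; the names `CONFIG_LICENSES`, `licenses_for`, and `share_alike_configs` are hypothetical, and the lookup is a convenience only, not legal guidance.

```python
from typing import Iterable

# License labels copied from the Configs table in this card.
CONFIG_LICENSES = {
    "en": "NLTK (other)",
    "en-basic": "Public domain",
    "ngsl": "CC-BY-SA 4.0",
    "toeic": "CC-BY-SA 4.0",
    "nawl": "CC-BY-SA 4.0",
    "bsl": "CC-BY-SA 4.0",
    "opinion-positive": "CC-BY 4.0",
    "opinion-negative": "CC-BY 4.0",
}


def licenses_for(configs: Iterable[str]) -> set[str]:
    """Return the distinct license labels covering the given configs."""
    return {CONFIG_LICENSES[name] for name in configs}


def share_alike_configs() -> list[str]:
    """Return the configs whose label is CC-BY-SA 4.0, sorted by name."""
    return sorted(
        name for name, lic in CONFIG_LICENSES.items() if lic == "CC-BY-SA 4.0"
    )
```

For example, `licenses_for(["ngsl", "opinion-positive"])` reports that both CC-BY-SA 4.0 and CC-BY 4.0 terms apply to a combination of those two lists.
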
## Citations
```bibtex
@book{nltk,
author = {Bird, Steven and Klein, Ewan and Loper, Edward},
title = {Natural Language Processing with Python},
publisher = {O'Reilly Media},
year = {2009},
url = {https://www.nltk.org/}
}
@article{ngsl,
author = {Browne, Charles},
title = {A New General Service List: The Better Mousetrap We've Been Looking For?},
journal = {Vocabulary Learning and Instruction},
volume = {3},
number = {2},
pages = {1--10},
year = {2014},
doi = {10.7820/vli.v03.2.browne}
}
@misc{nawl,
author = {Browne, Charles and Culligan, Brent and Phillips, Joseph},
title = {New Academic Word List 1.2},
year = {2013},
url = {https://www.newgeneralservicelist.com/nawl-new-academic-word-list}
}
@misc{tsl,
author = {Browne, Charles and Culligan, Brent},
title = {TOEIC Service List 1.2},
year = {2016},
url = {https://www.newgeneralservicelist.com/toeic-service-list}
}
@misc{bsl,
author = {Browne, Charles and Culligan, Brent},
title = {Business Service List 1.2},
year = {2016},
url = {https://www.newgeneralservicelist.com/business-service-list}
}
@inproceedings{opinion_lexicon,
author = {Hu, Minqing and Liu, Bing},
title = {Mining and Summarizing Customer Reviews},
booktitle = {Proceedings of KDD-2004},
year = {2004},
url = {http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html}
}
```
Provider:
nltk-data-hub



