ps-abhi/GLM-English-Vocab-Definitions
Hugging Face | Updated: 2026-04-02 | Indexed: 2026-04-12
Download link:
https://hf-mirror.com/datasets/ps-abhi/GLM-English-Vocab-Definitions
Resource description:
---
language:
- en
license: cc-by-4.0
tags:
- synthetic
- definitions
- vocabulary
- nlp
- lexicon
- glm
annotations_creators:
- machine-generated
language_creators:
- machine-generated
pretty_name: GLM English Vocab Definitions
size_categories:
- 100K<n<1M
source_datasets:
- dwyl/english-words
task_categories:
- text-generation
- fill-mask
dataset_info:
  features:
  - name: word
    dtype: string
  - name: definitions
    sequence: string
  splits:
  - name: train
    num_examples: 316380
---
# GLM English Vocab Definitions
A large English vocabulary dataset containing **316,380 words** and **489,775 definitions**. The definitions were generated synthetically using GLM-4.5.
## Dataset Description
Each entry maps an English word to one or more definitions. Definitions are complete, self-contained sentences that cover distinct meanings, encyclopedic facts, and domain-specific usage where applicable.
### Example
```json
{
  "abaris": [
    "Abaris is a legendary Hyperborean sage of ancient Greece who was said to have traveled without eating, riding on a golden arrow given to him by Apollo."
  ],
  "abarthrosis": [
    "Abarthrosis is a type of joint in which the articulating surfaces are separated by a fluid-containing cavity, allowing for free movement."
  ]
}
```
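The example above shows entries as a word-to-definitions mapping, while the card's metadata declares two columns, `word` and `definitions`. The following is a minimal sketch of converting between the two shapes, assuming each word appears in exactly one row. The names `Entry`, `mapping_to_records`, and `records_to_mapping` are illustrative and not part of the dataset.

```python
from typing import Dict, List, TypedDict


class Entry(TypedDict):
    """One row in the column layout declared in the card metadata."""
    word: str
    definitions: List[str]


def mapping_to_records(mapping: Dict[str, List[str]]) -> List[Entry]:
    """Turn {"word": [definitions...]} into a list of row dictionaries."""
    return [{"word": word, "definitions": list(defs)} for word, defs in mapping.items()]


def records_to_mapping(records: List[Entry]) -> Dict[str, List[str]]:
    """Inverse conversion; assumes each word occurs in only one row."""
    return {row["word"]: list(row["definitions"]) for row in records}


# Definitions are shortened here for brevity; they are not the dataset's full text.
sample = {
    "abaris": ["Abaris is a legendary Hyperborean sage of ancient Greece."],
    "abarthrosis": ["Abarthrosis is a type of joint that allows free movement."],
}

records = mapping_to_records(sample)
for row in records:
    print(row["word"], len(row["definitions"]))
```

The round trip `records_to_mapping(mapping_to_records(sample))` returns a mapping equal to `sample`, so either shape can serve as the working representation.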
### Statistics
| Metric | Value |
|---|---|
| Total words | 316,380 |
| Total definitions | 489,775 |
| Avg. definitions per word | 1.5 |
| Language | English |
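The average in the table follows from the two totals: 489,775 / 316,380 is about 1.548, which the table rounds to 1.5. Below is a small sketch of how the same figures could be recomputed from a word-to-definitions mapping. The function name `summarize` and the toy data are illustrative assumptions.

```python
from typing import Dict, List


def summarize(mapping: Dict[str, List[str]]) -> Dict[str, float]:
    """Count words and definitions and compute the mean definitions per word."""
    total_words = len(mapping)
    total_definitions = sum(len(defs) for defs in mapping.values())
    average = total_definitions / total_words if total_words else 0.0
    return {
        "total_words": total_words,
        "total_definitions": total_definitions,
        "avg_definitions_per_word": round(average, 2),
    }


# Check of the reported average using the totals stated in the table.
reported_avg = 489_775 / 316_380
print(round(reported_avg, 3))

# Toy mapping (placeholder text, not dataset content): 2 words, 3 definitions.
toy = {
    "alpha": ["Alpha is placeholder sense one.", "Alpha is placeholder sense two."],
    "beta": ["Beta is a placeholder sense."],
}
print(summarize(toy))
```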
## Generation Process
1. **Source words** were obtained from [dwyl/english-words](https://github.com/dwyl/english-words) (`words_alpha.txt`), filtered to alphabetic words with length > 2 and deduplicated.
2. **Definitions were generated** using [GLM-4.5](https://huggingface.co/zai-org/GLM-4.5) with structured output via the [Instructor](https://github.com/instructor-ai/instructor) library.
3. Words were processed in **batches of 50** with up to 10 concurrent requests at temperature 0.3.
4. The model was prompted as an "Expert Lexicographer" with instructions to:
   - Provide separate definitions for distinct meanings (polysemy)
   - Include brief encyclopedic facts (historical, scientific, cultural)
   - Make each definition self-contained (each one explicitly includes the term)
   - Return `No_Def_Found` for invalid, vaguely archaic, or nonsense words (word-definition pairs with this value have since been purged from the dataset)
5. A **hallucination filter** discarded any definitions returned for words not present in the input batch.
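The steps above can be sketched roughly as follows. This is not the author's pipeline: `generate` is a hypothetical callable standing in for the GLM-4.5 call made through Instructor, `fake_generate` is a stub for demonstration, and concurrency and temperature settings are omitted. How `No_Def_Found` appears in the model response (here, as a definition string) is also an assumption.

```python
from typing import Callable, Dict, Iterator, List

BATCH_SIZE = 50          # batch size stated in the card
NO_DEF = "No_Def_Found"  # sentinel named in the card

Response = Dict[str, List[str]]


def prepare_words(raw_lines: List[str]) -> List[str]:
    """Keep alphabetic words longer than 2 characters, deduplicated in first-seen order."""
    seen = set()
    words = []
    for line in raw_lines:
        word = line.strip()
        if word.isalpha() and len(word) > 2 and word not in seen:
            seen.add(word)
            words.append(word)
    return words


def batched(words: List[str], size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` words."""
    for start in range(0, len(words), size):
        yield words[start:start + size]


def filter_batch(batch: List[str], response: Response) -> Response:
    """Drop words that were not in the input batch and purge the sentinel."""
    allowed = set(batch)
    cleaned: Response = {}
    for word, defs in response.items():
        if word not in allowed:
            continue  # hallucination filter: the word was never requested
        kept = [d for d in defs if d != NO_DEF]
        if kept:
            cleaned[word] = kept
    return cleaned


def run(words: List[str], generate: Callable[[List[str]], Response]) -> Response:
    """Process all words batch by batch, sequentially, and merge the filtered results."""
    results: Response = {}
    for batch in batched(words):
        results.update(filter_batch(batch, generate(batch)))
    return results


def fake_generate(batch: List[str]) -> Response:
    """Stub model: placeholder definitions, one sentinel, one unrequested word."""
    out = {w: [f"{w.capitalize()} is a placeholder definition."] for w in batch}
    if "qqq" in out:
        out["qqq"] = [NO_DEF]
    out["zzzz"] = ["Zzzz is an entry the batch never asked for."]
    return out


words = prepare_words(["ab", "cat", "cat", "d0g", "tree\n", "qqq"])
print(run(words, fake_generate))
```

With the stub, only `cat` and `tree` survive: `ab` and `d0g` fail the word filter, `qqq` is purged as `No_Def_Found`, and `zzzz` is discarded because it was not in the batch.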
## Intended Uses
- Training or fine-tuning language models on vocabulary and definition tasks
- Dictionary and reference applications
- NLP research (word sense disambiguation, definition generation, etc.)
- Educational tools and vocabulary builders
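As an illustration of the first use case, the sketch below turns entries into prompt/completion pairs for definition generation and into masked sentences for fill-mask training. It relies on definitions explicitly containing the term, as the card describes. The prompt wording, the `[MASK]` token, and the function names are assumptions; the right mask token depends on the tokenizer in use.

```python
import re
from typing import Dict, List

MASK = "[MASK]"  # assumed mask token; substitute the tokenizer's own


def to_generation_pairs(mapping: Dict[str, List[str]]) -> List[Dict[str, str]]:
    """One prompt/completion pair per definition."""
    return [
        {"prompt": f"Define the word: {word}", "completion": definition}
        for word, defs in mapping.items()
        for definition in defs
    ]


def to_masked_examples(mapping: Dict[str, List[str]]) -> List[Dict[str, str]]:
    """Mask the first whole-word occurrence of the term in each definition.

    Definitions where the exact term does not appear (for example, only an
    inflected form is used) are skipped rather than guessed at.
    """
    examples = []
    for word, defs in mapping.items():
        pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
        for definition in defs:
            masked, replaced = pattern.subn(MASK, definition, count=1)
            if replaced:
                examples.append({"text": masked, "label": word})
    return examples


# Shortened definition for illustration; not the dataset's full text.
sample = {"abaris": ["Abaris is a legendary Hyperborean sage of ancient Greece."]}

print(to_generation_pairs(sample))
print(to_masked_examples(sample))
```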
## Limitations
- Definitions are **synthetically generated** and may contain inaccuracies, hallucinations, or incomplete information. They should not be treated as authoritative dictionary entries.
- Some rare, archaic, or highly technical words may have imprecise definitions.
- Coverage of polysemy (multiple meanings) varies; common words may be under-represented in their number of senses compared to a professional dictionary.
## Source Data and Licensing
| Component | License |
|---|---|
| Input word list ([dwyl/english-words](https://github.com/dwyl/english-words)) | [Unlicense](https://github.com/dwyl/english-words/blob/master/LICENSE.md) (public domain) |
| Generation model (GLM-4.5) | [MIT License](https://huggingface.co/zai-org/GLM-4.5) |
| This dataset | [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/) |
## Citation
If you use this dataset, please cite it as:
```bibtex
@dataset{glm_english_vocab_definitions,
title={GLM English Vocab Definitions},
author={P S Abhishek},
year={2026},
url={https://huggingface.co/datasets/ps-abhi/GLM-English-Vocab-Definitions},
note={Generated using GLM-4.5 from dwyl/english-words}
}
```
Provided by:
ps-abhi



