
ps-abhi/GLM-English-Vocab-Definitions

Source: Hugging Face. Updated: 2026-04-02. Indexed: 2026-04-12.
Download link:
https://hf-mirror.com/datasets/ps-abhi/GLM-English-Vocab-Definitions
Description:
---
language:
- en
license: cc-by-4.0
tags:
- synthetic
- definitions
- vocabulary
- nlp
- lexicon
- glm
annotations_creators:
- machine-generated
language_creators:
- machine-generated
pretty_name: GLM English Vocab Definitions
size_categories:
- 100K<n<1M
source_datasets:
- dwyl/english-words
task_categories:
- text-generation
- fill-mask
dataset_info:
  features:
  - name: word
    dtype: string
  - name: definitions
    sequence: string
  splits:
  - name: train
    num_examples: 316380
---

# GLM English Vocab Definitions

A comprehensive English vocabulary dataset containing **316,380 words** and **489,775 definitions**, generated synthetically using GLM-4.5.

## Dataset Description

Each entry maps an English word to one or more definitions. Definitions are complete, self-contained sentences that cover distinct meanings, encyclopedic facts, and domain-specific usage where applicable.

### Example

```json
{
  "abaris": [
    "Abaris is a legendary Hyperborean sage of ancient Greece who was said to have traveled without eating, riding on a golden arrow given to him by Apollo."
  ],
  "abarthrosis": [
    "Abarthrosis is a type of joint in which the articulating surfaces are separated by a fluid-containing cavity, allowing for free movement."
  ]
}
```

### Statistics

| Metric | Value |
|---|---|
| Total words | 316,380 |
| Total definitions | 489,775 |
| Avg. definitions per word | 1.5 |
| Language | English |

## Generation Process

1. **Source words** were obtained from [dwyl/english-words](https://github.com/dwyl/english-words) (`words_alpha.txt`), filtered to alphabetic words with length > 2 and deduplicated.
2. **Definitions were generated** using [GLM-4.5](https://huggingface.co/zai-org/GLM-4.5) with structured output via the [Instructor](https://github.com/instructor-ai/instructor) library.
3. Words were processed in **batches of 50** with up to 10 concurrent requests at temperature 0.3.
4. The model was prompted as an "Expert Lexicographer" with instructions to:
   - Provide separate definitions for distinct meanings (polysemy)
   - Include brief encyclopedic facts (historical, scientific, cultural)
   - Make each definition self-contained (explicitly includes the term)
   - Return `No_Def_Found` for invalid, vaguely archaic, or nonsense words (Existing word-definition pairs with this value have been purged.)
5. A **hallucination filter** discarded any definitions returned for words not present in the input batch.

## Intended Uses

- Training or fine-tuning language models on vocabulary and definition tasks
- Dictionary and reference applications
- NLP research (word sense disambiguation, definition generation, etc.)
- Educational tools and vocabulary builders

## Limitations

- Definitions are **synthetically generated** and may contain inaccuracies, hallucinations, or incomplete information. They should not be treated as authoritative dictionary entries.
- Some rare, archaic, or highly technical words may have imprecise definitions.
- Coverage of polysemy (multiple meanings) varies; common words may be under-represented in their number of senses compared to a professional dictionary.

## Source Data and Licensing

| Component | License |
|---|---|
| Input word list ([dwyl/english-words](https://github.com/dwyl/english-words)) | [Unlicense](https://github.com/dwyl/english-words/blob/master/LICENSE.md) (public domain) |
| Generation model (GLM-4.5) | [MIT License](https://huggingface.co/zai-org/GLM-4.5) |
| This dataset | [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/) |

## Citation

If you use this dataset, please cite it as:

```bibtex
@dataset{glm_english_vocab_definitions,
  title={GLM English Vocab Definitions},
  author={P S Abhishek},
  year={2026},
  url={https://huggingface.co/datasets/ps-abhi/GLM-English-Vocab-Definitions},
  note={Generated using GLM-4.5 from dwyl/english-words}
}
```
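
## Illustrative Sketch of the Filtering Steps

The word filtering, batching, and hallucination filter described under Generation Process can be sketched roughly as below. This is not the author's pipeline: the function names (`prepare_words`, `batched`, `filter_response`), the lowercasing step, and the shape of the model response (a dict from word to a list of definitions) are all assumptions made for illustration. Only the length > 2 rule, the batch size of 50, the `No_Def_Found` sentinel, and the rule of discarding words absent from the input batch come from the card.

```python
from typing import Dict, Iterable, Iterator, List

NO_DEF = "No_Def_Found"  # sentinel value named in the dataset card


def prepare_words(raw_words: Iterable[str]) -> List[str]:
    """Keep alphabetic words longer than 2 characters, deduplicated in first-seen order."""
    seen = set()
    kept = []
    for raw in raw_words:
        word = raw.strip().lower()  # assumption: normalise case before deduplicating
        if len(word) > 2 and word.isascii() and word.isalpha() and word not in seen:
            seen.add(word)
            kept.append(word)
    return kept


def batched(words: List[str], size: int = 50) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` words (50 per the card)."""
    for start in range(0, len(words), size):
        yield words[start:start + size]


def filter_response(batch: List[str], response: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Drop words the batch never asked for, then drop No_Def_Found placeholders."""
    allowed = set(batch)
    cleaned: Dict[str, List[str]] = {}
    for word, definitions in response.items():
        if word not in allowed:
            continue  # hallucination filter: word was not in the input batch
        real = [d for d in definitions if d != NO_DEF]
        if real:  # words left with no real definition are purged entirely
            cleaned[word] = real
    return cleaned
```

In this sketch, a response that mentions a word outside the batch, or that carries only the `No_Def_Found` placeholder for a word, contributes nothing to the output. The call to the model itself (GLM-4.5 via Instructor) is deliberately left out, since the card does not give the prompt or response schema.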
Provider:
ps-abhi