mjbommar/opengloss-v1.1-dictionary

Name: mjbommar/opengloss-v1.1-dictionary
Creator: mjbommar
Published: 2025-12-02 23:55:07
License: CC-BY 4.0

Source: Hugging Face. Updated 2025-12-02; indexed 2025-12-20.

Download link:

https://hf-mirror.com/datasets/mjbommar/opengloss-v1.1-dictionary

Description:

---
license: cc-by-4.0
task_categories:
- text-generation
- question-answering
- text-classification
- feature-extraction
language:
- en
tags:
- dictionary
- lexicon
- wordnet
- semantic-network
- knowledge-graph
- encyclopedic
- etymology
- synthetic
- education
size_categories:
- 100K<n<1M
---

# OpenGloss Dictionary v1.1 (Word-Level)

## Dataset Summary

**OpenGloss** is a synthetic encyclopedic dictionary and semantic knowledge graph for English that integrates lexicographic definitions, encyclopedic context, etymological histories, and semantic relationships in a unified resource.

This dataset provides the **word-level view**, where each record represents one lexeme (a word or multi-word expression).

### Key Statistics

- **150,637 lexemes**
- **7,701,312 semantic edges** (synonyms, antonyms, hypernyms, hyponyms, collocations, inflections)
- **150,637 entries** with encyclopedic content (100.0% coverage)
- **150,637 entries** with etymology (100.0% coverage)
- **149,734 entries** with Wikipedia frequency data (99.4% coverage)
- **100% reading level coverage** (K through PhD scale)
- **100% domain tag coverage** (10+ subject domain categories)
- **Average 3.36 senses per lexeme**
- **Average 51.1 edges per lexeme**

### What's New in v1.1?

Compared to OpenGloss v1.0 (150,101 words):

1. **Wikipedia Frequency Data** (99.4% coverage): Raw occurrence counts and frequency ranks from Wikipedia, enabling importance-based filtering and sorting
2. **Educational Reading Levels** (100% coverage): Lexemes tagged with grade levels (K, 1-12, BS, PhD) for curriculum alignment and differentiated instruction
3. **Domain Tags** (100% coverage): Subject-area classification (language, science, technology, society, history, etc.) for content-specific applications
4. **Structured Etymology Segments**: Detailed historical trail with language, era, gloss, and citation sources (enhanced from v1.0's text summaries)
5. **Structured Morphology**: 7 inflection types (plural, past_tense, comparative, etc.) + 4 derivation types per POS
6. **Rich Edge Metadata**: Semantic relationships now include domain context and educational features
7. **Hierarchical POS Entries**: New structured format alongside backward-compatible flattened senses
8. **Audit Timestamps**: `created_at`/`updated_at` fields for version tracking and data provenance

### POS Distribution

| Part of Speech | Count |
|----------------|-------|
| noun | 122,564 |
| adjective | 55,905 |
| verb | 36,420 |
| adverb | 5,583 |
| determiner | 1,510 |
| preposition | 1,234 |
| interjection | 938 |
| pronoun | 395 |
| conjunction | 249 |
| particle | 18 |
| proper noun | 9 |
| numeral | 5 |
| proper_noun | 4 |
| prefix | 2 |
| suffix | 1 |
| abbreviation | 1 |
| adjetivo | 1 |
| sustantivo | 1 |

### Edge Type Distribution

| Relationship Type | Count |
|-------------------|-------|
| synonym | 1,485,130 |
| hyponym | 1,285,557 |
| collocation | 1,227,353 |
| antonym | 1,007,039 |
| hypernym | 994,981 |
| etymology_parent | 697,485 |
| inflection | 353,210 |
| derivation_noun | 279,501 |
| derivation_adjective | 163,035 |
| derivation_verb | 119,944 |
| derivation_adverb | 68,105 |
| cognate | 19,972 |

## Loading the Dataset

```python
from datasets import load_dataset

# Load the full dataset
dataset = load_dataset("mjbommar/opengloss-v1.1-dictionary")

# Access records
for record in dataset["train"]:
    print(f"Word: {record['word']}")
    print(f"Senses: {record['total_senses']}")
    print(f"Edges: {record['total_edges']}\n")
```

## New v1.1 Fields & Usage Examples

### Wikipedia Frequency Data

Filter by word importance using frequency data:

```python
# Get high-frequency words (top 10,000)
common_words = dataset["train"].filter(
    lambda x: x["wiki_frequency_rank"] is not None and x["wiki_frequency_rank"] <= 10000
)

# Sort by frequency
sorted_by_freq = dataset["train"].sort("wiki_frequency", reverse=True)
```

### Reading Levels

Filter vocabulary by grade level for educational applications:

```python
# Elementary (K-5)
elementary = dataset["train"].filter(lambda x: x["reading_level"] in ["K", "1", "2", "3", "4", "5"])

# Middle school (6-8)
middle_school = dataset["train"].filter(lambda x: x["reading_level"] in ["6", "7", "8"])

# High school (9-12)
high_school = dataset["train"].filter(lambda x: x["reading_level"] in ["9", "10", "11", "12"])

# Advanced (BS/PhD)
advanced = dataset["train"].filter(lambda x: x["reading_level"] in ["BS", "PhD"])
```

### Domain Tags

Filter by subject area for content-specific applications:

```python
# Science vocabulary
science_words = dataset["train"].filter(
    lambda x: any("science" in tag or "life-sciences" in tag for tag in x.get("tags", []))
)

# Technology vocabulary
tech_words = dataset["train"].filter(
    lambda x: any("technology" in tag for tag in x.get("tags", []))
)

# Social studies
social_studies = dataset["train"].filter(
    lambda x: any(tag.startswith("domain:history") or tag.startswith("domain:society") for tag in x.get("tags", []))
)
```

### Etymology Segments

Access structured etymology with language trail:

```python
# Words with detailed etymology
words_with_etymology = dataset["train"].filter(lambda x: len(x.get("etymology_segments", [])) > 0)

# Find words from specific language origins
latin_origin = dataset["train"].filter(
    lambda x: any(seg.get("language", "").lower() == "latin" for seg in x.get("etymology_segments", []))
)
```

## Citation

If you use OpenGloss in your research, please cite:

```bibtex
@misc{bommarito2025opengloss,
  title={OpenGloss: A Synthetic Encyclopedic Dictionary and Semantic Knowledge Graph},
  author={Bommarito, Michael J., II},
  year={2025},
  url={https://huggingface.co/datasets/mjbommar/opengloss-v1.1-dictionary},
  note={Dataset available under CC-BY 4.0}
}
```

## License

This dataset is released under **Creative Commons Attribution 4.0 International (CC-BY 4.0)**.

## Version History

- **v1.1** (2025-11): Enhanced release with structured morphology, etymology segments, and frequency data
- **v1.0** (2025-01): Initial release

## Acknowledgments

This dataset was generated using:

- [pydantic-ai](https://github.com/pydantic/pydantic-ai) for structured LLM generation
- OpenAI GPT models for content generation
- Anthropic Claude for quality assurance

---

*Generated from the OpenGloss v1.1 dataset.*
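
The frequency, reading-level, and domain-tag filters from the usage examples can be combined into a single predicate. The following is a minimal sketch, not part of the dataset card: the `matches` helper, the ordinal ranking of reading levels, the `domain:<name>` tag prefix, and the sample records are assumptions made for illustration. Only the field names `word`, `reading_level`, `wiki_frequency_rank`, and `tags` come from the card.

```python
from typing import Any, Mapping, Optional

# Reading levels named in the card (K, 1-12, BS, PhD), placed on an ordinal
# scale. The numeric ordering is an assumption made for this sketch.
READING_LEVELS = ["K"] + [str(grade) for grade in range(1, 13)] + ["BS", "PhD"]
LEVEL_ORDER = {label: index for index, label in enumerate(READING_LEVELS)}


def matches(
    record: Mapping[str, Any],
    max_level: str = "5",
    max_rank: int = 10000,
    domain: Optional[str] = None,
) -> bool:
    """Return True if a record passes all three filters (hypothetical helper)."""
    # Reading level must be known and no harder than max_level.
    level = LEVEL_ORDER.get(record.get("reading_level"))
    if level is None or level > LEVEL_ORDER[max_level]:
        return False

    # Frequency rank may be missing for a small share of entries.
    rank = record.get("wiki_frequency_rank")
    if rank is None or rank > max_rank:
        return False

    # Optional domain check; assumes tags look like "domain:<name>".
    if domain is not None:
        tags = record.get("tags") or []
        return any(tag.startswith(f"domain:{domain}") for tag in tags)
    return True


# Made-up records shaped like dataset rows; the values are NOT real data.
sample = [
    {"word": "cat", "reading_level": "K", "wiki_frequency_rank": 4200,
     "tags": ["domain:life-sciences"]},
    {"word": "treaty", "reading_level": "7", "wiki_frequency_rank": 3100,
     "tags": ["domain:history"]},
    {"word": "ontology", "reading_level": "PhD", "wiki_frequency_rank": 25000,
     "tags": ["domain:philosophy"]},
    {"word": "zyzzyva", "reading_level": "BS", "wiki_frequency_rank": None,
     "tags": None},
]

selected = [r["word"] for r in sample if matches(r, max_level="8")]
print(selected)  # ['cat', 'treaty']
```

Because the predicate takes a plain mapping, it should also work as the callable passed to `dataset["train"].filter(...)`, for example `lambda x: matches(x, max_level="8", domain="history")`, provided the real tag values follow the assumed prefix format.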
