mjbommar/opengloss-dictionary-definitions

Source: Hugging Face. Updated 2025-11-23; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/mjbommar/opengloss-dictionary-definitions
Resource description:
---
license: cc-by-4.0
task_categories:
- text-generation
- question-answering
- text-classification
- feature-extraction
language:
- en
tags:
- dictionary
- lexicon
- wordnet
- semantic-network
- knowledge-graph
- encyclopedic
- etymology
- synthetic
- education
size_categories:
- 100K<n<1M
---

# OpenGloss Dictionary (Definition-Level)

## Dataset Summary

**OpenGloss** is a synthetic encyclopedic dictionary and semantic knowledge graph for English that integrates lexicographic definitions, encyclopedic context, etymological histories, and semantic relationships in a unified resource.

This dataset provides the **definitions-level view** where each record represents one sense definition.

### Key Statistics

- **536,829 sense definitions** across 150,101 English lexemes
- **9.1 million semantic edges** (synonyms, antonyms, hypernyms, hyponyms, collocations, inflections)
- **1 million usage examples** demonstrating words in context
- **3 million collocations** showing common word combinations
- **60 million words of encyclopedic content** (200-400 words per entry, 99.7% coverage)
- **Etymology trails** for 97.5% of entries documenting historical development
- **Average 3.58 senses per lexeme**, balancing granularity with usability

### What Makes OpenGloss Unique?

Unlike traditional computational lexicons:

1. **Integrated Content**: Each entry combines definitions, examples, semantic relationships, morphology, collocations, encyclopedic context, and etymology
2. **Pedagogical Focus**: Designed for K-12 education and vocabulary learning with age-appropriate content
3. **Rich Connectivity**: Near-universal semantic relationship coverage (99.7% of senses have synonyms, hypernyms, and examples)
4. **Multi-word Expressions**: 37.3% of lexemes are multi-word phrases reflecting natural language usage
5. **Synthetic Generation**: Created via multi-agent LLM pipeline with schema validation in <1 week for <$1,000

## Dataset Structure

### Data Format

This dataset is provided as JSONL (JSON Lines), with each line containing one complete record.

### Definition-Level Schema

Each record represents a single sense definition (one meaning of one part of speech).

**Core Fields:**
- `id`: Unique identifier `{word}_{pos}_{sense_index}` (e.g., "algorithm_noun_0")
- `word`: The lexeme string
- `part_of_speech`: POS tag for this sense
- `sense_index`: 0-indexed sense number within this POS
- `global_sense_index`: 0-indexed across all POS
- `text`: Markdown rendering of this specific sense (optional field)

**Definition & Semantics:**
- `definition`: The core definition text
- `synonyms`: Sense-specific synonyms
- `antonyms`: Sense-specific antonyms
- `hypernyms`: Broader concepts (ordered)
- `hyponyms`: Narrower concepts
- `examples`: Usage examples for this sense

**Morphology (POS-specific):**
- `base_form`: Canonical form
- `inflections`: Inflected forms
- `derivations`: Derived forms
- `collocations`: Common collocations

**Graph Edges:**
- `sense_edges`: Semantic relationships specific to this sense
- `pos_level_edges`: POS-level relationships (collocations, inflections)
- `total_edges`: Count of all edges

**Word-Level Context:**
- `total_senses_for_word`: How many senses this word has
- `total_pos_for_word`: How many POS this word has
- `all_pos_for_word`: List of all POS
- `is_stopword`: Boolean classification

**Metadata:**
- `has_etymology`: Boolean flag
- `has_encyclopedia`: Boolean flag
- `processed_at`: ISO timestamp

### Example Record

```json
{
  "id": "algorithm_noun_0",
  "word": "algorithm",
  "part_of_speech": "noun",
  "sense_index": 0,
  "global_sense_index": 0,
  "text": "## algorithm (noun) - Sense 1\n\n**Definition:** A finite, stepwise procedure...",
  "definition": "A finite, stepwise procedure for solving a problem or completing a computation.",
  "synonyms": ["procedure", "process", "method", "routine"],
  "antonyms": [],
  "hypernyms": ["procedure", "technique", "system"],
  "hyponyms": ["sorting algorithm", "search algorithm"],
  "examples": [
    "The student traced each algorithm step to verify the answer.",
    "We compared an arithmetic algorithm with a geometric approach."
  ],
  "base_form": "algorithm",
  "inflections": ["algorithms"],
  "derivations": ["algorithmic", "algorithmically"],
  "collocations": ["algorithm design", "sorting algorithm"],
  "sense_edges": [...],
  "pos_level_edges": [...],
  "total_edges": 12,
  "total_senses_for_word": 2,
  "total_pos_for_word": 1,
  "all_pos_for_word": ["noun"],
  "is_stopword": false,
  "has_etymology": true,
  "has_encyclopedia": true,
  "processed_at": "2025-11-16T15:18:12.341591"
}
```

### Use Cases

This **definition-level dataset** is ideal for:

- **Word Sense Disambiguation (WSD)**: Training and evaluation
- **Definition Generation**: Learning to generate dictionary definitions
- **Semantic Similarity**: Sense-level similarity computation
- **Example Sentence Generation**: Contextual usage patterns
- **Fine-grained Vocabulary Learning**: Sense-specific instruction
- **Lexical Substitution**: Finding appropriate synonyms for specific senses
- **Relation Extraction**: Training models on semantic relationships

## Dataset Creation

### Generation Methodology

OpenGloss was created using a **multi-agent procedural generation pipeline** with:

1. **Lexeme Selection**: 150,101 lexemes from American English word lists + educational vocabulary expansion
2. **Sense Generation**: Two-agent architecture (overview + POS details) producing schema-validated definitions
3. **Graph Construction**: Deterministic edge extraction creating 9.1M semantic relationships
4. **Enrichment**: Etymology and encyclopedia agents adding contextual content

All outputs use Pydantic V2 schema validation ensuring structural consistency.

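The card does not show how the deterministic edge extraction in step 3 works. As a rough illustration only, the sketch below derives `(source, relation, target)` triples from the list-valued relation fields of a single record. The function name `extract_edges`, the triple layout, and the singular relation labels are assumptions made for this example; they are not the pipeline's actual implementation or the real format of `sense_edges`.

```python
import json
from typing import Iterator, Tuple

# Relation fields listed in the definition-level schema of this card.
RELATION_FIELDS = ("synonyms", "antonyms", "hypernyms", "hyponyms")


def extract_edges(record: dict) -> Iterator[Tuple[str, str, str]]:
    """Yield (source_id, relation, target) triples for one sense record.

    Hypothetical helper: the real pipeline's edge format is not documented here.
    """
    for field in RELATION_FIELDS:
        # Assumed relation label: the field name without its plural "s".
        relation = field[:-1]
        # Treat a missing or null field as an empty list.
        for target in record.get(field) or []:
            yield (record["id"], relation, target)


# One JSONL line, abbreviated from the example record in this card.
line = json.dumps({
    "id": "algorithm_noun_0",
    "synonyms": ["procedure", "process", "method", "routine"],
    "antonyms": [],
    "hypernyms": ["procedure", "technique", "system"],
    "hyponyms": ["sorting algorithm", "search algorithm"],
})

edges = list(extract_edges(json.loads(line)))
for edge in edges:
    print(edge)
```

Because the function only reads fields already present in a record and iterates them in a fixed order, running it twice on the same input yields the same triples, which is the sense of "deterministic" assumed here.
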
### Models and Infrastructure

- **Generation**: OpenAI GPT-5-nano via pydantic-ai
- **Quality Assurance**: Claude Sonnet 4.5
- **Cost**: <$1,000 total API spend
- **Time**: <96 hours wall-clock time
- **Validation**: 100% edge target validity, automatic retry on malformed outputs

### Quality Characteristics

**Strengths:**
- Comprehensive coverage (99.7% encyclopedia, 97.5% etymology)
- Consistent schema and formatting
- Rich semantic connectivity (avg 17 edges per sense)
- Integrated multi-dimensional content
- Rapid iteration capability

**Limitations:**
- **Synthetic generation**: Reflects LLM training data patterns and biases
- **Not expert-validated**: Unlike manually curated resources
- **Potential inaccuracies**: Especially in technical domains and etymology
- **Contemporary bias**: May lack historical usage nuances
- **Schema constraints**: Fixed relationship types may miss subtle semantic distinctions

### Appropriate Use Cases

✅ **Recommended for:**
- Educational technology and vocabulary learning
- Rapid prototyping of lexical applications
- Semantic feature extraction for NLP
- Benchmark dataset for definition generation
- Resource augmentation (combining with other datasets)
- Research on synthetic knowledge resources

⚠️ **Use with caution for:**
- Authoritative reference (verify critical information)
- Fine-grained semantic analysis requiring expert validation
- Historical linguistics research (etymology is plausible but not scholarly)
- Domain-specific terminology (may lack precision)

## Comparison with Other Resources

| Resource | Senses | Lexemes | Multi-word | Encyclopedic | Etymology | Cost | Update Cycle |
|----------|--------|---------|------------|--------------|-----------|------|--------------|
| **OpenGloss** | **537K** | **150K** | **37.3%** | **99.7%** | **97.5%** | **<$1K** | **<1 week** |
| WordNet 3.1 | 117K | 155K | ~30% | ✗ | ✗ | Manual | Years |
| Open English WordNet | 120K | 147K | ~30% | ✗ | ✗ | Manual | Ongoing |
| BabelNet | 23M | 23M | Yes | Partial | ✗ | Integration | Ongoing |
| ConceptNet | ~1.5M | ~800K | Yes | ✗ | ✗ | Crowdsourced | Ongoing |

OpenGloss provides about **4.6× as many sense definitions** as WordNet while adding encyclopedic and etymological content absent from computational lexicons.

**Overlap Analysis:**
- OpenGloss ∩ WordNet: 38% vocabulary overlap
- Each contributes distinct lexicographic priorities
- OpenGloss emphasizes pedagogical vocabulary and multi-word expressions
- Complementary rather than redundant coverage

## Loading the Dataset

```python
from datasets import load_dataset

# Load the full dataset
dataset = load_dataset("mjbommar/opengloss-dictionary-definitions")

# Access records (each record is one sense definition)
for record in dataset["train"]:
    print(f"Word: {record['word']}")
    print(f"Definition: {record['definition']}")
    print(f"Edges: {record['total_edges']}\n")
```

### Filtering Examples

```python
# Filter by part of speech
nouns = dataset["train"].filter(lambda x: x["part_of_speech"] == "noun")

# Find senses of highly polysemous words (five or more senses)
polysemous = dataset["train"].filter(lambda x: x["total_senses_for_word"] >= 5)

# Get entries with encyclopedic content
with_encyclopedia = dataset["train"].filter(lambda x: x["has_encyclopedia"])
```

## Citation

If you use OpenGloss in your research, please cite:

```bibtex
@misc{bommarito2025opengloss,
  title={OpenGloss: A Synthetic Encyclopedic Dictionary and Semantic Knowledge Graph},
  author={Bommarito, Michael J., II},
  year={2025},
  url={https://huggingface.co/datasets/mjbommar/opengloss-dictionary-definitions},
  note={Dataset available under CC-BY 4.0}
}
```

## License

This dataset is released under **Creative Commons Attribution 4.0 International (CC-BY 4.0)**.

You are free to:
- **Share**: Copy and redistribute the material
- **Adapt**: Remix, transform, and build upon the material

Under the following terms:
- **Attribution**: You must give appropriate credit and indicate if changes were made

## Additional Resources

- 📄 **Paper**: Full methodology and analysis (available on arXiv)
- 💾 **Alternative View**: [Word-level dataset](https://huggingface.co/datasets/mjbommar/opengloss-dictionary)
- 🔗 **Source Code**: [Generation pipeline](https://github.com/mjbommar/opengloss) (if applicable)
- 📊 **Statistics**: See paper Section 4 for detailed dataset statistics

## Version History

- **v1.0** (2025-01): Initial release
  - 150,101 lexemes, 536,829 senses
  - 9.1M semantic edges
  - 99.7% encyclopedic coverage, 97.5% etymology coverage

## Acknowledgments

This dataset was generated using:
- [pydantic-ai](https://github.com/pydantic/pydantic-ai) for structured LLM generation
- OpenAI GPT-5-nano for content generation
- Anthropic Claude Sonnet 4.5 for quality assurance

Portions of this work were prepared with assistance from large language models. The author is solely responsible for all content, including any errors or omissions.

## Contact

For questions, issues, or feedback:
- **Email**: michael.bommarito@gmail.com
- **Dataset Issues**: Use the Hugging Face dataset discussion board

---

*Generated from the OpenGloss v1.0 dataset. Last updated: 2025-01*
Provider:
mjbommar