mjbommar/opengloss-v1.2-dictionary

Name: mjbommar/opengloss-v1.2-dictionary
Creator: mjbommar
Published: 2026-04-08 17:16:47
License: No description available

Source: Hugging Face. Updated 2026-04-08; indexed 2026-04-12.

Download link:

https://hf-mirror.com/datasets/mjbommar/opengloss-v1.2-dictionary

Description:

---
license: cc-by-4.0
task_categories:
- text-generation
- question-answering
- text-classification
- feature-extraction
language:
- en
tags:
- dictionary
- lexicon
- wordnet
- semantic-network
- knowledge-graph
- encyclopedic
- etymology
- synthetic
- education
size_categories:
- 100K<n<1M
---

# OpenGloss Dictionary v1.2 (Word-Level)

## Dataset Summary

**OpenGloss** is a synthetic encyclopedic dictionary and semantic knowledge graph for English that integrates lexicographic definitions, encyclopedic context, etymological histories, and semantic relationships in a unified resource.

This dataset provides the **word-level view**, in which each record represents one lexeme (word or multi-word expression).

### Key Statistics

- **162,314 lexemes**
- **7,798,653 semantic edges** (synonyms, antonyms, hypernyms, hyponyms, collocations, inflections)
- **162,314 entries** with encyclopedic content (100.0% coverage)
- **150,746 entries** with etymology (92.9% coverage)
- **149,734 entries** with Wikipedia frequency data (92.2% coverage)
- **100% reading level coverage** (K through PhD scale)
- **100% domain tag coverage** (10+ subject domain categories)
- **Average 3.19 senses per lexeme**
- **Average 48.0 edges per lexeme**

### What's New in v1.2?

Compared to OpenGloss v1.1:

1. **Expanded lexicon coverage**: more lexeme records and more definition-level records in the base dictionary exports
2. **Hard negative pairs dataset**: a new calibration-oriented dataset for embedding training and score separation
3. **Larger companion datasets**: expanded query examples, contrastive examples, and encyclopedia variants in the release family
4. **Gap-driven coverage expansion**: broader geography, history, civics, and related weak-domain support carried into the release
5. **Unified release family**: dictionary, definitions, query, contrastive, encyclopedia, and hard-negative datasets aligned under one version

### POS Distribution

| Part of Speech | Count |
|----------------|-------|
| noun | 134,241 |
| adjective | 55,955 |
| verb | 36,423 |
| adverb | 5,583 |
| determiner | 1,510 |
| preposition | 1,234 |
| interjection | 941 |
| pronoun | 395 |
| conjunction | 249 |
| particle | 18 |
| proper noun | 12 |
| numeral | 5 |
| proper_noun | 4 |
| prefix | 2 |
| suffix | 1 |
| adjetivo | 1 |
| sustantivo | 1 |
| abbreviation | 1 |

### Edge Type Distribution

| Relationship Type | Count |
|-------------------|-------|
| synonym | 1,512,031 |
| hyponym | 1,285,609 |
| collocation | 1,273,658 |
| hypernym | 1,018,466 |
| antonym | 1,007,062 |
| etymology_parent | 697,921 |
| inflection | 353,345 |
| derivation_noun | 279,502 |
| derivation_adjective | 163,037 |
| derivation_verb | 119,944 |
| derivation_adverb | 68,106 |
| cognate | 19,972 |

## Loading the Dataset

```python
from datasets import load_dataset

# Load the full dataset
dataset = load_dataset("mjbommar/opengloss-v1.2-dictionary")

# Access records
for record in dataset["train"]:
    print(f"Word: {record['word']}")
    print(f"Senses: {record['total_senses']}")
    print(f"Edges: {record['total_edges']}\n")
```

## Core Fields & Usage Examples

### Wikipedia Frequency Data

Filter by word importance using frequency data:

```python
# Get high-frequency words (rank 10,000 or better); skip entries without a rank
common_words = dataset["train"].filter(
    lambda x: x["wiki_frequency_rank"] is not None and x["wiki_frequency_rank"] <= 10000
)

# Sort by frequency, most frequent first
sorted_by_freq = dataset["train"].sort("wiki_frequency", reverse=True)
```

### Reading Levels

Filter vocabulary by grade level for educational applications:

```python
# Elementary (K-5)
elementary = dataset["train"].filter(lambda x: x["reading_level"] in ["K", "1", "2", "3", "4", "5"])

# Middle school (6-8)
middle_school = dataset["train"].filter(lambda x: x["reading_level"] in ["6", "7", "8"])

# High school (9-12)
high_school = dataset["train"].filter(lambda x: x["reading_level"] in ["9", "10", "11", "12"])

# Advanced (BS/PhD)
advanced = dataset["train"].filter(lambda x: x["reading_level"] in ["BS", "PhD"])
```

### Domain Tags

Filter by subject area for content-specific applications:

```python
# `x.get("tags") or []` also covers records where the field is present but None

# Science vocabulary
science_words = dataset["train"].filter(
    lambda x: any("science" in tag or "life-sciences" in tag for tag in (x.get("tags") or []))
)

# Technology vocabulary
tech_words = dataset["train"].filter(
    lambda x: any("technology" in tag for tag in (x.get("tags") or []))
)

# Social studies
social_studies = dataset["train"].filter(
    lambda x: any(
        tag.startswith("domain:history") or tag.startswith("domain:society")
        for tag in (x.get("tags") or [])
    )
)
```

### Etymology Segments

Access structured etymology with language trail:

```python
# Words with detailed etymology (a missing or None field counts as no etymology)
words_with_etymology = dataset["train"].filter(
    lambda x: len(x.get("etymology_segments") or []) > 0
)

# Find words from specific language origins
latin_origin = dataset["train"].filter(
    lambda x: any(
        (seg.get("language") or "").lower() == "latin"
        for seg in (x.get("etymology_segments") or [])
    )
)
```

## Citation

If you use OpenGloss in your research, please cite:

```bibtex
@misc{bommarito2025opengloss,
  title={OpenGloss: A Synthetic Encyclopedic Dictionary and Semantic Knowledge Graph},
  author={Bommarito, Michael J., II},
  year={2025},
  url={https://huggingface.co/datasets/mjbommar/opengloss-v1.2-dictionary},
  note={Dataset available under CC-BY 4.0}
}
```

## License

This dataset is released under **Creative Commons Attribution 4.0 International (CC-BY 4.0)**.
## Version History

- **v1.2** (2026-04): Expanded release with larger companion training datasets and hard-negative calibration pairs
- **v1.1** (2025-11): Release with structured morphology, etymology segments, and frequency data
- **v1.0** (2025-01): Initial release

## Acknowledgments

This dataset was generated using:

- [pydantic-ai](https://github.com/pydantic/pydantic-ai) for structured LLM generation
- OpenAI GPT models for content generation
- Anthropic Claude for quality assurance

---

*Generated from the OpenGloss v1.2 dataset.*
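
As a companion to the reading-level filters under Core Fields & Usage Examples, the four grade bands can be folded into one lookup so that a single pass assigns every record a band. This is a minimal sketch, not part of the OpenGloss release: the band names, the helpers `reading_band` and `band_counts`, and the sample records (including their reading levels) are hypothetical, and it assumes `reading_level` holds the string labels those filters use.

```python
from collections import Counter

# Band boundaries copied from the reading-level filters in this card.
# The band names themselves are illustrative, not dataset fields.
READING_BANDS = {
    "elementary": ["K", "1", "2", "3", "4", "5"],
    "middle_school": ["6", "7", "8"],
    "high_school": ["9", "10", "11", "12"],
    "advanced": ["BS", "PhD"],
}

# Invert the mapping once so each lookup is a single dict access.
LEVEL_TO_BAND = {
    level: band for band, levels in READING_BANDS.items() for level in levels
}


def reading_band(level):
    """Return the band for a reading-level label, or 'unknown' if unlisted."""
    if level is None:
        return "unknown"
    return LEVEL_TO_BAND.get(str(level).strip(), "unknown")


def band_counts(records):
    """Count records per band for any iterable of dict-like records."""
    return Counter(reading_band(r.get("reading_level")) for r in records)


# Hypothetical records standing in for rows of dataset["train"].
sample = [
    {"word": "cat", "reading_level": "K"},
    {"word": "river", "reading_level": "3"},
    {"word": "photosynthesis", "reading_level": "7"},
    {"word": "sovereignty", "reading_level": "11"},
    {"word": "eigenvalue", "reading_level": "BS"},
    {"word": "hapax", "reading_level": None},
]

for band, count in sorted(band_counts(sample).items()):
    print(f"{band}: {count}")
```

On the loaded dataset, the same function could be attached as a column with `dataset["train"].map(lambda x: {"reading_band": reading_band(x["reading_level"])})`, which replaces four separate `filter` passes with one `map` pass.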

Provider: mjbommar
