
mjbommar/opengloss-v1.3-dictionary

Source: Hugging Face. Updated 2026-04-12; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/mjbommar/opengloss-v1.3-dictionary
Resource description:
---
license: cc-by-4.0
task_categories:
- text-generation
- question-answering
- text-classification
- feature-extraction
language:
- en
tags:
- dictionary
- lexicon
- wordnet
- semantic-network
- knowledge-graph
- encyclopedic
- etymology
- synthetic
- education
size_categories:
- 100K<n<1M
---

# OpenGloss Dictionary v1.3 (Word-Level)

## Dataset Summary

**OpenGloss** is a synthetic encyclopedic dictionary and semantic knowledge graph for English that integrates lexicographic definitions, encyclopedic context, etymological histories, and semantic relationships in a unified resource.

This dataset provides the **word-level view**, where each record represents one lexeme (word or multi-word expression).

### Key Statistics

- **205,988 lexemes**
- **8,479,875 semantic edges** (synonyms, antonyms, hypernyms, hyponyms, collocations, inflections)
- **205,983 entries** with encyclopedic content (100.0% coverage)
- **194,420 entries** with etymology (94.4% coverage)
- **149,734 entries** with Wikipedia frequency data (72.7% coverage)
- **100% reading level coverage** (K through PhD scale)
- **100% domain tag coverage** (10+ subject domain categories)
- **Average 2.75 senses per lexeme**
- **Average 41.2 edges per lexeme**

### What's New in v1.3?

Compared to OpenGloss v1.2:

1. **Expanded lexicon coverage**: more lexeme records and more definition-level records in the base dictionary exports
2. **Hard negative pairs dataset**: a new calibration-oriented dataset for embedding training and score separation
3. **Larger companion datasets**: expanded query examples, contrastive examples, and encyclopedia variants in the release family
4. **Gap-driven coverage expansion**: broader geography, history, civics, and related weak-domain support carried into the release
5. **Unified release family**: dictionary, definitions, query, contrastive, encyclopedia, and hard-negative datasets aligned under one version

### POS Distribution

| Part of Speech | Count |
|----------------|-------|
| noun | 165,258 |
| adjective | 65,477 |
| verb | 39,532 |
| adverb | 7,121 |
| determiner | 1,511 |
| preposition | 1,237 |
| interjection | 974 |
| pronoun | 397 |
| conjunction | 251 |
| particle | 19 |
| proper noun | 13 |
| numeral | 5 |
| proper_noun | 4 |
| prefix | 2 |
| suffix | 1 |
| abbreviation | 1 |
| adjetivo | 1 |
| sustantivo | 1 |

### Edge Type Distribution

| Relationship Type | Count |
|-------------------|-------|
| synonym | 1,651,142 |
| collocation | 1,453,581 |
| hyponym | 1,317,685 |
| hypernym | 1,109,943 |
| antonym | 1,036,163 |
| etymology_parent | 882,355 |
| inflection | 378,445 |
| derivation_noun | 279,502 |
| derivation_adjective | 163,037 |
| derivation_verb | 119,944 |
| derivation_adverb | 68,106 |
| cognate | 19,972 |

## Loading the Dataset

```python
from datasets import load_dataset

# Load the full dataset
dataset = load_dataset("mjbommar/opengloss-v1.3-dictionary")

# Access records
for record in dataset["train"]:
    print(f"Word: {record['word']}")
    print(f"Senses: {record['total_senses']}")
    print(f"Edges: {record['total_edges']}\n")
```

## Core Fields & Usage Examples

### Wikipedia Frequency Data

Filter by word importance using frequency data:

```python
# Get high-frequency words (top 10,000)
common_words = dataset["train"].filter(
    lambda x: x["wiki_frequency_rank"] is not None and x["wiki_frequency_rank"] <= 10000
)

# Sort by frequency
sorted_by_freq = dataset["train"].sort("wiki_frequency", reverse=True)
```

### Reading Levels

Filter vocabulary by grade level for educational applications:

```python
# Elementary (K-5)
elementary = dataset["train"].filter(lambda x: x["reading_level"] in ["K", "1", "2", "3", "4", "5"])

# Middle school (6-8)
middle_school = dataset["train"].filter(lambda x: x["reading_level"] in ["6", "7", "8"])

# High school (9-12)
high_school = dataset["train"].filter(lambda x: x["reading_level"] in ["9", "10", "11", "12"])

# Advanced (BS/PhD)
advanced = dataset["train"].filter(lambda x: x["reading_level"] in ["BS", "PhD"])
```

### Domain Tags

Filter by subject area for content-specific applications:

```python
# Science vocabulary
science_words = dataset["train"].filter(
    lambda x: any("science" in tag or "life-sciences" in tag for tag in x.get("tags", []))
)

# Technology vocabulary
tech_words = dataset["train"].filter(
    lambda x: any("technology" in tag for tag in x.get("tags", []))
)

# Social studies
social_studies = dataset["train"].filter(
    lambda x: any(
        tag.startswith("domain:history") or tag.startswith("domain:society")
        for tag in x.get("tags", [])
    )
)
```

### Etymology Segments

Access structured etymology with language trail:

```python
# Words with detailed etymology
words_with_etymology = dataset["train"].filter(lambda x: len(x.get("etymology_segments", [])) > 0)

# Find words from specific language origins
latin_origin = dataset["train"].filter(
    lambda x: any(
        seg.get("language", "").lower() == "latin"
        for seg in x.get("etymology_segments", [])
    )
)
```

## Citation

If you use OpenGloss in your research, please cite:

```bibtex
@misc{bommarito2025opengloss,
  title={OpenGloss: A Synthetic Encyclopedic Dictionary and Semantic Knowledge Graph},
  author={Bommarito, Michael J., II},
  year={2025},
  url={https://huggingface.co/datasets/mjbommar/opengloss-v1.3-dictionary},
  note={Dataset available under CC-BY 4.0}
}
```

## License

This dataset is released under **Creative Commons Attribution 4.0 International (CC-BY 4.0)**.

## Version History

- **v1.3** (2026-04): ~206K entries with gap-fill expansion, regenerated lexical explanations with relation data, full companion dataset coverage for top 50K entries, multiple encyclopedia variants
- **v1.2** (2026-04): Expanded release with larger companion training datasets and hard-negative calibration pairs
- **v1.1** (2025-11): Release with structured morphology, etymology segments, and frequency data
- **v1.0** (2025-01): Initial release

## Acknowledgments

This dataset was generated using:

- [pydantic-ai](https://github.com/pydantic/pydantic-ai) for structured LLM generation
- OpenAI GPT models for content generation
- Anthropic Claude for quality assurance

---

*Generated from the OpenGloss v1.3 dataset.*
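
## Supplementary Sketch: Reading-Level Bands

The reading-level examples in this card run one filter per grade band. As a supplement, the same bands could be expressed as a single lookup so that each record is classified in one pass. This is a minimal sketch, not part of the dataset or its tooling: the names `READING_BANDS`, `LEVEL_TO_BAND`, and `reading_band` are hypothetical, the band labels are taken from the comments in those examples, and it assumes `reading_level` is stored as a string such as `"K"`, `"7"`, or `"PhD"`.

```python
from typing import Optional

# Bands copied from the reading-level filter examples in this card.
# The band names themselves are illustrative labels, not dataset fields.
READING_BANDS = {
    "elementary": ["K", "1", "2", "3", "4", "5"],
    "middle_school": ["6", "7", "8"],
    "high_school": ["9", "10", "11", "12"],
    "advanced": ["BS", "PhD"],
}

# Invert the mapping so each level points at its band.
LEVEL_TO_BAND = {
    level: band
    for band, levels in READING_BANDS.items()
    for level in levels
}


def reading_band(record: dict) -> Optional[str]:
    """Return the band for a record, or None if the level is missing or unrecognized."""
    # Assumption: `reading_level` is a string field, as in the card's examples.
    return LEVEL_TO_BAND.get(record.get("reading_level"))
```

With the dataset loaded as shown in the loading example, something like `dataset["train"].map(lambda x: {"reading_band": reading_band(x)})` should attach the band as an extra column, after which a single filter on that column can replace the four separate ones.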
Provider:
mjbommar