mjbommar/opengloss-v1.3-dictionary
Source: Hugging Face. Updated 2026-04-12; indexed 2026-04-26.
Download link: https://hf-mirror.com/datasets/mjbommar/opengloss-v1.3-dictionary
Dataset description:
---
license: cc-by-4.0
task_categories:
- text-generation
- question-answering
- text-classification
- feature-extraction
language:
- en
tags:
- dictionary
- lexicon
- wordnet
- semantic-network
- knowledge-graph
- encyclopedic
- etymology
- synthetic
- education
size_categories:
- 100K<n<1M
---
# OpenGloss Dictionary v1.3 (Word-Level)
## Dataset Summary
**OpenGloss** is a synthetic encyclopedic dictionary and semantic knowledge graph for English
that integrates lexicographic definitions, encyclopedic context, etymological histories,
and semantic relationships in a unified resource.
This dataset provides the **word-level view**, in which each record represents one lexeme (a word or multi-word expression).
### Key Statistics
- **205,988 lexemes**
- **8,479,875 semantic edges** across 12 relationship types (synonyms, antonyms, hypernyms, hyponyms, collocations, inflections, derivations, etymology parents, cognates)
- **205,983 entries** with encyclopedic content (100.0% coverage)
- **194,420 entries** with etymology (94.4% coverage)
- **149,734 entries** with Wikipedia frequency data (72.7% coverage)
- **100% reading level coverage** (K through PhD scale)
- **100% domain tag coverage** (10+ subject domain categories)
- **Average 2.75 senses per lexeme**
- **Average 41.2 edges per lexeme**
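
The percentages and the per-lexeme edge average above follow directly from the raw counts. The sketch below recomputes them as a quick sanity check; the variable and function names are illustrative and are not part of the dataset or any library.

```python
# Counts copied from the Key Statistics list.
TOTAL_LEXEMES = 205_988
TOTAL_EDGES = 8_479_875
COVERED = {
    "encyclopedic content": 205_983,
    "etymology": 194_420,
    "Wikipedia frequency": 149_734,
}


def coverage_pct(count: int, total: int = TOTAL_LEXEMES) -> float:
    """Return coverage as a percentage rounded to one decimal place."""
    return round(100 * count / total, 1)


coverage = {name: coverage_pct(n) for name, n in COVERED.items()}
edges_per_lexeme = round(TOTAL_EDGES / TOTAL_LEXEMES, 1)

for name, pct in coverage.items():
    print(f"{name}: {pct}%")
print(f"average edges per lexeme: {edges_per_lexeme}")
```

Note that the encyclopedic figure is 205,983 of 205,988 lexemes, which rounds to 100.0% at one decimal place even though five entries lack encyclopedic content.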
### What's New in v1.3?
Compared to OpenGloss v1.2:
1. **Expanded lexicon coverage**: more lexeme records and more definition-level records in the base dictionary exports
2. **Hard negative pairs dataset**: a new calibration-oriented dataset for embedding training and score separation
3. **Larger companion datasets**: expanded query examples, contrastive examples, and encyclopedia variants in the release family
4. **Gap-driven coverage expansion**: broader coverage of geography, history, civics, and related domains that were thin in earlier releases
5. **Unified release family**: dictionary, definitions, query, contrastive, encyclopedia, and hard-negative datasets aligned under one version
### POS Distribution
| Part of Speech | Count |
|----------------|-------|
| noun | 165,258 |
| adjective | 65,477 |
| verb | 39,532 |
| adverb | 7,121 |
| determiner | 1,511 |
| preposition | 1,237 |
| interjection | 974 |
| pronoun | 397 |
| conjunction | 251 |
| particle | 19 |
| proper noun | 13 |
| numeral | 5 |
| proper_noun | 4 |
| prefix | 2 |
| suffix | 1 |
| abbreviation | 1 |
| adjetivo | 1 |
| sustantivo | 1 |
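
The table reports labels as they occur in the data, so it includes unnormalized variants (`proper noun` vs `proper_noun`) and two Spanish-language labels (`adjetivo`, `sustantivo`). If a downstream task needs a closed label set, one option is to fold these variants together before counting. The alias map below is a hypothetical choice made for illustration, not something shipped with the dataset, and the sample counts are a subset of the table.

```python
from collections import Counter

# Hypothetical alias map (an assumption): folds the variant labels visible
# in the POS table into one canonical form. Extend it as needed.
POS_ALIASES = {
    "proper_noun": "proper noun",
    "adjetivo": "adjective",
    "sustantivo": "noun",
}


def normalize_pos(label: str) -> str:
    """Lowercase and trim a POS label, then apply the alias map."""
    cleaned = label.strip().lower()
    return POS_ALIASES.get(cleaned, cleaned)


def merge_pos_counts(counts: dict) -> Counter:
    """Sum counts for labels that normalize to the same canonical form."""
    merged = Counter()
    for label, n in counts.items():
        merged[normalize_pos(label)] += n
    return merged


# A subset of rows copied from the POS table.
table_counts = {
    "noun": 165_258,
    "adjective": 65_477,
    "proper noun": 13,
    "proper_noun": 4,
    "adjetivo": 1,
    "sustantivo": 1,
}
merged = merge_pos_counts(table_counts)
print(merged.most_common())
```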
### Edge Type Distribution
| Relationship Type | Count |
|-------------------|-------|
| synonym | 1,651,142 |
| collocation | 1,453,581 |
| hyponym | 1,317,685 |
| hypernym | 1,109,943 |
| antonym | 1,036,163 |
| etymology_parent | 882,355 |
| inflection | 378,445 |
| derivation_noun | 279,502 |
| derivation_adjective | 163,037 |
| derivation_verb | 119,944 |
| derivation_adverb | 68,106 |
| cognate | 19,972 |
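
As a consistency check, the per-type counts in this table add up to the 8,479,875 edges reported under Key Statistics. The sketch below verifies the sum and prints each type's share; grouping the four `derivation_*` types into one figure is an illustrative choice, not a category defined by the dataset.

```python
# Counts copied from the Edge Type Distribution table.
EDGE_COUNTS = {
    "synonym": 1_651_142,
    "collocation": 1_453_581,
    "hyponym": 1_317_685,
    "hypernym": 1_109_943,
    "antonym": 1_036_163,
    "etymology_parent": 882_355,
    "inflection": 378_445,
    "derivation_noun": 279_502,
    "derivation_adjective": 163_037,
    "derivation_verb": 119_944,
    "derivation_adverb": 68_106,
    "cognate": 19_972,
}

total_edges = sum(EDGE_COUNTS.values())

# Illustrative grouping: all derivation_* edge types taken together.
derivation_edges = sum(
    n for name, n in EDGE_COUNTS.items() if name.startswith("derivation_")
)

for name, n in EDGE_COUNTS.items():
    print(f"{name:<22}{n:>10,}  {100 * n / total_edges:5.1f}%")
print(f"total: {total_edges:,}  (derivations combined: {derivation_edges:,})")
```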
## Loading the Dataset
```python
from datasets import load_dataset
# Load the full dataset
dataset = load_dataset("mjbommar/opengloss-v1.3-dictionary")
# Access records
for record in dataset["train"]:
    print(f"Word: {record['word']}")
    print(f"Senses: {record['total_senses']}")
    print(f"Edges: {record['total_edges']}\n")
```
## Core Fields & Usage Examples
### Wikipedia Frequency Data
Filter by word importance using frequency data:
```python
# Get high-frequency words (top 10,000)
common_words = dataset["train"].filter(
    lambda x: x["wiki_frequency_rank"] is not None and x["wiki_frequency_rank"] <= 10000
)
# Sort by frequency
sorted_by_freq = dataset["train"].sort("wiki_frequency", reverse=True)
```
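
Because roughly 27% of entries have no Wikipedia frequency data, any code that reads `wiki_frequency_rank` has to handle `None`. The helper below is a minimal sketch that buckets a rank into coarse bands; the band boundaries and the function name `frequency_band` are assumptions made for illustration, not values defined by the dataset.

```python
from typing import Optional

# Hypothetical band boundaries (an assumption); adjust to the application.
BAND_BOUNDS = (1_000, 10_000, 100_000)


def frequency_band(rank: Optional[int]) -> str:
    """Map a frequency rank to a coarse band, treating None as unranked."""
    if rank is None:
        return "unranked"
    for bound in BAND_BOUNDS:
        if rank <= bound:
            return f"top {bound:,}"
    return f"beyond top {BAND_BOUNDS[-1]:,}"


# In-memory records that mimic only the field used here.
sample = [
    {"word": "the", "wiki_frequency_rank": 1},
    {"word": "lexeme", "wiki_frequency_rank": 48_213},
    {"word": "zzxjoanw", "wiki_frequency_rank": None},
]
for record in sample:
    print(record["word"], "->", frequency_band(record["wiki_frequency_rank"]))
```

With the real dataset, the same function could be applied through `dataset["train"].map(lambda x: {"frequency_band": frequency_band(x["wiki_frequency_rank"])})`, which adds the band as a new column.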
### Reading Levels
Filter vocabulary by grade level for educational applications:
```python
# Elementary (K-5)
elementary = dataset["train"].filter(lambda x: x["reading_level"] in ["K", "1", "2", "3", "4", "5"])
# Middle school (6-8)
middle_school = dataset["train"].filter(lambda x: x["reading_level"] in ["6", "7", "8"])
# High school (9-12)
high_school = dataset["train"].filter(lambda x: x["reading_level"] in ["9", "10", "11", "12"])
# Advanced (BS/PhD)
advanced = dataset["train"].filter(lambda x: x["reading_level"] in ["BS", "PhD"])
```
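
Listing every grade in each filter gets repetitive when the cut-off varies. One alternative is to map the labels to an ordinal index and compare positions. The sketch below assumes the label set is exactly `K`, `1` through `12`, `BS`, and `PhD`, as used in the filters above; if the data contains other labels, they are treated as unknown and excluded. The helper name `at_or_below` is hypothetical.

```python
# Assumed ordering of reading-level labels, taken from the filters above.
LEVEL_ORDER = ["K"] + [str(grade) for grade in range(1, 13)] + ["BS", "PhD"]
LEVEL_INDEX = {level: i for i, level in enumerate(LEVEL_ORDER)}


def at_or_below(record: dict, max_level: str) -> bool:
    """True if the record's reading level is known and <= max_level."""
    idx = LEVEL_INDEX.get(record.get("reading_level"))
    return idx is not None and idx <= LEVEL_INDEX[max_level]


# In-memory records that mimic only the field used here.
sample = [
    {"word": "cat", "reading_level": "K"},
    {"word": "photosynthesis", "reading_level": "7"},
    {"word": "hermeneutics", "reading_level": "PhD"},
    {"word": "mystery", "reading_level": None},
]
up_to_grade_8 = [r["word"] for r in sample if at_or_below(r, "8")]
print(up_to_grade_8)
```

With the real dataset, the predicate can be passed to `dataset["train"].filter(lambda x: at_or_below(x, "8"))`.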
### Domain Tags
Filter by subject area for content-specific applications:
```python
# Science vocabulary
science_words = dataset["train"].filter(
    lambda x: any("science" in tag or "life-sciences" in tag for tag in x.get("tags", []))
)
# Technology vocabulary
tech_words = dataset["train"].filter(
    lambda x: any("technology" in tag for tag in x.get("tags", []))
)
# Social studies
social_studies = dataset["train"].filter(
    lambda x: any(
        tag.startswith("domain:history") or tag.startswith("domain:society")
        for tag in x.get("tags", [])
    )
)
```
```
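
Before filtering on a tag, it helps to see which tags occur and how often. The sketch below tallies tags that carry a given prefix. It assumes domain tags use the `domain:` prefix seen in the social studies filter above; the prefix is a parameter because the other filters match on substrings, so the exact tag format may differ. The function name `domain_counts` and the sample tags are illustrative.

```python
from collections import Counter
from typing import Iterable


def domain_counts(records: Iterable[dict], prefix: str = "domain:") -> Counter:
    """Count records per tag that starts with prefix (prefix stripped)."""
    counts = Counter()
    for record in records:
        # Treat a missing or None tag list as empty.
        for tag in record.get("tags") or []:
            if tag.startswith(prefix):
                counts[tag[len(prefix):]] += 1
    return counts


# In-memory records that mimic only the field used here.
sample = [
    {"word": "treaty", "tags": ["domain:history", "domain:society"]},
    {"word": "senate", "tags": ["domain:society"]},
    {"word": "enzyme", "tags": ["domain:life-sciences"]},
    {"word": "untagged", "tags": None},
]
counts = domain_counts(sample)
print(counts.most_common())
```

With the real dataset, `domain_counts(dataset["train"])` iterates every record once, which may take a while on roughly 206K entries.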
### Etymology Segments
Access structured etymology with language trail:
```python
# Words with detailed etymology
words_with_etymology = dataset["train"].filter(
    lambda x: len(x.get("etymology_segments") or []) > 0
)
# Find words from specific language origins
latin_origin = dataset["train"].filter(
    lambda x: any(
        (seg.get("language") or "").lower() == "latin"
        for seg in x.get("etymology_segments") or []
    )
)
```
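
The "language trail" can be read off a record by collecting the `language` value of each segment in order. The sketch below does that, skipping empty values and collapsing consecutive repeats. It assumes segments are stored in the order they should be read and that `language` is the only field needed; the sample record and the helper name `language_trail` are made up for illustration.

```python
from typing import List


def language_trail(record: dict) -> List[str]:
    """Return the ordered list of languages in a record's etymology."""
    trail: List[str] = []
    # Treat a missing or None segment list as empty.
    for segment in record.get("etymology_segments") or []:
        language = (segment.get("language") or "").strip()
        # Skip blanks and collapse consecutive duplicates.
        if language and (not trail or trail[-1] != language):
            trail.append(language)
    return trail


# Hypothetical record that mimics only the fields used here.
sample = {
    "word": "example",
    "etymology_segments": [
        {"language": "Latin"},
        {"language": "Latin"},
        {"language": "Old French"},
        {"language": None},
        {"language": "Middle English"},
    ],
}
print(" > ".join(language_trail(sample)))
```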
## Citation
If you use OpenGloss in your research, please cite:
```bibtex
@misc{bommarito2025opengloss,
  title={OpenGloss: A Synthetic Encyclopedic Dictionary and Semantic Knowledge Graph},
  author={Bommarito, II, Michael J.},
  year={2025},
  url={https://huggingface.co/datasets/mjbommar/opengloss-v1.3-dictionary},
  note={Dataset available under CC-BY 4.0}
}
```
## License
This dataset is released under **Creative Commons Attribution 4.0 International (CC-BY 4.0)**.
## Version History
- **v1.3** (2026-04): ~206K entries with gap-fill expansion, regenerated lexical explanations with relation data, full companion dataset coverage for top 50K entries, multiple encyclopedia variants
- **v1.2** (2026-04): Expanded release with larger companion training datasets and hard-negative calibration pairs
- **v1.1** (2025-11): Release with structured morphology, etymology segments, and frequency data
- **v1.0** (2025-01): Initial release
## Acknowledgments
This dataset was generated using:
- [pydantic-ai](https://github.com/pydantic/pydantic-ai) for structured LLM generation
- OpenAI GPT models for content generation
- Anthropic Claude for quality assurance
---
*Generated from the OpenGloss v1.3 dataset.*
Provider: mjbommar



