mjbommar/opengloss-v1.2-dictionary
Source: Hugging Face. Updated 2026-04-08; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/mjbommar/opengloss-v1.2-dictionary
---
license: cc-by-4.0
task_categories:
- text-generation
- question-answering
- text-classification
- feature-extraction
language:
- en
tags:
- dictionary
- lexicon
- wordnet
- semantic-network
- knowledge-graph
- encyclopedic
- etymology
- synthetic
- education
size_categories:
- 100K<n<1M
---
# OpenGloss Dictionary v1.2 (Word-Level)
## Dataset Summary
**OpenGloss** is a synthetic encyclopedic dictionary and semantic knowledge graph for English
that integrates lexicographic definitions, encyclopedic context, etymological histories,
and semantic relationships in a unified resource.
This dataset provides the **word-level view**, in which each record represents one lexeme (a word or multi-word expression).
### Key Statistics
- **162,314 lexemes**
- **7,798,653 semantic edges** (synonyms, antonyms, hypernyms, hyponyms, collocations, inflections, derivations, etymology parents, and cognates)
- **162,314 entries** with encyclopedic content (100.0% coverage)
- **150,746 entries** with etymology (92.9% coverage)
- **149,734 entries** with Wikipedia frequency data (92.2% coverage)
- **100% reading level coverage** (K through PhD scale)
- **100% domain tag coverage** (10+ subject domain categories)
- **Average 3.19 senses per lexeme**
- **Average 48.0 edges per lexeme**
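
The coverage percentages and the edges-per-lexeme average can be re-derived from the raw counts in the list above. The following is a minimal sketch; the constant names and the `coverage_percent` helper are illustrative and not part of the dataset.

```python
# Counts copied from the Key Statistics list above.
TOTAL_LEXEMES = 162_314
TOTAL_EDGES = 7_798_653
WITH_ETYMOLOGY = 150_746
WITH_WIKI_FREQUENCY = 149_734


def coverage_percent(count: int, total: int) -> float:
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


etymology_coverage = coverage_percent(WITH_ETYMOLOGY, TOTAL_LEXEMES)
frequency_coverage = coverage_percent(WITH_WIKI_FREQUENCY, TOTAL_LEXEMES)
edges_per_lexeme = round(TOTAL_EDGES / TOTAL_LEXEMES, 1)

print(etymology_coverage, frequency_coverage, edges_per_lexeme)
```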
### What's New in v1.2?
Compared to OpenGloss v1.1:
1. **Expanded lexicon coverage**: more lexeme records and more definition-level records in the base dictionary exports
2. **Hard negative pairs dataset**: a new calibration-oriented dataset for embedding training and score separation
3. **Larger companion datasets**: expanded query examples, contrastive examples, and encyclopedia variants in the release family
4. **Gap-driven coverage expansion**: broader coverage of geography, history, civics, and related domains that were previously weakly covered
5. **Unified release family**: dictionary, definitions, query, contrastive, encyclopedia, and hard-negative datasets aligned under one version
### POS Distribution
| Part of Speech | Count |
|----------------|-------|
| noun | 134,241 |
| adjective | 55,955 |
| verb | 36,423 |
| adverb | 5,583 |
| determiner | 1,510 |
| preposition | 1,234 |
| interjection | 941 |
| pronoun | 395 |
| conjunction | 249 |
| particle | 18 |
| proper noun | 12 |
| numeral | 5 |
| proper_noun | 4 |
| prefix | 2 |
| suffix | 1 |
| adjetivo | 1 |
| sustantivo | 1 |
| abbreviation | 1 |
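
The table lists the raw labels, including variants such as `proper noun` / `proper_noun` and the Spanish labels `adjetivo` and `sustantivo`. If merged counts are needed, one possible approach is a small alias map. This is a sketch under assumptions: the `POS_ALIASES` mapping and the `normalize_pos` helper are hypothetical, and treating the Spanish labels as `adjective` and `noun` is an editorial choice, not something the dataset specifies.

```python
from collections import Counter

# Raw counts copied from the POS Distribution table above.
RAW_POS_COUNTS = {
    "noun": 134_241, "adjective": 55_955, "verb": 36_423, "adverb": 5_583,
    "determiner": 1_510, "preposition": 1_234, "interjection": 941,
    "pronoun": 395, "conjunction": 249, "particle": 18, "proper noun": 12,
    "numeral": 5, "proper_noun": 4, "prefix": 2, "suffix": 1,
    "adjetivo": 1, "sustantivo": 1, "abbreviation": 1,
}

# Hypothetical alias map: variant label -> canonical label (an assumption).
POS_ALIASES = {
    "proper_noun": "proper noun",
    "adjetivo": "adjective",
    "sustantivo": "noun",
}


def normalize_pos(label: str) -> str:
    """Lower-case a label and fold known variants into one canonical form."""
    cleaned = label.strip().lower()
    return POS_ALIASES.get(cleaned, cleaned)


normalized_counts = Counter()
for label, count in RAW_POS_COUNTS.items():
    normalized_counts[normalize_pos(label)] += count

for label, count in normalized_counts.most_common(5):
    print(f"{label}: {count:,}")
```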
### Edge Type Distribution
| Relationship Type | Count |
|-------------------|-------|
| synonym | 1,512,031 |
| hyponym | 1,285,609 |
| collocation | 1,273,658 |
| hypernym | 1,018,466 |
| antonym | 1,007,062 |
| etymology_parent | 697,921 |
| inflection | 353,345 |
| derivation_noun | 279,502 |
| derivation_adjective | 163,037 |
| derivation_verb | 119,944 |
| derivation_adverb | 68,106 |
| cognate | 19,972 |
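
As a quick consistency check, the per-type counts in the table can be summed and compared with the total of 7,798,653 edges reported under Key Statistics. The sketch below also derives each type's share of all edges; the variable names are illustrative.

```python
# Counts copied from the Edge Type Distribution table above.
EDGE_COUNTS = {
    "synonym": 1_512_031,
    "hyponym": 1_285_609,
    "collocation": 1_273_658,
    "hypernym": 1_018_466,
    "antonym": 1_007_062,
    "etymology_parent": 697_921,
    "inflection": 353_345,
    "derivation_noun": 279_502,
    "derivation_adjective": 163_037,
    "derivation_verb": 119_944,
    "derivation_adverb": 68_106,
    "cognate": 19_972,
}

total_edges = sum(EDGE_COUNTS.values())

# Share of all edges per relationship type, in percent.
edge_shares = {
    name: round(100 * count / total_edges, 1)
    for name, count in EDGE_COUNTS.items()
}

print(f"total edges: {total_edges:,}")
for name, share in edge_shares.items():
    print(f"{name}: {share}%")
```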
## Loading the Dataset
```python
from datasets import load_dataset
# Load the full dataset
dataset = load_dataset("mjbommar/opengloss-v1.2-dictionary")
# Access records
for record in dataset["train"]:
    print(f"Word: {record['word']}")
    print(f"Senses: {record['total_senses']}")
    print(f"Edges: {record['total_edges']}\n")
```
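The same three fields can be aggregated over any iterable of records, for example to reproduce per-lexeme averages on a subset. This is a rough sketch: the `summarize` helper and the sample values are illustrative assumptions, and only the field names `word`, `total_senses`, and `total_edges` come from the loading example above.

```python
from statistics import mean
from typing import Any, Iterable, Mapping


def summarize(records: Iterable[Mapping[str, Any]]) -> dict:
    """Compute the record count and average senses/edges per lexeme."""
    rows = list(records)
    return {
        "lexemes": len(rows),
        "avg_senses": mean(row["total_senses"] for row in rows),
        "avg_edges": mean(row["total_edges"] for row in rows),
    }


# Invented sample records; real records would come from dataset["train"].
sample_records = [
    {"word": "run", "total_senses": 6, "total_edges": 90},
    {"word": "photosynthesis", "total_senses": 2, "total_edges": 30},
]

summary = summarize(sample_records)
print(summary)
```

Passing `dataset["train"]` instead of `sample_records` should work the same way, since iterating a split yields one dictionary per record.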
## Core Fields & Usage Examples
### Wikipedia Frequency Data
Filter by word importance using frequency data:
```python
# Get high-frequency words (top 10,000)
common_words = dataset["train"].filter(
    lambda x: x["wiki_frequency_rank"] is not None and x["wiki_frequency_rank"] <= 10000
)
# Sort by frequency
sorted_by_freq = dataset["train"].sort("wiki_frequency", reverse=True)
```
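For small selections it can be convenient to pick the top-k entries by rank without sorting everything. The sketch below works on plain dictionaries and skips records without frequency data. It assumes that a lower `wiki_frequency_rank` means a more frequent word, as the top-10,000 filter above implies; `most_frequent` is a hypothetical helper and the sample ranks are invented.

```python
import heapq
from typing import Any, Iterable, Mapping


def most_frequent(records: Iterable[Mapping[str, Any]], k: int) -> list:
    """Return the k records with the smallest (best) frequency rank."""
    ranked = (r for r in records if r.get("wiki_frequency_rank") is not None)
    return heapq.nsmallest(k, ranked, key=lambda r: r["wiki_frequency_rank"])


# Invented sample records for illustration only.
sample_records = [
    {"word": "cat", "wiki_frequency_rank": 2000},
    {"word": "the", "wiki_frequency_rank": 1},
    {"word": "zyzzyva", "wiki_frequency_rank": None},
    {"word": "of", "wiki_frequency_rank": 2},
]

top_two = [r["word"] for r in most_frequent(sample_records, 2)]
print(top_two)
```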
### Reading Levels
Filter vocabulary by grade level for educational applications:
```python
# Elementary (K-5)
elementary = dataset["train"].filter(lambda x: x["reading_level"] in ["K", "1", "2", "3", "4", "5"])
# Middle school (6-8)
middle_school = dataset["train"].filter(lambda x: x["reading_level"] in ["6", "7", "8"])
# High school (9-12)
high_school = dataset["train"].filter(lambda x: x["reading_level"] in ["9", "10", "11", "12"])
# Advanced (BS/PhD)
advanced = dataset["train"].filter(lambda x: x["reading_level"] in ["BS", "PhD"])
```
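The four filters above partition the reading levels into bands. If each record needs a band label rather than a separate filtered dataset, a lookup table is one option. This is a minimal sketch: the band names and the `reading_band` helper are illustrative, and the level strings are taken from the filters above.

```python
# Level strings grouped as in the filters above; band names are assumptions.
READING_BANDS = {
    "elementary": ["K", "1", "2", "3", "4", "5"],
    "middle_school": ["6", "7", "8"],
    "high_school": ["9", "10", "11", "12"],
    "advanced": ["BS", "PhD"],
}

# Invert the grouping into a level -> band lookup.
LEVEL_TO_BAND = {
    level: band for band, levels in READING_BANDS.items() for level in levels
}


def reading_band(level) -> str:
    """Map a reading_level value to its band, or 'unknown' if unrecognized."""
    return LEVEL_TO_BAND.get(level, "unknown")


for level in ["K", "7", "11", "PhD", None]:
    print(level, "->", reading_band(level))
```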
### Domain Tags
Filter by subject area for content-specific applications:
```python
# Science vocabulary
science_words = dataset["train"].filter(
    lambda x: any("science" in tag or "life-sciences" in tag for tag in x.get("tags", []))
)
# Technology vocabulary
tech_words = dataset["train"].filter(
    lambda x: any("technology" in tag for tag in x.get("tags", []))
)
# Social studies
social_studies = dataset["train"].filter(
    lambda x: any(tag.startswith("domain:history") or tag.startswith("domain:society")
                  for tag in x.get("tags", []))
)
```
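Before filtering, it can help to see which domain tags actually occur. The sketch below counts tags that start with a given prefix. It assumes the `domain:` prefix seen in the social studies filter above; `domain_tag_counts` is a hypothetical helper and the sample records are invented.

```python
from collections import Counter
from typing import Any, Iterable, Mapping


def domain_tag_counts(
    records: Iterable[Mapping[str, Any]], prefix: str = "domain:"
) -> Counter:
    """Count how many records carry each tag that starts with prefix."""
    counts = Counter()
    for record in records:
        # Treat a missing or null tag list as empty.
        for tag in record.get("tags") or []:
            if tag.startswith(prefix):
                counts[tag] += 1
    return counts


# Invented sample records for illustration only.
sample_records = [
    {"word": "treaty", "tags": ["domain:history", "domain:society"]},
    {"word": "empire", "tags": ["domain:history"]},
    {"word": "the", "tags": None},
]

counts = domain_tag_counts(sample_records)
print(counts.most_common())
```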
### Etymology Segments
Access structured etymology with language trail:
```python
# Words with detailed etymology
words_with_etymology = dataset["train"].filter(
    lambda x: len(x.get("etymology_segments") or []) > 0  # a null list counts as empty
)
# Find words from specific language origins
latin_origin = dataset["train"].filter(
    lambda x: any((seg.get("language") or "").lower() == "latin"
                  for seg in (x.get("etymology_segments") or []))
)
```
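The language trail of a single record can be read off its segments in order. The sketch below assumes each segment is a dictionary with a `language` key, as in the Latin filter above; the segment order, the `language_trail` helper, and the sample record are assumptions for illustration rather than documented behavior.

```python
from typing import Any, Mapping


def language_trail(record: Mapping[str, Any]) -> list:
    """Return segment languages in order, skipping blanks and adjacent repeats."""
    trail = []
    for segment in record.get("etymology_segments") or []:
        language = (segment.get("language") or "").strip()
        if language and (not trail or trail[-1] != language):
            trail.append(language)
    return trail


# Invented sample record; the segment contents are not taken from the dataset.
sample_record = {
    "word": "example",
    "etymology_segments": [
        {"language": "Latin"},
        {"language": "Latin"},
        {"language": "Old French"},
        {"language": None},
        {"language": "Middle English"},
    ],
}

trail = language_trail(sample_record)
print(" > ".join(trail))
```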
## Citation
If you use OpenGloss in your research, please cite:
```bibtex
@misc{bommarito2025opengloss,
title={OpenGloss: A Synthetic Encyclopedic Dictionary and Semantic Knowledge Graph},
author={Bommarito, Michael J., II},
year={2025},
url={https://huggingface.co/datasets/mjbommar/opengloss-v1.2-dictionary},
note={Dataset available under CC-BY 4.0}
}
```
## License
This dataset is released under **Creative Commons Attribution 4.0 International (CC-BY 4.0)**.
## Version History
- **v1.2** (2026-04): Expanded release with larger companion training datasets and hard-negative calibration pairs
- **v1.1** (2025-11): Release with structured morphology, etymology segments, and frequency data
- **v1.0** (2025-01): Initial release
## Acknowledgments
This dataset was generated using:
- [pydantic-ai](https://github.com/pydantic/pydantic-ai) for structured LLM generation
- OpenAI GPT models for content generation
- Anthropic Claude for quality assurance
---
*Generated from the OpenGloss v1.2 dataset.*
Provided by: mjbommar