mjbommar/opengloss-v1.2-encyclopedia-variants
Source: Hugging Face · Updated 2026-04-08 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/mjbommar/opengloss-v1.2-encyclopedia-variants
Resource description:
---
license: cc-by-4.0
task_categories:
- text-generation
- summarization
language:
- en
tags:
- style-transfer
- paraphrase
- text-simplification
- encyclopedia
- lexicon
- synthetic
- education
- opengloss
- writing-style
size_categories:
- 10K<n<100K
---
# OpenGloss Encyclopedia Variants v1.2
## Dataset Summary
**OpenGloss Encyclopedia Variants** is a synthetic dataset of vocabulary encyclopedia entries
rewritten in multiple writing styles. Each record contains an academic base entry alongside
a variant rewritten for a specific audience, tone, and content structure.
This dataset supports style transfer, text simplification, paraphrase generation, and
audience-adaptive content creation. It is derived from the
[OpenGloss](https://huggingface.co/datasets/mjbommar/opengloss-v1.2-definitions)
encyclopedic dictionary.
### Key Statistics
- **21,487 encyclopedia variants**
- **21,388 unique words/phrases**
- **21,485 unique lexemes**
- **10 writing voices** × **9 tones** × **8 framings** × **3 lengths**
- **Average base entry: 260 words** (1899 chars)
- **Average variant: 228 words** (1510 chars)
### Style Dimensions
Each variant is characterized by four style dimensions:
| Dimension | Values | Description |
|-----------|--------|-------------|
| **Voice** | 10 | Writing persona (teen_explainer, tech_writer, storyteller, etc.) |
| **Tone** | 9 | Emotional register (myth_buster, narrative_hook, q_and_a, etc.) |
| **Framing** | 8 | Content structure (faq, origin_story, comparisons, etc.) |
| **Length** | 3 | Content length (short, medium, long) |
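
The four dimensions define a grid of possible style combinations. As a rough sanity check, the sketch below uses only the counts from the table above and the record total from Key Statistics (it does not touch the dataset itself) to compute the grid size and the average number of records per combination if styles were spread evenly.

```python
import math

# Number of distinct values per style dimension, taken from the table above.
dimension_sizes = {"voice": 10, "tone": 9, "framing": 8, "length": 3}

# Total number of possible (voice, tone, framing, length) combinations.
total_combinations = math.prod(dimension_sizes.values())

# Total record count from the Key Statistics section.
total_records = 21_487

# Average records per combination under an even spread (an idealization).
average_per_combination = total_records / total_combinations

print(f"Possible style combinations: {total_combinations}")
print(f"Average records per combination: {average_per_combination:.1f}")
```

The per-dimension distributions are skewed, so actual per-combination counts will differ from this average.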
### POS Distribution
| Part of Speech | Count |
|----------------|-------|
| noun | 18,533 |
| adjective | 1,734 |
| verb | 947 |
| adverb | 205 |
| determiner | 26 |
| preposition | 19 |
| interjection | 12 |
| pronoun | 7 |
| conjunction | 4 |
### Voice Distribution
| Voice | Count |
|-------|-------|
| teen_explainer | 8,047 |
| faq_writer | 1,551 |
| podcast_host | 1,515 |
| policy_brief | 1,511 |
| museum_docent | 1,497 |
| friendly_teacher | 1,484 |
| science_journalist | 1,483 |
| storyteller | 1,481 |
| coach | 1,461 |
| tech_writer | 1,457 |
### Tone Distribution
| Tone | Count |
|------|-------|
| punchy_bullet_digest | 8,136 |
| myth_buster | 1,736 |
| plain_neutral | 1,706 |
| informal_conversational | 1,690 |
| narrative_hook | 1,659 |
| analogy_forward | 1,655 |
| practical_howto | 1,648 |
| q_and_a | 1,630 |
| step_by_step | 1,627 |
### Framing Distribution
| Framing | Count |
|---------|-------|
| myth_vs_fact | 8,466 |
| origin_story | 1,906 |
| comparisons | 1,899 |
| use_cases | 1,893 |
| why_it_matters | 1,874 |
| overview | 1,838 |
| how_it_works | 1,827 |
| faq | 1,784 |
### Length Distribution
| Length | Count |
|--------|-------|
| long | 11,453 |
| medium | 5,050 |
| short | 4,984 |
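
The distribution tables above can in principle be recomputed from the records themselves. The following is a minimal sketch, assuming each record exposes the flat `voice`, `tone`, `framing`, and `length_label` fields shown in the example record. `distribution` is a hypothetical helper, and the three sample records are made up for illustration.

```python
from collections import Counter

def distribution(records, field):
    """Return (value, count, share) tuples for one field, most common first."""
    counts = Counter(record[field] for record in records)
    total = sum(counts.values())
    return [(value, count, count / total) for value, count in counts.most_common()]

# Tiny made-up sample; in practice pass the loaded train split instead.
sample_records = [
    {"voice": "teen_explainer", "length_label": "long"},
    {"voice": "teen_explainer", "length_label": "short"},
    {"voice": "coach", "length_label": "long"},
]

for value, count, share in distribution(sample_records, "voice"):
    print(f"{value}: {count} ({share:.1%})")
```

Each table has one dominant value (for example, `teen_explainer` covers 8,047 of 21,487 records, roughly 37%), so stratifying or reweighting by style may be worth considering before training.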
## Loading the Dataset
```python
from datasets import load_dataset
# Load the full dataset
dataset = load_dataset("mjbommar/opengloss-v1.2-encyclopedia-variants")
# Access records
for record in dataset["train"]:
    print(f"Word: {record['word']}")
    print(f"Voice: {record['voice']} | Tone: {record['tone']}")
    print(f"Base: {record['base_entry'][:200]}...")
    print(f"Variant: {record['variant_text'][:200]}...\n")
```
## Example Record
```python
{
    "id": "orange_twist_storyteller_narrative_hook_faq_short",
    "word": "orange twist",
    "lexeme_id": "orange_twist",
    "pos": "noun",
    "base_entry": "### **Orange twist**\n\nAn **orange twist** is a slender spiral of orange peel...",
    "variant_text": "What is an orange twist, really? Not just a curl of peel, but a tiny bit of theater...",
    "style": {
        "voice": "storyteller",
        "tone": "narrative_hook",
        "framing": "faq",
        "length_label": "short"
    },
    "voice": "storyteller",
    "tone": "narrative_hook",
    "framing": "faq",
    "length_label": "short"
}
```
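
In the example record, the four style values appear twice: nested under `style` and again as top-level fields. Below is a minimal sketch of a consistency check, assuming every record follows that layout. `style_fields_match` is a hypothetical helper, not part of any dataset tooling.

```python
STYLE_KEYS = ("voice", "tone", "framing", "length_label")

def style_fields_match(record):
    """Return True if the nested `style` dict agrees with the top-level fields."""
    nested = record.get("style") or {}
    return all(nested.get(key) == record.get(key) for key in STYLE_KEYS)

# Style-related fields copied from the example record above.
example = {
    "id": "orange_twist_storyteller_narrative_hook_faq_short",
    "style": {
        "voice": "storyteller",
        "tone": "narrative_hook",
        "framing": "faq",
        "length_label": "short",
    },
    "voice": "storyteller",
    "tone": "narrative_hook",
    "framing": "faq",
    "length_label": "short",
}

print(style_fields_match(example))
```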
## Use Cases
### Style Transfer Training
Train models to rewrite text in different styles:
```python
# Create style-conditioned pairs
for record in dataset["train"]:
    source = record["base_entry"]
    target = record["variant_text"]
    style = f"[VOICE:{record['voice']}] [TONE:{record['tone']}]"
    # Use (style, source, target) for seq2seq training with a style prefix
```
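
One way to turn the loop above into concrete training examples is to prepend the style tags to the source text. This is a sketch under assumptions: the tag format extends the `[VOICE:...] [TONE:...]` prefix with framing and length tags, and `to_seq2seq_example` and the `input`/`target` keys are hypothetical names, not a format required by any particular trainer.

```python
def to_seq2seq_example(record):
    """Build one style-conditioned training pair from a dataset record."""
    # Tag format is an assumption extending the [VOICE:...] [TONE:...] prefix.
    style_prefix = (
        f"[VOICE:{record['voice']}] [TONE:{record['tone']}] "
        f"[FRAMING:{record['framing']}] [LENGTH:{record['length_label']}]"
    )
    return {
        "input": f"{style_prefix} {record['base_entry']}",
        "target": record["variant_text"],
    }

# Fields excerpted from the example record (texts are truncated there too).
record = {
    "voice": "storyteller",
    "tone": "narrative_hook",
    "framing": "faq",
    "length_label": "short",
    "base_entry": "An **orange twist** is a slender spiral of orange peel...",
    "variant_text": "What is an orange twist, really? Not just a curl of peel, but a tiny bit of theater...",
}

example = to_seq2seq_example(record)
print(example["input"])
```

A function of this shape can be applied to the whole split with `dataset["train"].map(to_seq2seq_example)`.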
### Text Simplification
Filter for simplified variants:
```python
# Get accessible versions
simplified = dataset["train"].filter(
    lambda x: x["voice"] in ["teen_explainer", "friendly_teacher"]
    and x["length_label"] == "short"
)
```
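
The filter above selects by style label only. To check how much shorter a variant actually is than its base entry, one option is a word-count ratio. This is a rough sketch, assuming whitespace-separated tokens are an adequate proxy for words. `length_ratio` is a hypothetical helper.

```python
def length_ratio(record):
    """Variant word count divided by base entry word count (whitespace split)."""
    base_words = len(record["base_entry"].split())
    variant_words = len(record["variant_text"].split())
    # Guard against an empty base entry to avoid division by zero.
    return variant_words / base_words if base_words else float("inf")

# Made-up record: an 8-word base entry and a 4-word variant.
record = {
    "base_entry": "one two three four five six seven eight",
    "variant_text": "one two three four",
}

print(length_ratio(record))
```

For reference, the averages in Key Statistics (228 and 260 words) correspond to a ratio of about 0.88; note that this is a ratio of averages, not an average of per-record ratios.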
### Audience-Adaptive Generation
Select variants written for a specific audience, for example as training data for audience-adaptive generation:
```python
# Get policy-oriented content
policy_variants = dataset["train"].filter(
    lambda x: x["voice"] == "policy_brief"
)
```
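
Because `teen_explainer` accounts for 8,047 of the 21,487 records, a model trained on all voices at once may over-represent that audience. One possible mitigation is to cap the number of records per voice. The following is a sketch, not a recommended recipe: `cap_per_group` and the cap value are hypothetical, and the sample records are made up.

```python
import random
from collections import defaultdict

def cap_per_group(records, field, cap, seed=0):
    """Keep at most `cap` randomly chosen records for each value of `field`."""
    groups = defaultdict(list)
    for record in records:
        groups[record[field]].append(record)
    rng = random.Random(seed)  # fixed seed for reproducible sampling
    capped = []
    for group in groups.values():
        capped.extend(group if len(group) <= cap else rng.sample(group, cap))
    return capped

# Made-up sample: three teen_explainer records, one policy_brief record.
sample = [{"voice": "teen_explainer", "id": i} for i in range(3)]
sample.append({"voice": "policy_brief", "id": 3})

balanced = cap_per_group(sample, "voice", cap=1)
print(len(balanced))
```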
### Paraphrase Mining
Extract semantically equivalent pairs:
```python
# Same word, different styles = paraphrases
from collections import defaultdict

by_word = defaultdict(list)
for record in dataset["train"]:
    by_word[record["word"]].append(record)
```
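
With 21,388 unique words across 21,487 records, at most 99 words can have more than one variant, so the grouping above yields relatively few cross-style pairs. The sketch below extracts those pairs; `cross_style_pairs` is a hypothetical helper and the sample records are made up. Each record's `base_entry` and `variant_text` can also be treated as a pair on their own.

```python
from collections import defaultdict
from itertools import combinations

def cross_style_pairs(records):
    """Yield (variant_a, variant_b) text pairs for words with 2+ variants."""
    by_word = defaultdict(list)
    for record in records:
        by_word[record["word"]].append(record["variant_text"])
    for texts in by_word.values():
        # combinations() yields nothing for words with a single variant.
        yield from combinations(texts, 2)

# Made-up sample: one word with two variants, one word with a single variant.
sample = [
    {"word": "orange twist", "variant_text": "A"},
    {"word": "orange twist", "variant_text": "B"},
    {"word": "lexicon", "variant_text": "C"},
]

pairs = list(cross_style_pairs(sample))
print(pairs)
```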
## Citation
If you use this dataset in your research, please cite:
```bibtex
@misc{bommarito2025opengloss_variants,
  title={OpenGloss Encyclopedia Variants: Multi-Style Vocabulary Explanations},
  author={Bommarito, Michael J., II},
  year={2025},
  url={https://huggingface.co/datasets/mjbommar/opengloss-v1.2-encyclopedia-variants},
  note={Dataset available under CC-BY 4.0}
}
```
## License
This dataset is released under **Creative Commons Attribution 4.0 International (CC-BY 4.0)**.
## Related Datasets
- [OpenGloss v1.2 Dictionary](https://huggingface.co/datasets/mjbommar/opengloss-v1.2-dictionary) - Word-level records
- [OpenGloss v1.2 Definitions](https://huggingface.co/datasets/mjbommar/opengloss-v1.2-definitions) - Definition-level records
- [OpenGloss v1.2 Contrastive Examples](https://huggingface.co/datasets/mjbommar/opengloss-v1.2-contrastive-examples) - Semantic gradients
- [OpenGloss v1.2 Query Examples](https://huggingface.co/datasets/mjbommar/opengloss-v1.2-query-examples) - Query-side retrieval supervision
- [OpenGloss v1.2 Hard Negative Pairs](https://huggingface.co/datasets/mjbommar/opengloss-v1.2-hard-negative-pairs) - Calibration pairs for embedding training
## Acknowledgments
This dataset was generated using:
- [OpenGloss](https://huggingface.co/datasets/mjbommar/opengloss-v1.2-definitions) lexicon data
- OpenAI GPT models for style-varied generation
- [pydantic-ai](https://github.com/pydantic/pydantic-ai) for structured generation
---
*Generated from the OpenGloss v1.2 lexicon.*
Provider:
mjbommar



