
mjbommar/opengloss-v1.3-encyclopedia-variants

Source: Hugging Face · Updated 2026-04-12 · Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/mjbommar/opengloss-v1.3-encyclopedia-variants
Resource overview:
---
license: cc-by-4.0
task_categories:
- text-generation
- summarization
language:
- en
tags:
- style-transfer
- paraphrase
- text-simplification
- encyclopedia
- lexicon
- synthetic
- education
- opengloss
- writing-style
size_categories:
- 10K<n<100K
---

# OpenGloss Encyclopedia Variants v1.3

## Dataset Summary

**OpenGloss Encyclopedia Variants** is a synthetic dataset of vocabulary encyclopedia entries rewritten in multiple writing styles. Each record pairs an academic base entry with a variant rewritten for a specific audience, tone, and content structure.

The dataset supports style transfer, text simplification, paraphrase generation, and audience-adaptive content creation. It is derived from the [OpenGloss](https://huggingface.co/datasets/mjbommar/opengloss-v1.3-definitions) encyclopedic dictionary.

### Key Statistics

- **133,135 encyclopedia variants**
- **133,094 unique words/phrases**
- **133,135 unique lexemes**
- **10 writing voices** × **9 tones** × **8 framings** × **3 lengths**
- **Average base entry: 297 words** (2195 chars)
- **Average variant: 130 words** (886 chars)

### Style Dimensions

Each variant is characterized by four style dimensions:

| Dimension | Values | Description |
|-----------|--------|-------------|
| **Voice** | 10 | Writing persona (teen_explainer, tech_writer, storyteller, etc.) |
| **Tone** | 9 | Emotional register (myth_buster, narrative_hook, q_and_a, etc.) |
| **Framing** | 8 | Content structure (faq, origin_story, comparisons, etc.) |
| **Length** | 3 | Content length (short, medium, long) |

### POS Distribution

| Part of Speech | Count |
|----------------|-------|
| noun | 95,287 |
| adjective | 23,306 |
| verb | 10,862 |
| adverb | 2,959 |
| determiner | 364 |
| preposition | 166 |
| interjection | 96 |
| pronoun | 62 |
| conjunction | 30 |
| numeral | 2 |
| adjetivo | 1 |

### Voice Distribution

| Voice | Count |
|-------|-------|
| science_journalist | 133,135 |

### Tone Distribution

| Tone | Count |
|------|-------|
| plain_neutral | 133,135 |

### Framing Distribution

| Framing | Count |
|---------|-------|
| use_cases | 133,135 |

### Length Distribution

| Length | Count |
|--------|-------|
| short | 133,135 |

According to the four distribution tables above, every record in this release uses a single style combination (`science_journalist`, `plain_neutral`, `use_cases`, `short`). Filters in the use cases below that select other voices would therefore match no rows in this release.

## Loading the Dataset

```python
from datasets import load_dataset

# Load the full dataset
dataset = load_dataset("mjbommar/opengloss-v1.3-encyclopedia-variants")

# Access records
for record in dataset["train"]:
    print(f"Word: {record['word']}")
    print(f"Voice: {record['voice']} | Tone: {record['tone']}")
    print(f"Base: {record['base_entry'][:200]}...")
    print(f"Variant: {record['variant_text'][:200]}...\n")
```

## Example Record

```python
{
    "id": "orange_twist_storyteller_narrative_hook_faq_short",
    "word": "orange twist",
    "lexeme_id": "orange_twist",
    "pos": "noun",
    "base_entry": "### **Orange twist**\n\nAn **orange twist** is a slender spiral of orange peel...",
    "variant_text": "What is an orange twist, really? Not just a curl of peel, but a tiny bit of theater...",
    "style": {
        "voice": "storyteller",
        "tone": "narrative_hook",
        "framing": "faq",
        "length_label": "short"
    },
    "voice": "storyteller",
    "tone": "narrative_hook",
    "framing": "faq",
    "length_label": "short"
}
```

## Use Cases

The snippets in this section reuse the `dataset` object created in "Loading the Dataset".

### Style Transfer Training

Train models to rewrite text in different styles:

```python
# Create style-conditioned pairs
for record in dataset["train"]:
    source = record["base_entry"]
    target = record["variant_text"]
    style = f"[VOICE:{record['voice']}] [TONE:{record['tone']}]"
    # Use for seq2seq training with style prefix
```

### Text Simplification

Filter for simplified variants:

```python
# Get accessible versions
simplified = dataset["train"].filter(
    lambda x: x["voice"] in ["teen_explainer", "friendly_teacher"]
    and x["length_label"] == "short"
)
```

### Audience-Adaptive Generation

Generate content for specific audiences:

```python
# Get policy-oriented content
policy_variants = dataset["train"].filter(
    lambda x: x["voice"] == "policy_brief"
)
```

### Paraphrase Mining

Extract semantically equivalent pairs:

```python
# Same word, different styles = paraphrases
from collections import defaultdict

by_word = defaultdict(list)
for record in dataset["train"]:
    by_word[record["word"]].append(record)
```

## Citation

If you use this dataset in your research, please cite:

```bibtex
@misc{bommarito2025opengloss_variants,
  title={OpenGloss Encyclopedia Variants: Multi-Style Vocabulary Explanations},
  author={Bommarito, Michael J., II},
  year={2025},
  url={https://huggingface.co/datasets/mjbommar/opengloss-v1.3-encyclopedia-variants},
  note={Dataset available under CC-BY 4.0}
}
```

## License

This dataset is released under **Creative Commons Attribution 4.0 International (CC-BY 4.0)**.
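
As a follow-up to the Style Transfer Training use case, the sketch below turns one record into a seq2seq example by prepending the style prefix to the base entry. The function name `build_style_pair` and the output keys `input` and `target` are assumptions made for illustration and are not part of the dataset. The prefix format follows the card's own snippet, and the toy record uses shortened values under the field names shown in the example record.

```python
def build_style_pair(record: dict) -> dict:
    """Return a seq2seq example whose input carries a style prefix.

    Hypothetical helper: the "input"/"target" key names are illustrative.
    """
    # Prefix format taken from the card's snippet: [VOICE:...] [TONE:...]
    prefix = f"[VOICE:{record['voice']}] [TONE:{record['tone']}]"
    return {
        "input": f"{prefix} {record['base_entry']}",
        "target": record["variant_text"],
    }


# Toy record with the dataset's field names (values shortened for the demo).
example = {
    "voice": "storyteller",
    "tone": "narrative_hook",
    "base_entry": "An orange twist is a slender spiral of orange peel.",
    "variant_text": "What is an orange twist, really?",
}

pair = build_style_pair(example)
print(pair["input"])
print(pair["target"])
```

With the Hugging Face `datasets` library, a function of this shape can usually be passed to `dataset["train"].map(...)`, which would add the two new columns to each row.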

## Related Datasets

- [OpenGloss v1.3 Dictionary](https://huggingface.co/datasets/mjbommar/opengloss-v1.3-dictionary) - Word-level records
- [OpenGloss v1.3 Definitions](https://huggingface.co/datasets/mjbommar/opengloss-v1.3-definitions) - Definition-level records
- [OpenGloss v1.3 Contrastive Examples](https://huggingface.co/datasets/mjbommar/opengloss-v1.3-contrastive-examples) - Semantic gradients
- [OpenGloss v1.3 Query Examples](https://huggingface.co/datasets/mjbommar/opengloss-v1.3-query-examples) - Query-side retrieval supervision
- [OpenGloss v1.3 Hard Negative Pairs](https://huggingface.co/datasets/mjbommar/opengloss-v1.3-hard-negative-pairs) - Calibration pairs for embedding training

## Acknowledgments

This dataset was generated using:

- [OpenGloss](https://huggingface.co/datasets/mjbommar/opengloss-v1.3-definitions) lexicon data
- OpenAI GPT models for style-varied generation
- [pydantic-ai](https://github.com/pydantic/pydantic-ai) for structured generation

---

*Generated from the OpenGloss v1.3 lexicon.*
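
Returning to the Paraphrase Mining use case, the grouping step can be extended to emit actual text pairs. The sketch below is a minimal illustration on toy records: the helper name `paraphrase_pairs` and the toy values are assumptions, and only the `word` and `variant_text` field names come from the card. Since the Key Statistics report 133,094 unique words across 133,135 records, only a small number of words can have more than one variant, so few pairs should be expected from this release.

```python
from collections import defaultdict
from itertools import combinations


def paraphrase_pairs(records):
    """Yield (word, text_a, text_b) for every word with two or more variants.

    Hypothetical helper; `records` is any iterable of dicts with
    "word" and "variant_text" keys.
    """
    by_word = defaultdict(list)
    for record in records:
        by_word[record["word"]].append(record["variant_text"])
    for word, texts in by_word.items():
        # combinations() yields nothing when a word has a single variant.
        for text_a, text_b in combinations(texts, 2):
            yield word, text_a, text_b


# Toy records (illustrative values, not taken from the dataset).
toy_records = [
    {"word": "orange twist", "variant_text": "variant A"},
    {"word": "orange twist", "variant_text": "variant B"},
    {"word": "zest", "variant_text": "variant C"},
]

pairs = list(paraphrase_pairs(toy_records))
print(pairs)
```

Passing `dataset["train"]` instead of `toy_records` should work the same way, since iterating a split yields one dict per record.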
Provider:
mjbommar