mjbommar/opengloss-v1.1-encyclopedia-variants

Source: Hugging Face · Updated 2025-12-05 · Indexed 2025-12-20
Download link:
https://hf-mirror.com/datasets/mjbommar/opengloss-v1.1-encyclopedia-variants
Description:
---
license: cc-by-4.0
task_categories:
- text-generation
- text2text-generation
- summarization
language:
- en
tags:
- style-transfer
- paraphrase
- text-simplification
- encyclopedia
- lexicon
- synthetic
- education
- opengloss
- writing-style
size_categories:
- 1K<n<10K
---

# OpenGloss Encyclopedia Variants v1.1

## Dataset Summary

**OpenGloss Encyclopedia Variants** is a synthetic dataset of vocabulary encyclopedia entries rewritten in multiple writing styles. Each record contains an academic base entry alongside a variant rewritten for a specific audience, tone, and content structure.

This dataset supports style transfer, text simplification, paraphrase generation, and audience-adaptive content creation. It is derived from the [OpenGloss](https://huggingface.co/datasets/mjbommar/opengloss-v1.1-definitions) encyclopedic dictionary.

### Key Statistics

- **9,963 encyclopedia variants**
- **9,963 unique words/phrases**
- **9,963 unique lexemes**
- **10 writing voices** × **9 tones** × **8 framings** × **3 lengths**
- **Average base entry: 335 words** (2494 chars)
- **Average variant: 216 words** (1434 chars)

### Style Dimensions

Each variant is characterized by four style dimensions:

| Dimension | Values | Description |
|-----------|--------|-------------|
| **Voice** | 10 | Writing persona (teen_explainer, tech_writer, storyteller, etc.) |
| **Tone** | 9 | Emotional register (myth_buster, narrative_hook, q_and_a, etc.) |
| **Framing** | 8 | Content structure (faq, origin_story, comparisons, etc.) |
| **Length** | 3 | Content length (short, medium, long) |

### POS Distribution

| Part of Speech | Count |
|----------------|-------|
| noun | 7,043 |
| adjective | 1,703 |
| verb | 946 |
| adverb | 205 |
| determiner | 26 |
| preposition | 19 |
| interjection | 10 |
| pronoun | 7 |
| conjunction | 4 |

### Voice Distribution

| Voice | Count |
|-------|-------|
| teen_explainer | 1,031 |
| policy_brief | 1,013 |
| faq_writer | 1,004 |
| museum_docent | 1,003 |
| science_journalist | 1,000 |
| friendly_teacher | 999 |
| tech_writer | 994 |
| podcast_host | 988 |
| storyteller | 969 |
| coach | 962 |

### Tone Distribution

| Tone | Count |
|------|-------|
| myth_buster | 1,155 |
| informal_conversational | 1,130 |
| plain_neutral | 1,129 |
| narrative_hook | 1,117 |
| punchy_bullet_digest | 1,113 |
| analogy_forward | 1,097 |
| practical_howto | 1,091 |
| q_and_a | 1,071 |
| step_by_step | 1,060 |

### Framing Distribution

| Framing | Count |
|---------|-------|
| myth_vs_fact | 1,283 |
| comparisons | 1,275 |
| use_cases | 1,270 |
| why_it_matters | 1,260 |
| overview | 1,234 |
| how_it_works | 1,223 |
| origin_story | 1,220 |
| faq | 1,198 |

### Length Distribution

| Length | Count |
|--------|-------|
| medium | 3,337 |
| short | 3,326 |
| long | 3,300 |

## Loading the Dataset

```python
from datasets import load_dataset

# Load the full dataset
dataset = load_dataset("mjbommar/opengloss-v1.1-encyclopedia-variants")

# Access records
for record in dataset["train"]:
    print(f"Word: {record['word']}")
    print(f"Voice: {record['voice']} | Tone: {record['tone']}")
    print(f"Base: {record['base_entry'][:200]}...")
    print(f"Variant: {record['variant_text'][:200]}...\n")
```

## Example Record

```python
{
    "id": "orange_twist_storyteller_narrative_hook_faq_short",
    "word": "orange twist",
    "lexeme_id": "orange_twist",
    "pos": "noun",
    "base_entry": "### **Orange twist**\n\nAn **orange twist** is a slender spiral of orange peel...",
    "variant_text": "What is an orange twist, really? Not just a curl of peel, but a tiny bit of theater...",
    "style": {
        "voice": "storyteller",
        "tone": "narrative_hook",
        "framing": "faq",
        "length_label": "short"
    },
    "voice": "storyteller",
    "tone": "narrative_hook",
    "framing": "faq",
    "length_label": "short"
}
```

## Use Cases

### Style Transfer Training

Train models to rewrite text in different styles:

```python
# Create style-conditioned pairs
for record in dataset["train"]:
    source = record["base_entry"]
    target = record["variant_text"]
    style = f"[VOICE:{record['voice']}] [TONE:{record['tone']}]"
    # Use for seq2seq training with style prefix
```

### Text Simplification

Filter for simplified variants:

```python
# Get accessible versions
simplified = dataset["train"].filter(
    lambda x: x["voice"] in ["teen_explainer", "friendly_teacher"]
    and x["length_label"] == "short"
)
```

### Audience-Adaptive Generation

Generate content for specific audiences:

```python
# Get policy-oriented content
policy_variants = dataset["train"].filter(
    lambda x: x["voice"] == "policy_brief"
)
```

### Paraphrase Mining

Extract semantically equivalent pairs:

```python
# Same word, different styles = paraphrases
from collections import defaultdict

by_word = defaultdict(list)
for record in dataset["train"]:
    by_word[record["word"]].append(record)
```

## Citation

If you use this dataset in your research, please cite:

```bibtex
@misc{bommarito2025opengloss_variants,
  title={OpenGloss Encyclopedia Variants: Multi-Style Vocabulary Explanations},
  author={Bommarito, Michael J., II},
  year={2025},
  url={https://huggingface.co/datasets/mjbommar/opengloss-v1.1-encyclopedia-variants},
  note={Dataset available under CC-BY 4.0}
}
```

## License

This dataset is released under **Creative Commons Attribution 4.0 International (CC-BY 4.0)**.

## Related Datasets

- [OpenGloss v1.1 Dictionary](https://huggingface.co/datasets/mjbommar/opengloss-v1.1-dictionary) - Word-level records
- [OpenGloss v1.1 Definitions](https://huggingface.co/datasets/mjbommar/opengloss-v1.1-definitions) - Definition-level records
- [OpenGloss v1.1 Contrastive Examples](https://huggingface.co/datasets/mjbommar/opengloss-v1.1-contrastive-examples) - Semantic gradients

## Acknowledgments

This dataset was generated using:

- [OpenGloss](https://huggingface.co/datasets/mjbommar/opengloss-v1.1-definitions) lexicon data
- OpenAI GPT models for style-varied generation
- [pydantic-ai](https://github.com/pydantic/pydantic-ai) for structured generation

---

*Generated from the OpenGloss v1.1 lexicon.*
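
## Appendix: Style-Prefix Sketch

The style-conditioned pairing described under Style Transfer Training can be sketched as a small helper that works on one record at a time. This is a minimal illustration, not the author's training pipeline: the function names `style_prefix` and `to_seq2seq_pair` are hypothetical, and the `[FRAMING:...]` and `[LENGTH:...]` tags are an assumed extension of the `[VOICE:...] [TONE:...]` prefix shown in the card. The sample record reuses the truncated texts from the Example Record section.

```python
from typing import Dict, Tuple

# Field names come from the dataset card's example record.
# Tagging framing and length as well is an assumption; the card only shows VOICE and TONE.
STYLE_FIELDS = ("voice", "tone", "framing", "length_label")


def style_prefix(record: Dict[str, str]) -> str:
    """Build a control-token prefix from the four flat style fields of a record."""
    parts = []
    for field in STYLE_FIELDS:
        tag = field.replace("_label", "").upper()  # "length_label" -> "LENGTH"
        parts.append(f"[{tag}:{record[field]}]")
    return " ".join(parts)


def to_seq2seq_pair(record: Dict[str, str]) -> Tuple[str, str]:
    """Return (source, target): the prefixed base entry and the styled variant."""
    source = f"{style_prefix(record)} {record['base_entry']}"
    return source, record["variant_text"]


# Sample record with the (truncated) texts from the card's example.
example = {
    "word": "orange twist",
    "base_entry": "### **Orange twist**\n\nAn **orange twist** is a slender spiral of orange peel...",
    "variant_text": "What is an orange twist, really? Not just a curl of peel, but a tiny bit of theater...",
    "voice": "storyteller",
    "tone": "narrative_hook",
    "framing": "faq",
    "length_label": "short",
}

source, target = to_seq2seq_pair(example)
print(style_prefix(example))
```

With the dataset loaded as shown in Loading the Dataset, the same helper could be applied to each record in `dataset["train"]` to build training pairs. Whether a model benefits from all four tags or only a subset is an open design choice.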
Provider:
mjbommar