
mjbommar/opengloss-v1.2-encyclopedia-variants

Source: Hugging Face. Updated 2026-04-08. Indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/mjbommar/opengloss-v1.2-encyclopedia-variants
Resource description:
---
license: cc-by-4.0
task_categories:
- text-generation
- summarization
language:
- en
tags:
- style-transfer
- paraphrase
- text-simplification
- encyclopedia
- lexicon
- synthetic
- education
- opengloss
- writing-style
size_categories:
- 10K<n<100K
---

# OpenGloss Encyclopedia Variants v1.2

## Dataset Summary

**OpenGloss Encyclopedia Variants** is a synthetic dataset of vocabulary encyclopedia entries rewritten in multiple writing styles. Each record contains an academic base entry alongside a variant rewritten for a specific audience, tone, and content structure.

This dataset supports style transfer, text simplification, paraphrase generation, and audience-adaptive content creation. It is derived from the [OpenGloss](https://huggingface.co/datasets/mjbommar/opengloss-v1.2-definitions) encyclopedic dictionary.

### Key Statistics

- **21,487 encyclopedia variants**
- **21,388 unique words/phrases**
- **21,485 unique lexemes**
- **10 writing voices** × **9 tones** × **8 framings** × **3 lengths**
- **Average base entry: 260 words** (1899 chars)
- **Average variant: 228 words** (1510 chars)

### Style Dimensions

Each variant is characterized by four style dimensions:

| Dimension | Values | Description |
|-----------|--------|-------------|
| **Voice** | 10 | Writing persona (teen_explainer, tech_writer, storyteller, etc.) |
| **Tone** | 9 | Emotional register (myth_buster, narrative_hook, q_and_a, etc.) |
| **Framing** | 8 | Content structure (faq, origin_story, comparisons, etc.) |
| **Length** | 3 | Content length (short, medium, long) |

### POS Distribution

| Part of Speech | Count |
|----------------|-------|
| noun | 18,533 |
| adjective | 1,734 |
| verb | 947 |
| adverb | 205 |
| determiner | 26 |
| preposition | 19 |
| interjection | 12 |
| pronoun | 7 |
| conjunction | 4 |

### Voice Distribution

| Voice | Count |
|-------|-------|
| teen_explainer | 8,047 |
| faq_writer | 1,551 |
| podcast_host | 1,515 |
| policy_brief | 1,511 |
| museum_docent | 1,497 |
| friendly_teacher | 1,484 |
| science_journalist | 1,483 |
| storyteller | 1,481 |
| coach | 1,461 |
| tech_writer | 1,457 |

### Tone Distribution

| Tone | Count |
|------|-------|
| punchy_bullet_digest | 8,136 |
| myth_buster | 1,736 |
| plain_neutral | 1,706 |
| informal_conversational | 1,690 |
| narrative_hook | 1,659 |
| analogy_forward | 1,655 |
| practical_howto | 1,648 |
| q_and_a | 1,630 |
| step_by_step | 1,627 |

### Framing Distribution

| Framing | Count |
|---------|-------|
| myth_vs_fact | 8,466 |
| origin_story | 1,906 |
| comparisons | 1,899 |
| use_cases | 1,893 |
| why_it_matters | 1,874 |
| overview | 1,838 |
| how_it_works | 1,827 |
| faq | 1,784 |

### Length Distribution

| Length | Count |
|--------|-------|
| long | 11,453 |
| medium | 5,050 |
| short | 4,984 |

## Loading the Dataset

```python
from datasets import load_dataset

# Load the full dataset
dataset = load_dataset("mjbommar/opengloss-v1.2-encyclopedia-variants")

# Access records
for record in dataset["train"]:
    print(f"Word: {record['word']}")
    print(f"Voice: {record['voice']} | Tone: {record['tone']}")
    print(f"Base: {record['base_entry'][:200]}...")
    print(f"Variant: {record['variant_text'][:200]}...\n")
```

## Example Record

```python
{
    "id": "orange_twist_storyteller_narrative_hook_faq_short",
    "word": "orange twist",
    "lexeme_id": "orange_twist",
    "pos": "noun",
    "base_entry": "### **Orange twist**\n\nAn **orange twist** is a slender spiral of orange peel...",
    "variant_text": "What is an orange twist, really? Not just a curl of peel, but a tiny bit of theater...",
    "style": {
        "voice": "storyteller",
        "tone": "narrative_hook",
        "framing": "faq",
        "length_label": "short"
    },
    "voice": "storyteller",
    "tone": "narrative_hook",
    "framing": "faq",
    "length_label": "short"
}
```

## Use Cases

### Style Transfer Training

Train models to rewrite text in different styles:

```python
# Create style-conditioned pairs
for record in dataset["train"]:
    source = record["base_entry"]
    target = record["variant_text"]
    style = f"[VOICE:{record['voice']}] [TONE:{record['tone']}]"
    # Use for seq2seq training with style prefix
```

### Text Simplification

Filter for simplified variants:

```python
# Get accessible versions
simplified = dataset["train"].filter(
    lambda x: x["voice"] in ["teen_explainer", "friendly_teacher"]
    and x["length_label"] == "short"
)
```

### Audience-Adaptive Generation

Generate content for specific audiences:

```python
# Get policy-oriented content
policy_variants = dataset["train"].filter(
    lambda x: x["voice"] == "policy_brief"
)
```

### Paraphrase Mining

Extract semantically equivalent pairs:

```python
# Same word, different styles = paraphrases
from collections import defaultdict

by_word = defaultdict(list)
for record in dataset["train"]:
    by_word[record["word"]].append(record)
```

## Citation

If you use this dataset in your research, please cite:

```bibtex
@misc{bommarito2025opengloss_variants,
  title={OpenGloss Encyclopedia Variants: Multi-Style Vocabulary Explanations},
  author={Bommarito, Michael J., II},
  year={2025},
  url={https://huggingface.co/datasets/mjbommar/opengloss-v1.2-encyclopedia-variants},
  note={Dataset available under CC-BY 4.0}
}
```

## License

This dataset is released under **Creative Commons Attribution 4.0 International (CC-BY 4.0)**.

## Related Datasets

- [OpenGloss v1.2 Dictionary](https://huggingface.co/datasets/mjbommar/opengloss-v1.2-dictionary) - Word-level records
- [OpenGloss v1.2 Definitions](https://huggingface.co/datasets/mjbommar/opengloss-v1.2-definitions) - Definition-level records
- [OpenGloss v1.2 Contrastive Examples](https://huggingface.co/datasets/mjbommar/opengloss-v1.2-contrastive-examples) - Semantic gradients
- [OpenGloss v1.2 Query Examples](https://huggingface.co/datasets/mjbommar/opengloss-v1.2-query-examples) - Query-side retrieval supervision
- [OpenGloss v1.2 Hard Negative Pairs](https://huggingface.co/datasets/mjbommar/opengloss-v1.2-hard-negative-pairs) - Calibration pairs for embedding training

## Acknowledgments

This dataset was generated using:

- [OpenGloss](https://huggingface.co/datasets/mjbommar/opengloss-v1.2-definitions) lexicon data
- OpenAI GPT models for style-varied generation
- [pydantic-ai](https://github.com/pydantic/pydantic-ai) for structured generation

---

*Generated from the OpenGloss v1.2 lexicon.*
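
As a follow-up to the Style Transfer Training and Paraphrase Mining use cases, here is a minimal sketch of how each record could be turned into a style-prefixed `(source, target)` pair. The helper names `style_prefix`, `to_training_pair`, and `words_with_multiple_variants` are hypothetical. The `[FRAMING:...]` and `[LENGTH:...]` tags are an assumed extension of the `[VOICE:...] [TONE:...]` prefix shown in that use case, not a format the dataset prescribes. Because the Key Statistics list 21,487 records over 21,388 unique words, most words likely appear only once, so the base/variant pair inside each record is probably the more practical paraphrase source than grouping by word. The last helper simply reports which words do repeat.

```python
from collections import Counter

# Record fields taken from the Example Record, mapped to assumed tag names.
STYLE_TAGS = {
    "voice": "VOICE",
    "tone": "TONE",
    "framing": "FRAMING",
    "length_label": "LENGTH",
}


def style_prefix(record):
    """Render the four style fields as control tags, e.g. [VOICE:storyteller]."""
    return " ".join(f"[{tag}:{record[field]}]" for field, tag in STYLE_TAGS.items())


def to_training_pair(record):
    """Return (source, target): the style-prefixed base entry and its variant."""
    source = f"{style_prefix(record)} {record['base_entry']}"
    return source, record["variant_text"]


def words_with_multiple_variants(records):
    """Map each word that occurs in more than one record to its record count."""
    counts = Counter(record["word"] for record in records)
    return {word: n for word, n in counts.items() if n > 1}


# Sample copied from the Example Record; the texts are truncated there too.
sample = {
    "word": "orange twist",
    "base_entry": "### **Orange twist**\n\nAn **orange twist** is a slender spiral of orange peel...",
    "variant_text": "What is an orange twist, really? Not just a curl of peel, but a tiny bit of theater...",
    "voice": "storyteller",
    "tone": "narrative_hook",
    "framing": "faq",
    "length_label": "short",
}

source, target = to_training_pair(sample)
print(source.splitlines()[0])
print(target[:40])
```

With the real dataset, the same helpers should work on rows of `dataset["train"]`, since they only read the top-level fields shown in the Example Record.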
Provider:
mjbommar