
mjbommar/opengloss-v1.2-contrastive-examples

Source: Hugging Face. Updated 2026-04-08; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/mjbommar/opengloss-v1.2-contrastive-examples
Resource description:
---
license: cc-by-4.0
task_categories:
- text-generation
- sentence-similarity
- feature-extraction
language:
- en
tags:
- contrastive-learning
- semantic-similarity
- lexicon
- synonym
- antonym
- gradient
- synthetic
- education
- opengloss
size_categories:
- 10K<n<100K
---

# OpenGloss Contrastive Examples v1.2

## Dataset Summary

**OpenGloss Contrastive Examples** is a synthetic dataset of graduated semantic variations designed for contrastive learning and semantic similarity training. Each example contains a source sentence and a 5-point semantic gradient showing how meaning shifts from antonym to synonym poles.

This dataset is derived from the [OpenGloss](https://huggingface.co/datasets/mjbommar/opengloss-v1.2-definitions) encyclopedic dictionary, using example sentences and their lexical context to generate semantically coherent variations.

### Key Statistics

- **73,153 contrastive examples**
- **53,347 unique words/phrases**
- **53,446 unique lexemes**
- **5 gradient points per example** (0.0, 0.25, 0.5, 0.75, 1.0)
- **365,765 total sentence variations**
- **Average source sentence length: 10.8 words**
- **Average gradient sentence length: 11.1 words**

### Gradient Structure

Each example contains a 5-point semantic gradient:

| Position | Semantic Pole | Description |
|----------|---------------|-------------|
| 0.0 | Antonym | Semantic opposite of the target word |
| 0.25 | Near-antonym | Closer to opposite meaning |
| 0.5 | Neutral | Middle ground, balanced meaning |
| 0.75 | Near-synonym | Closer to original meaning |
| 1.0 | Synonym | Original word or close synonym |

### POS Distribution

| Part of Speech | Count |
|----------------|-------|
| noun | 51,182 |
| adjective | 12,076 |
| verb | 8,086 |
| adverb | 984 |
| determiner | 263 |
| preposition | 253 |
| interjection | 175 |
| pronoun | 78 |
| conjunction | 40 |
| proper noun | 8 |
| particle | 5 |
| numeral | 2 |
| abbreviation | 1 |

## Loading the Dataset

```python
from datasets import load_dataset
# Load the full dataset
dataset = load_dataset("mjbommar/opengloss-v1.2-contrastive-examples")

# Access records
for record in dataset["train"]:
    print(f"Word: {record['word']}")
    print(f"Source: {record['source_sentence']}")
    print(f"Antonym: {record['antonym_word']} → {record['antonym_sentence']}")
    print(f"Synonym: {record['synonym_word']} → {record['synonym_sentence']}\n")
```

## Example Record

```python
{
    "id": "logjam_1_1_1",
    "word": "logjam",
    "lexeme_id": "logjam",
    "pos": "verb",
    "source_sentence": "Funding pipelines logjam during policy transitions.",
    "gradient": [
        {"position": 0.0, "word": "flow", "sentence": "Funding pipelines flow smoothly during policy transitions."},
        {"position": 0.25, "word": "continue", "sentence": "Funding pipelines continue during policy transitions."},
        {"position": 0.5, "word": "slow", "sentence": "Funding pipelines slow during policy transitions."},
        {"position": 0.75, "word": "stall", "sentence": "Funding pipelines stall during policy transitions."},
        {"position": 1.0, "word": "logjam", "sentence": "Funding pipelines logjam during policy transitions."}
    ],
    "antonym_word": "flow",
    "antonym_sentence": "Funding pipelines flow smoothly during policy transitions.",
    "near_antonym_word": "continue",
    "near_antonym_sentence": "Funding pipelines continue during policy transitions.",
    "neutral_word": "slow",
    "neutral_sentence": "Funding pipelines slow during policy transitions.",
    "near_synonym_word": "stall",
    "near_synonym_sentence": "Funding pipelines stall during policy transitions.",
    "synonym_word": "logjam",
    "synonym_sentence": "Funding pipelines logjam during policy transitions."
}
```

## Use Cases

### Contrastive Learning

Train models to understand semantic gradients and fine-grained meaning differences:

```python
# Create contrastive pairs
pairs = []
for record in dataset["train"]:
    # Pair antonym with synonym for maximum contrast
    pairs.append((record["antonym_sentence"], record["synonym_sentence"], 0.0))
    # Pair neutral with synonym for medium contrast
    pairs.append((record["neutral_sentence"], record["synonym_sentence"], 0.5))
```

### Semantic Similarity

Fine-tune embedding models with graduated similarity labels:

```python
# Each gradient position maps to a similarity score
for record in dataset["train"]:
    for point in record["gradient"]:
        similarity = point["position"]  # 0.0 to 1.0
        sentence = point["sentence"]
```

### Data Augmentation

Use gradient variations for text augmentation while controlling semantic shift:

```python
# Get semantically similar alternatives (position >= 0.75)
def get_similar_sentences(record):
    return [p["sentence"] for p in record["gradient"] if p["position"] >= 0.75]
```

## Citation

If you use this dataset in your research, please cite:

```bibtex
@misc{bommarito2025opengloss_contrastive,
  title={OpenGloss Contrastive Examples: Graduated Semantic Variations for Contrastive Learning},
  author={Bommarito, Michael J., II},
  year={2025},
  url={https://huggingface.co/datasets/mjbommar/opengloss-v1.2-contrastive-examples},
  note={Dataset available under CC-BY 4.0}
}
```

## License

This dataset is released under **Creative Commons Attribution 4.0 International (CC-BY 4.0)**.
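
The contrastive-pair snippet under Use Cases keeps two pairs per record. A fuller expansion could pair every two gradient points, as in the minimal sketch below. The scoring rule (1 minus the distance between the two positions) and the helper name `gradient_pairs` are assumptions for illustration, not part of the dataset. The sample is the `logjam` example record, trimmed to its `gradient` field so the sketch runs without downloading anything.

```python
from itertools import combinations

# Trimmed copy of the "logjam" example record from the card.
record = {
    "word": "logjam",
    "gradient": [
        {"position": 0.0, "word": "flow",
         "sentence": "Funding pipelines flow smoothly during policy transitions."},
        {"position": 0.25, "word": "continue",
         "sentence": "Funding pipelines continue during policy transitions."},
        {"position": 0.5, "word": "slow",
         "sentence": "Funding pipelines slow during policy transitions."},
        {"position": 0.75, "word": "stall",
         "sentence": "Funding pipelines stall during policy transitions."},
        {"position": 1.0, "word": "logjam",
         "sentence": "Funding pipelines logjam during policy transitions."},
    ],
}


def gradient_pairs(rec):
    """Return (sentence_a, sentence_b, score) for every pair of gradient points.

    Assumed scoring rule: score = 1 - |position_a - position_b|.
    """
    points = sorted(rec["gradient"], key=lambda p: p["position"])
    return [
        (a["sentence"], b["sentence"], 1.0 - abs(a["position"] - b["position"]))
        for a, b in combinations(points, 2)
    ]


pairs = gradient_pairs(record)
print(len(pairs))  # 10 pairs from 5 gradient points
```

Under this rule the antonym/synonym pair scores 0.0 and the neutral/synonym pair scores 0.5, which matches the labels used in the Contrastive Learning snippet; the intermediate scores are an extrapolation and may need tuning for a given training objective.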

## Related Datasets

- [OpenGloss v1.2 Dictionary](https://huggingface.co/datasets/mjbommar/opengloss-v1.2-dictionary) - Word-level records
- [OpenGloss v1.2 Definitions](https://huggingface.co/datasets/mjbommar/opengloss-v1.2-definitions) - Definition-level records
- [OpenGloss v1.2 Query Examples](https://huggingface.co/datasets/mjbommar/opengloss-v1.2-query-examples) - Query-side retrieval supervision
- [OpenGloss v1.2 Hard Negative Pairs](https://huggingface.co/datasets/mjbommar/opengloss-v1.2-hard-negative-pairs) - Calibration pairs for embedding training

## Acknowledgments

This dataset was generated using:

- [OpenGloss](https://huggingface.co/datasets/mjbommar/opengloss-v1.2-definitions) lexicon data
- OpenAI GPT models for gradient generation
- [pydantic-ai](https://github.com/pydantic/pydantic-ai) for structured generation

---

*Generated from the OpenGloss v1.2 lexicon.*
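
The example record stores each gradient point twice: once in the `gradient` list and once in flat fields such as `antonym_word` and `near_synonym_sentence`. Because the data is synthetic, a consumer may want to confirm that the two views agree before training on them. The sketch below shows one way to do that. `check_record` and `POLE_BY_POSITION` are hypothetical names, the field names follow the example record, and whether every record passes is an assumption to verify rather than a documented guarantee.

```python
# Flat-field prefix for each gradient position, following the example record.
POLE_BY_POSITION = {
    0.0: "antonym",
    0.25: "near_antonym",
    0.5: "neutral",
    0.75: "near_synonym",
    1.0: "synonym",
}


def check_record(rec):
    """Return a list of problems; an empty list means the record is consistent."""
    problems = []
    positions = [p["position"] for p in rec["gradient"]]
    if sorted(positions) != sorted(POLE_BY_POSITION):
        problems.append(f"unexpected positions: {positions}")
    for point in rec["gradient"]:
        pole = POLE_BY_POSITION.get(point["position"])
        if pole is None:
            continue  # already reported as an unexpected position
        for field in ("word", "sentence"):
            if rec.get(f"{pole}_{field}") != point[field]:
                problems.append(f"{pole}_{field} does not match the gradient")
    return problems


# The "logjam" example record, reduced to the fields the check reads.
template = "Funding pipelines {} during policy transitions."
sample = {
    "gradient": [
        {"position": 0.0, "word": "flow", "sentence": template.format("flow smoothly")},
        {"position": 0.25, "word": "continue", "sentence": template.format("continue")},
        {"position": 0.5, "word": "slow", "sentence": template.format("slow")},
        {"position": 0.75, "word": "stall", "sentence": template.format("stall")},
        {"position": 1.0, "word": "logjam", "sentence": template.format("logjam")},
    ],
    "antonym_word": "flow",
    "antonym_sentence": "Funding pipelines flow smoothly during policy transitions.",
    "near_antonym_word": "continue",
    "near_antonym_sentence": "Funding pipelines continue during policy transitions.",
    "neutral_word": "slow",
    "neutral_sentence": "Funding pipelines slow during policy transitions.",
    "near_synonym_word": "stall",
    "near_synonym_sentence": "Funding pipelines stall during policy transitions.",
    "synonym_word": "logjam",
    "synonym_sentence": "Funding pipelines logjam during policy transitions.",
}

print(check_record(sample))  # [] for the example record

# A deliberately corrupted copy shows what a failure looks like.
broken = dict(sample, neutral_word="pause")
print(check_record(broken))  # ['neutral_word does not match the gradient']
```

Applied over `dataset["train"]`, the same function could be used to count or filter inconsistent records before building training pairs.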
Provider:
mjbommar