mjbommar/ogbert-v1-contrastive

Name: mjbommar/ogbert-v1-contrastive
Creator: mjbommar
Published: 2025-12-07 02:04:34
License: No description provided

Source: Hugging Face. Updated 2025-12-07. Indexed 2025-12-20.

Download link:

https://hf-mirror.com/datasets/mjbommar/ogbert-v1-contrastive

Resource description:

---
language:
- en
license: cc-by-4.0
task_categories:
- sentence-similarity
- text-retrieval
task_ids:
- semantic-similarity-scoring
pretty_name: OGBert Contrastive Dataset
size_categories:
- 1M<n<10M
tags:
- contrastive-learning
- semantic-similarity
- word-embeddings
- modernbert
- lexical
- retrieval
---

# OGBert Contrastive Combined Dataset

This dataset combines contrastive learning signals from two sources:

1. **Contrastive Examples**: Gradient-based semantic similarity pairs
2. **Definitions**: Word-level semantic relationships (synonyms, antonyms, definitions, etc.)

## Dataset Statistics

- **Total pairs**: 9,358,022
- **Training pairs**: 8,890,120
- **Evaluation pairs**: 467,902

### Breakdown by Source

**Contrastive Dataset** (500,000 pairs):

- Gradient signals: All C(5,2) = 10 pairwise combinations of semantic gradient positions
- Positions: 0.0 (antonym), 0.25 (near-antonym), 0.5 (neutral), 0.75 (near-synonym), 1.0 (synonym)

**Definitions Dataset** (8,858,022 pairs):

- Word ↔ Definition: 8,886
- Word ↔ Examples: 959,203
- Definition ↔ Examples: 959,135
- Word ↔ Synonyms: 1,416,611
- Word ↔ Antonyms: 965,276
- Word ↔ Hypernyms: 947,799
- Word ↔ Hyponyms: 1,246,194

## Schema

Each row contains:

- `source_id` (string): Source record identifier for gradient pairs (keeps related pairs together in batches)
- `text_a` (string): First text in the pair
- `text_b` (string): Second text in the pair
- `label` (float): Continuous similarity score in [0.0, 1.0]
  - 0.0 = maximum dissimilarity (antonyms)
  - 1.0 = maximum similarity (synonyms/identical)
- `weight` (float): Training importance weight (always 1.0; all examples are equally weighted)
- `signal_type` (string): Source signal type (e.g., "word_synonym", "gradient_0.00_0.75", etc.)

**Important for CoSENT Loss**: Gradient pairs (`signal_type` starting with "gradient_") share the same `source_id`. The dataset is sorted by `source_id` so that the 10 pairwise combinations from each gradient stay adjacent. This enables proper cross-pair comparisons in CoSENT/ranking losses that compute pairwise similarities within batches.

## Label Semantics

The `label` field represents **semantic similarity**:

- **0.0**: Opposite meanings (antonyms)
- **0.65**: Hierarchical relationship (hypernym/hyponym)
- **0.75**: Contextual similarity (word in example)
- **0.85**: Definition grounded in example
- **0.9**: Near-synonyms
- **1.0**: Perfect semantic equivalence

Gradient pairs have labels computed as `1.0 - |pos_a - pos_b|`, where the positions lie on the 0.0-1.0 semantic scale.

## Usage

### Training a contrastive model

```python
from datasets import load_dataset
from torch.utils.data import DataLoader

# Load dataset
# Repo ID as written in the card; this listing names the dataset
# "mjbommar/ogbert-v1-contrastive".
dataset = load_dataset("mjbommar/ogbert-contrastive-combined-v1")
train_data = dataset["train"]

# Use with your contrastive learning model
# Labels are continuous similarity scores - use MSE loss
```

### Filtering by signal type

```python
# Get only synonym pairs
synonyms = dataset["train"].filter(lambda x: x["signal_type"] == "word_synonym")

# Get only gradient pairs
gradients = dataset["train"].filter(lambda x: x["signal_type"].startswith("gradient_"))

# Get strong positives (similarity > 0.8)
strong_pos = dataset["train"].filter(lambda x: x["label"] > 0.8)
```

## Source Datasets

- [mjbommar/opengloss-v1.1-contrastive-examples](https://huggingface.co/datasets/mjbommar/opengloss-v1.1-contrastive-examples)
- [mjbommar/opengloss-v1.1-definitions](https://huggingface.co/datasets/mjbommar/opengloss-v1.1-definitions)

## License

Same as source datasets (OpenGloss project).

## Citation

If you use this dataset, please cite the original OpenGloss project and datasets.
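
To illustrate the gradient-pair labeling described under Label Semantics, the following minimal sketch enumerates the C(5,2) = 10 combinations of the five gradient positions and applies `1.0 - |pos_a - pos_b|` to each. The function name `gradient_pairs` is hypothetical, and the `signal_type` naming pattern is an assumption inferred from the card's single example, "gradient_0.00_0.75". It is not taken from the dataset's build code.

```python
from itertools import combinations

# Positions on the semantic gradient, as listed in the card:
# 0.0 (antonym) through 1.0 (synonym).
POSITIONS = [0.0, 0.25, 0.5, 0.75, 1.0]


def gradient_pairs(positions=POSITIONS):
    """Return (signal_type, label) for every unordered pair of positions.

    Hypothetical helper: the label follows the card's formula
    1.0 - |pos_a - pos_b|; the signal_type format is an assumption
    based on the card's example "gradient_0.00_0.75".
    """
    pairs = []
    for pos_a, pos_b in combinations(positions, 2):
        label = 1.0 - abs(pos_a - pos_b)
        signal_type = f"gradient_{pos_a:.2f}_{pos_b:.2f}"
        pairs.append((signal_type, label))
    return pairs


pairs = gradient_pairs()
for signal_type, label in pairs:
    print(f"{signal_type}: {label:.2f}")
```

Under this formula, the antonym/synonym pair (positions 0.0 and 1.0) receives label 0.0, while adjacent positions such as 0.75 and 1.0 receive 0.75. The label therefore measures how close two items sit on the gradient, not whether either one is a synonym of the source word.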
