mjbommar/ogbert-v1-contrastive
Source: Hugging Face. Updated 2025-12-07. Indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/mjbommar/ogbert-v1-contrastive
---
language:
- en
license: cc-by-4.0
task_categories:
- sentence-similarity
- text-retrieval
task_ids:
- semantic-similarity-scoring
pretty_name: OGBert Contrastive Dataset
size_categories:
- 1M<n<10M
tags:
- contrastive-learning
- semantic-similarity
- word-embeddings
- modernbert
- lexical
- retrieval
---
# OGBert Contrastive Combined Dataset
This dataset combines contrastive learning signals from two sources:
1. **Contrastive Examples**: Gradient-based semantic similarity pairs
2. **Definitions**: Word-level semantic relationships (synonyms, antonyms, definitions, etc.)
## Dataset Statistics
- **Total pairs**: 9,358,022
- **Training pairs**: 8,890,120
- **Evaluation pairs**: 467,902
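
As a quick arithmetic check on the figures above, the two splits add up to the stated total, and the evaluation split works out to roughly 5% of all pairs. The variable names are illustrative only and are not part of the dataset.

```python
# Figures copied from the statistics listed above.
total_pairs = 9_358_022
train_pairs = 8_890_120
eval_pairs = 467_902

# The two splits should account for every pair.
assert train_pairs + eval_pairs == total_pairs

# Share of all pairs held out for evaluation.
eval_fraction = eval_pairs / total_pairs
print(f"eval fraction: {eval_fraction:.2%}")
```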
### Breakdown by Source
**Contrastive Dataset** (500,000 pairs):
- Gradient signals: All C(5,2) = 10 pairwise combinations of semantic gradient positions
- Positions: 0.0 (antonym), 0.25 (near-antonym), 0.5 (neutral), 0.75 (near-synonym), 1.0 (synonym)
**Definitions Dataset** (8,858,022 pairs):
- Word ↔ Definition: 8,886
- Word ↔ Examples: 959,203
- Definition ↔ Examples: 959,135
- Word ↔ Synonyms: 1,416,611
- Word ↔ Antonyms: 965,276
- Word ↔ Hypernyms: 947,799
- Word ↔ Hyponyms: 1,246,194
## Schema
Each row contains:
- `source_id` (string): Source record identifier for gradient pairs (keeps related pairs together in batches)
- `text_a` (string): First text in the pair
- `text_b` (string): Second text in the pair
- `label` (float): Continuous similarity score in the range [0.0, 1.0]
  - 0.0 = maximum dissimilarity (antonyms)
  - 1.0 = maximum similarity (synonyms/identical)
- `weight` (float): Training importance weight (always 1.0; all examples are equally weighted)
- `signal_type` (string): Source signal type (e.g., "word_synonym", "gradient_0.00_0.75")
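
To work with gradient rows programmatically, it can help to split `signal_type` into its parts. The sketch below assumes that gradient signal types follow the `gradient_<pos_a>_<pos_b>` pattern suggested by the single example above, and that the two numbers are the gradient positions of the pair. The helper `parse_gradient_signal` is hypothetical and not part of the dataset or any library.

```python
from typing import Optional, Tuple


def parse_gradient_signal(signal_type: str) -> Optional[Tuple[float, float]]:
    """Return (pos_a, pos_b) for a gradient signal type, or None otherwise.

    Assumes the "gradient_<pos_a>_<pos_b>" pattern seen in the schema example.
    """
    prefix = "gradient_"
    if not signal_type.startswith(prefix):
        return None
    parts = signal_type[len(prefix):].split("_")
    if len(parts) != 2:
        return None
    try:
        return float(parts[0]), float(parts[1])
    except ValueError:
        # The suffix did not contain two numbers, so treat it as non-gradient.
        return None


print(parse_gradient_signal("gradient_0.00_0.75"))  # (0.0, 0.75)
print(parse_gradient_signal("word_synonym"))        # None
```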
**Important for CoSENT Loss**: All gradient pairs derived from the same source record (`signal_type` starting with "gradient_") share one `source_id`.
The dataset is sorted by `source_id`, so the 10 pairwise combinations from each source record sit next to each other.
This lets CoSENT and other ranking losses, which compare pairwise similarities within a batch, see related pairs together.
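
One way to preserve this property at training time is to build batches that never split a run of rows with the same `source_id`. The following is a minimal sketch under that assumption, not the author's batching code. The function `batches_by_source`, the `max_batch_size` parameter, and the toy rows are hypothetical, and how `source_id` is populated for non-gradient rows is not stated in this card.

```python
from itertools import groupby
from typing import Dict, Iterable, Iterator, List

Row = Dict[str, object]


def batches_by_source(rows: Iterable[Row], max_batch_size: int) -> Iterator[List[Row]]:
    """Yield batches that never split a run of rows sharing a source_id.

    Assumes rows arrive sorted by source_id. A group larger than
    max_batch_size is emitted as its own oversized batch.
    """
    batch: List[Row] = []
    for _, group in groupby(rows, key=lambda r: r["source_id"]):
        group_rows = list(group)
        # Flush the current batch if adding this group would overflow it.
        if batch and len(batch) + len(group_rows) > max_batch_size:
            yield batch
            batch = []
        batch.extend(group_rows)
    if batch:
        yield batch


# Toy rows: two source records with three pairs each.
rows = [
    {"source_id": sid, "text_a": "a", "text_b": "b", "label": 0.5}
    for sid in ("s1", "s1", "s1", "s2", "s2", "s2")
]
sizes = [len(b) for b in batches_by_source(rows, max_batch_size=4)]
print(sizes)  # [3, 3]
```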
## Label Semantics
The `label` field represents **semantic similarity**:
- **0.0**: Opposite meanings (antonyms)
- **0.65**: Hierarchical relationship (hypernym/hyponym)
- **0.75**: Contextual similarity (word in example)
- **0.85**: Definition grounded in example
- **0.9**: Near-synonyms
- **1.0**: Perfect semantic equivalence
Gradient pairs have labels computed as `1.0 - |pos_a - pos_b|` where positions are on the 0.0-1.0 semantic scale.
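
The formula can be checked by enumerating every pair of the five gradient positions. This is an illustrative sketch; `gradient_label` is a hypothetical helper name, and the positions are the five values listed in the source breakdown.

```python
from itertools import combinations

# The five semantic gradient positions, from antonym (0.0) to synonym (1.0).
positions = [0.0, 0.25, 0.5, 0.75, 1.0]


def gradient_label(pos_a: float, pos_b: float) -> float:
    """Similarity label for two gradient positions: 1.0 - |pos_a - pos_b|."""
    return 1.0 - abs(pos_a - pos_b)


# All C(5,2) = 10 unordered pairs of distinct positions.
pairs = list(combinations(positions, 2))
labels = {(a, b): gradient_label(a, b) for a, b in pairs}

for (a, b), label in labels.items():
    print(f"{a:.2f} vs {b:.2f} -> label {label:.2f}")
```

Under this formula the ten pairs take only the labels 0.0, 0.25, 0.5, and 0.75: adjacent positions score 0.75, and the antonym/synonym pair scores 0.0.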
## Usage
### Training a contrastive model
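```python
from datasets import load_dataset
from torch.utils.data import DataLoader

# Load the dataset (repository ID as shown on this dataset page)
dataset = load_dataset("mjbommar/ogbert-v1-contrastive")
train_data = dataset["train"]

# Keep shuffle=False so rows sharing a source_id stay adjacent.
# The batch size is an arbitrary example value.
train_loader = DataLoader(train_data, batch_size=64, shuffle=False)

# Use with your contrastive learning model.
# Labels are continuous similarity scores, so a regression loss such as MSE applies.
```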
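
This card mentions both MSE on the continuous labels and CoSENT-style ranking losses. The dependency-free sketch below illustrates the CoSENT formulation on one batch of predicted cosine similarities. It is not the author's training code: `cosent_loss` is a hypothetical helper, and the default scale of 20 is an assumption taken from common CoSENT implementations.

```python
import math
from typing import Sequence


def cosent_loss(cos_sims: Sequence[float], labels: Sequence[float], scale: float = 20.0) -> float:
    """CoSENT-style ranking loss over one batch of pairs.

    For every two pairs where the first has a higher label than the second,
    the loss grows when the second pair's predicted cosine is not lower.
    Computes log(1 + sum(exp(scale * (cos_lo - cos_hi)))).
    """
    terms = [0.0]  # exp(0) = 1 supplies the "1 +" inside the log
    for cos_hi, label_hi in zip(cos_sims, labels):
        for cos_lo, label_lo in zip(cos_sims, labels):
            if label_hi > label_lo:
                terms.append(scale * (cos_lo - cos_hi))
    # Numerically stable log-sum-exp.
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))


labels = [0.0, 0.5, 1.0]
well_ordered = [-0.8, 0.1, 0.9]   # predictions ranked like the labels
mis_ordered = [0.9, 0.1, -0.8]    # predictions ranked in reverse
print(cosent_loss(well_ordered, labels) < cosent_loss(mis_ordered, labels))  # True
```

Because the loss only compares pairs within the same batch, it benefits from batches in which related pairs with different labels appear together.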
### Filtering by signal type
```python
# Get only synonym pairs
synonyms = dataset["train"].filter(lambda x: x["signal_type"] == "word_synonym")
# Get only gradient pairs
gradients = dataset["train"].filter(lambda x: x["signal_type"].startswith("gradient_"))
# Get strong positives (similarity > 0.8)
strong_pos = dataset["train"].filter(lambda x: x["label"] > 0.8)
```
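
With several million rows, a per-row lambda may be slow. The `filter` method in the `datasets` library also accepts `batched=True`, in which case the function receives a dictionary of column lists and returns one boolean per row. The sketch below shows such a predicate on an in-memory sample; `is_strong_positive` and `sample_batch` are hypothetical names, and any speed-up is not measured here.

```python
from typing import Dict, List


def is_strong_positive(batch: Dict[str, List], threshold: float = 0.8) -> List[bool]:
    """Batched predicate: one boolean per row, True when label > threshold."""
    return [label > threshold for label in batch["label"]]


# Hypothetical usage with the `dataset` object loaded earlier:
# strong_pos = dataset["train"].filter(is_strong_positive, batched=True)

sample_batch = {"label": [0.0, 0.65, 0.85, 1.0]}
print(is_strong_positive(sample_batch))  # [False, False, True, True]
```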
## Source Datasets
- [mjbommar/opengloss-v1.1-contrastive-examples](https://huggingface.co/datasets/mjbommar/opengloss-v1.1-contrastive-examples)
- [mjbommar/opengloss-v1.1-definitions](https://huggingface.co/datasets/mjbommar/opengloss-v1.1-definitions)
## License
CC BY 4.0, the same as the source datasets (OpenGloss project).
## Citation
If you use this dataset, please cite the original OpenGloss project and datasets.
Provider:
mjbommar



