mjbommar/opengloss-v1.2-hard-negative-pairs
Source: Hugging Face. Updated 2026-04-08. Indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/mjbommar/opengloss-v1.2-hard-negative-pairs
Resource description:
---
license: cc-by-4.0
task_categories:
- sentence-similarity
- feature-extraction
- text-classification
language:
- en
tags:
- opengloss
- embeddings
- hard-negatives
- calibration
- retrieval
- semantic-similarity
size_categories:
- 10K<n<100K
---
# OpenGloss Hard Negative Pairs v1.2
This dataset contains scored text pairs for calibrating embedding models: positive pairs (labels 0.80 and 1.00) and low-label hard negatives (labels 0.20 and 0.35).
It is designed to reduce over-scoring of related-but-wrong matches and to improve score separation in weak domains.
## Dataset Summary
- Total records: **73,244**
- Unique lexemes: **11,522**
## Relation Distribution
| Relation Type | Count |
|---|---:|
| same_domain_wrong_entity | 27,216 |
| near_fact_confusion | 22,095 |
| style_variant | 11,519 |
| true_match | 11,517 |
| sibling_concept | 897 |
## Domain Distribution
| Domain | Count |
|---|---:|
| geography | 23,619 |
| history | 18,628 |
| art | 7,431 |
| civics | 4,879 |
| biology | 4,758 |
| religion | 4,678 |
| general_academic | 2,080 |
| law | 2,070 |
| education | 2,050 |
| linguistics | 2,016 |
| literature | 605 |
| anthropology | 195 |
| chemistry | 145 |
| technology | 70 |
| astronomy | 6 |
| psychiatry | 4 |
| biology_and_medicine | 2 |
| government | 2 |
| medicine | 2 |
| physics | 2 |
| technology_and_neuroscience | 2 |
## Difficulty Distribution
| Difficulty | Count |
|---|---:|
| medium | 38,735 |
| hard | 22,992 |
| easy | 11,517 |
## Label Distribution
| Label | Count |
|---|---:|
| 0.20 | 27,216 |
| 0.35 | 22,992 |
| 0.80 | 11,519 |
| 1.00 | 11,517 |
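
The relation, difficulty, and label tables can be cross-checked against the stated total of 73,244 records. The sketch below copies the counts from this card and checks that each table sums to that total. The relation-to-label mapping it uses is an assumption inferred only from matching counts (for example, 22,095 + 897 = 22,992); the card does not state it, and the function names are illustrative.

```python
# Counts copied from the tables in this card.
TOTAL = 73_244

relation_counts = {
    "same_domain_wrong_entity": 27_216,
    "near_fact_confusion": 22_095,
    "style_variant": 11_519,
    "true_match": 11_517,
    "sibling_concept": 897,
}
difficulty_counts = {"medium": 38_735, "hard": 22_992, "easy": 11_517}
label_counts = {0.20: 27_216, 0.35: 22_992, 0.80: 11_519, 1.00: 11_517}

# ASSUMPTION: inferred from matching counts, not documented by the card.
assumed_label_for_relation = {
    "same_domain_wrong_entity": 0.20,
    "near_fact_confusion": 0.35,
    "sibling_concept": 0.35,
    "style_variant": 0.80,
    "true_match": 1.00,
}


def tables_sum_to_total():
    """Return True if every table above adds up to the stated total."""
    tables = (relation_counts, difficulty_counts, label_counts)
    return all(sum(table.values()) == TOTAL for table in tables)


def counts_by_assumed_label():
    """Aggregate relation counts under the assumed relation-to-label mapping."""
    totals = {}
    for relation, count in relation_counts.items():
        label = assumed_label_for_relation[relation]
        totals[label] = totals.get(label, 0) + count
    return totals
```

If the assumed mapping holds, `counts_by_assumed_label()` reproduces the label table exactly; confirming it against the actual records is still advisable before relying on it.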
## Recommended Use
- calibration supervision for embedding models
- reducing over-scoring of sibling concepts and same-domain wrong entities
- improving low-label score behavior in the 0.20 to 0.35 range
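
One possible way to use the labels as calibration supervision is to regress the cosine similarity of the two text embeddings onto `label`. The sketch below is a minimal illustration using plain Python lists in place of real embeddings. The function names and the choice of mean squared error are assumptions for illustration, not a training recipe documented by this dataset.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def calibration_mse(pairs):
    """Mean squared error between cosine similarity and the target label.

    `pairs` is a sequence of (embedding_a, embedding_b, label) tuples, where
    the embeddings stand in for encoded `text_a` and `text_b`.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("pairs must not be empty")
    errors = [(cosine_similarity(a, b) - label) ** 2 for a, b, label in pairs]
    return sum(errors) / len(errors)
```

Under this objective, a same-domain wrong-entity pair that a model scores at 0.9 against a label of 0.20 contributes a squared error of 0.49, which is the kind of over-scoring the low-label pairs are meant to penalize.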
## Record Schema
Each record includes:
- `id`
- `text_a`
- `text_b`
- `label`
- `relation_type`
- `domain`
- `difficulty`
- `anchor_lexeme`
- `candidate_lexeme`
- `lexeme_id_a`
- `lexeme_id_b`
- optional `source`
- optional `notes`
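
The field list above can be turned into a lightweight record check. The following is a sketch only: the card does not document field types, so the check assumes `label` is numeric and takes one of the four values from the label table, and `validate_record` is a hypothetical helper name.

```python
# Field names taken from the schema list in this card.
REQUIRED_FIELDS = (
    "id", "text_a", "text_b", "label", "relation_type", "domain",
    "difficulty", "anchor_lexeme", "candidate_lexeme",
    "lexeme_id_a", "lexeme_id_b",
)
OPTIONAL_FIELDS = ("source", "notes")

# ASSUMPTION: labels are numeric and limited to the values in the label table.
KNOWN_LABELS = (0.20, 0.35, 0.80, 1.00)


def validate_record(record):
    """Return a list of problems found in `record`; an empty list means none."""
    problems = [
        f"missing field: {name}" for name in REQUIRED_FIELDS if name not in record
    ]
    unexpected = set(record) - set(REQUIRED_FIELDS) - set(OPTIONAL_FIELDS)
    problems += [f"unexpected field: {name}" for name in sorted(unexpected)]

    if "label" in record:
        label = record["label"]
        if isinstance(label, bool) or not isinstance(label, (int, float)):
            problems.append("label is not numeric")
        elif not any(abs(label - known) < 1e-6 for known in KNOWN_LABELS):
            problems.append(f"label outside known values: {label}")
    return problems
```

Rows read with the Hugging Face `datasets` library are plain dicts when iterated, so each row could be passed to this helper directly.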
## License
This dataset is released under **Creative Commons Attribution 4.0 International (CC-BY 4.0)**.
## Related Datasets
- [OpenGloss v1.2 Dictionary](https://huggingface.co/datasets/mjbommar/opengloss-v1.2-dictionary)
- [OpenGloss v1.2 Definitions](https://huggingface.co/datasets/mjbommar/opengloss-v1.2-definitions)
- [OpenGloss v1.2 Query Examples](https://huggingface.co/datasets/mjbommar/opengloss-v1.2-query-examples)
- [OpenGloss v1.2 Contrastive Examples](https://huggingface.co/datasets/mjbommar/opengloss-v1.2-contrastive-examples)
---
*Generated from the OpenGloss v1.2 lexicon.*
Provider:
mjbommar



