
mjbommar/opengloss-v1.2-hard-negative-pairs

Hugging Face, updated 2026-04-08, indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/mjbommar/opengloss-v1.2-hard-negative-pairs
Resource description:
---
license: cc-by-4.0
task_categories:
- sentence-similarity
- feature-extraction
- text-classification
language:
- en
tags:
- opengloss
- embeddings
- hard-negatives
- calibration
- retrieval
- semantic-similarity
size_categories:
- 10K<n<100K
---

# OpenGloss Hard Negative Pairs v1.2

This dataset contains calibration-oriented positive and low-label similarity pairs for embedding training. It is designed to reduce over-scoring of related-but-wrong matches and improve score separation in weak domains.

## Dataset Summary

- Total records: **73,244**
- Unique lexemes: **11,522**

## Relation Distribution

| Relation Type | Count |
|---|---:|
| same_domain_wrong_entity | 27,216 |
| near_fact_confusion | 22,095 |
| style_variant | 11,519 |
| true_match | 11,517 |
| sibling_concept | 897 |

## Domain Distribution

| Domain | Count |
|---|---:|
| geography | 23,619 |
| history | 18,628 |
| art | 7,431 |
| civics | 4,879 |
| biology | 4,758 |
| religion | 4,678 |
| general_academic | 2,080 |
| law | 2,070 |
| education | 2,050 |
| linguistics | 2,016 |
| literature | 605 |
| anthropology | 195 |
| chemistry | 145 |
| technology | 70 |
| astronomy | 6 |
| psychiatry | 4 |
| biology_and_medicine | 2 |
| government | 2 |
| medicine | 2 |
| physics | 2 |
| technology_and_neuroscience | 2 |

## Difficulty Distribution

| Difficulty | Count |
|---|---:|
| medium | 38,735 |
| hard | 22,992 |
| easy | 11,517 |

## Label Distribution

| Label | Count |
|---|---:|
| 0.20 | 27,216 |
| 0.35 | 22,992 |
| 0.80 | 11,519 |
| 1.00 | 11,517 |

## Recommended Use

- calibration supervision for embedding models
- reducing over-scoring of sibling concepts and same-domain wrong entities
- improving low-label score behavior in the 0.20 to 0.35 range

## Record Schema

Each record includes:

- `id`
- `text_a`
- `text_b`
- `label`
- `relation_type`
- `domain`
- `difficulty`
- `anchor_lexeme`
- `candidate_lexeme`
- `lexeme_id_a`
- `lexeme_id_b`
- optional `source`
- optional `notes`

## License

This dataset is released under **Creative Commons Attribution 4.0 International (CC-BY 4.0)**.

## Related Datasets

- [OpenGloss v1.2 Dictionary](https://huggingface.co/datasets/mjbommar/opengloss-v1.2-dictionary)
- [OpenGloss v1.2 Definitions](https://huggingface.co/datasets/mjbommar/opengloss-v1.2-definitions)
- [OpenGloss v1.2 Query Examples](https://huggingface.co/datasets/mjbommar/opengloss-v1.2-query-examples)
- [OpenGloss v1.2 Contrastive Examples](https://huggingface.co/datasets/mjbommar/opengloss-v1.2-contrastive-examples)

---

*Generated from the OpenGloss v1.2 lexicon.*
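
One way to use the `label` field for the calibration goal described in this card is to compare a model's similarity scores against the target label for each label value. The following is a minimal sketch and is not part of the dataset card: `PairRecord`, `mean_score_by_label`, `overscore_gap`, the sample texts, and the scores are all hypothetical. The pairing of relation types with labels in the samples is an assumption inferred from matching counts in the tables (for example, 27,216 for both `same_domain_wrong_entity` and label 0.20).

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class PairRecord:
    """Subset of the card's record schema needed for a calibration check."""
    id: str
    text_a: str
    text_b: str
    label: float
    relation_type: str


def mean_score_by_label(records, scores):
    """Group model scores by target label and return {label: mean score}."""
    buckets = defaultdict(list)
    for record in records:
        buckets[record.label].append(scores[record.id])
    return {label: mean(values) for label, values in sorted(buckets.items())}


def overscore_gap(records, scores):
    """Return {label: mean score - label}; positive values mean over-scoring."""
    by_label = mean_score_by_label(records, scores)
    return {label: avg - label for label, avg in by_label.items()}


# Hypothetical records; real rows come from the dataset itself.
records = [
    PairRecord("r1", "Lyon is a city in France.",
               "Lyon is a city in France.", 1.00, "true_match"),
    PairRecord("r2", "Lyon is a city in France.",
               "Nice is a city in France.", 0.20, "same_domain_wrong_entity"),
    PairRecord("r3", "Lyon is a city in France.",
               "Turin is a city in Italy.", 0.20, "same_domain_wrong_entity"),
    PairRecord("r4", "Lyon is a city in France.",
               "Lyon is the capital of France.", 0.35, "near_fact_confusion"),
]

# Hypothetical cosine similarities produced by some embedding model.
scores = {"r1": 0.98, "r2": 0.64, "r3": 0.66, "r4": 0.71}

for label, gap in overscore_gap(records, scores).items():
    print(f"label={label:.2f} gap={gap:+.2f}")
```

A large positive gap at labels 0.20 and 0.35 would indicate the over-scoring of related-but-wrong pairs that this dataset is meant to reduce.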
Provider:
mjbommar