mjbommar/opengloss-v1.3-hard-negative-pairs
Source: Hugging Face | Updated: 2026-04-12 | Indexed: 2026-04-26
Download link:
https://hf-mirror.com/datasets/mjbommar/opengloss-v1.3-hard-negative-pairs
Resource description:
---
license: cc-by-4.0
task_categories:
- sentence-similarity
- feature-extraction
- text-classification
language:
- en
tags:
- opengloss
- embeddings
- hard-negatives
- calibration
- retrieval
- semantic-similarity
size_categories:
- 1M<n<10M
---
# OpenGloss Hard Negative Pairs v1.3
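This dataset contains calibration-oriented similarity pairs for embedding training: positive pairs (labels 0.80 and 1.00) and low-label pairs (labels 0.20 and 0.35).
It is designed to reduce over-scoring of related-but-wrong matches and to improve score separation in domains where a model separates matches from near-misses poorly.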
## Dataset Summary
- Total records: **1,131,241**
- Unique lexemes: **205,967**
## Relation Distribution
| Relation Type | Count |
|---|---:|
| same_domain_wrong_entity | 566,913 |
| style_variant | 205,963 |
| true_match | 205,956 |
| near_fact_confusion | 137,232 |
| sibling_concept | 15,177 |
## Domain Distribution
| Domain | Count |
|---|---:|
| general | 737,516 |
| history | 145,655 |
| geography | 144,860 |
| linguistics | 35,517 |
| art | 10,134 |
| religion | 9,736 |
| science | 7,550 |
| language | 7,265 |
| education | 6,258 |
| technology | 4,830 |
| civics | 3,885 |
| anthropology | 2,790 |
| life-sciences | 2,350 |
| society | 2,065 |
| mathematics | 1,590 |
| law | 1,440 |
| economics | 1,124 |
| arts | 1,025 |
| biology | 782 |
| literature | 690 |
| philosophy | 610 |
| food | 445 |
| sports | 420 |
| medicine | 375 |
| chemistry | 310 |
| zoology | 290 |
| pharmacology | 170 |
| politics | 160 |
| botany | 110 |
| anatomy | 105 |
| physics | 65 |
| music | 60 |
| mineralogy | 55 |
| entomology | 50 |
| biochemistry | 45 |
| taxonomy | 45 |
| astronomy | 40 |
| materials_science | 40 |
| psychology | 35 |
| genetics | 30 |
| geology | 30 |
| molecular_biology | 30 |
| architecture | 25 |
| ophthalmology | 25 |
| ornithology | 25 |
| computing | 20 |
| construction | 20 |
| ichthyology | 20 |
| life_sciences | 20 |
| mycology | 20 |
| neuroscience | 20 |
| textiles | 20 |
| clothing | 12 |
| agriculture | 6 |
| archaeology | 6 |
| biography | 6 |
| bullfighting | 6 |
| construction_engineering | 6 |
| ecology | 6 |
| economics_and_accounting | 6 |
| endocrinology | 6 |
| fantasy | 6 |
| food_and_drink | 6 |
| history_and_literature | 6 |
| history_and_politics | 6 |
| informal_speech | 6 |
| law_/_civics | 6 |
| legal_and_financial | 6 |
| meteorology | 6 |
| military | 6 |
| mythology | 6 |
| onomastics | 6 |
| politics_and_law | 6 |
| publishing | 6 |
| rhetoric | 6 |
| transport | 6 |
| transportation | 6 |
| typography | 6 |
| psychiatry | 4 |
| accounting_and_documentation | 2 |
| acoustics | 2 |
| administrative_communication | 2 |
| aeronautics | 2 |
| animal_husbandry | 2 |
| architecture,_housing | 2 |
| art_and_literature | 2 |
| art_and_media | 2 |
| art_history | 2 |
| aviation | 2 |
| aviation_security | 2 |
| biology_and_medicine | 2 |
| biology_genetics | 2 |
| business | 2 |
| business,_economics,_management | 2 |
| cardiology | 2 |
| cartography | 2 |
| ceramics | 2 |
| chronology | 2 |
| classical_studies | 2 |
| color | 2 |
| comics | 2 |
| communication | 2 |
| crafts | 2 |
| crafts_and_restoration | 2 |
| cryptography | 2 |
| culinary | 2 |
| dance | 2 |
| domestic_service | 2 |
| earth_science | 2 |
| ecology_and_zoology | 2 |
| economics,_operations_management | 2 |
| economics_and_law | 2 |
| electrical_engineering | 2 |
| electronics | 2 |
| embryology | 2 |
| environmental_science | 2 |
| equine | 2 |
| fashion | 2 |
| film_and_media | 2 |
| finance | 2 |
| finance_technology | 2 |
| food_and_language | 2 |
| food_preparation | 2 |
| food_science | 2 |
| forestry | 2 |
| funerary | 2 |
| furniture_and_restoration | 2 |
| government | 2 |
| graph_theory | 2 |
| heraldry | 2 |
| historical_administration | 2 |
| historical_and_honorific_usage | 2 |
| historical_and_institutional | 2 |
| historical_architecture | 2 |
| historical_social_welfare | 2 |
| historical_transport | 2 |
| historical_weapons | 2 |
| history_and_classical_mythology | 2 |
| history_and_geography | 2 |
| history_and_philosophy | 2 |
| human-computer_interaction,_interface_design | 2 |
| kinship | 2 |
| language_and_general | 2 |
| language_and_printing | 2 |
| language_and_visual_perception | 2 |
| law,_finance,_administration | 2 |
| law,_politics | 2 |
| law/politics | 2 |
| law_and_civics | 2 |
| law_and_politics | 2 |
| law_enforcement | 2 |
| legal_and_general | 2 |
| legal_and_general_use | 2 |
| legal_and_political | 2 |
| limnology | 2 |
| literary | 2 |
| literary_studies | 2 |
| literature_and_film | 2 |
| manufacturing | 2 |
| marketing,_business | 2 |
| mathematics_and_computer_science | 2 |
| mathematics_and_physics | 2 |
| mathematics_and_statistics | 2 |
| medical_imaging | 2 |
| metallurgy | 2 |
| military_drill | 2 |
| music_and_speech | 2 |
| music_history | 2 |
| music_theory | 2 |
| mythology_and_literature | 2 |
| nautical | 2 |
| nautical_engineering | 2 |
| neuroanatomy | 2 |
| neuropsychology | 2 |
| neuroscience_and_ophthalmology | 2 |
| nonequilibrium_thermodynamics | 2 |
| nutrition,_behavioral_science | 2 |
| occult | 2 |
| occult_studies | 2 |
| occupation | 2 |
| oenology | 2 |
| onoma | 2 |
| organic_chemistry | 2 |
| otolaryngology | 2 |
| ottoman_studies | 2 |
| packaging | 2 |
| paleontology | 2 |
| particle_physics | 2 |
| petroleum_refining | 2 |
| phonetics | 2 |
| photometry | 2 |
| physical_geography | 2 |
| physics_and_materials_science | 2 |
| physics_chemistry | 2 |
| physiology | 2 |
| poetry | 2 |
| political_geography | 2 |
| political_vocabulary | 2 |
| postal_services | 2 |
| prosody | 2 |
| religion_and_historical_language | 2 |
| science_and_engineering | 2 |
| sleep_disorders | 2 |
| sleep_medicine | 2 |
| sports_betting | 2 |
| statistics | 2 |
| symbolism | 2 |
| technology_and_media | 2 |
| technology_and_military | 2 |
| technology_and_neuroscience | 2 |
| textile | 2 |
| theatre_theory | 2 |
| transportation_engineering | 2 |
| travel | 2 |
| urban_planning | 2 |
| video_game_history | 2 |
| video_games | 2 |
| woodworking | 2 |
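The domain labels above are free-form, and several appear to be separator variants of one another (for example `life-sciences` and `life_sciences`, or `law/politics` and `law,_politics`). A minimal sketch of one way to fold such variants together before computing per-domain statistics follows. The `normalize_domain` and `merge_domain_counts` helpers and their rules are illustrative assumptions, not part of the dataset, and they deliberately do not merge semantic near-duplicates such as `art` and `arts`.

```python
import re
from collections import Counter


def normalize_domain(label: str) -> str:
    """Fold separator variants into one key (hypothetical rule set)."""
    key = label.strip().lower()
    # Treat whitespace, ",", "/" and "-" as equivalent to "_".
    key = re.sub(r"[\s,/\-]+", "_", key)
    # Collapse runs of underscores, e.g. "law_/_civics" -> "law_civics".
    return re.sub(r"_+", "_", key).strip("_")


def merge_domain_counts(counts: dict[str, int]) -> Counter:
    """Sum counts whose labels normalize to the same key."""
    merged: Counter = Counter()
    for label, n in counts.items():
        merged[normalize_domain(label)] += n
    return merged


# A small subset of rows copied from the table above.
sample = {
    "life-sciences": 2350,
    "life_sciences": 20,
    "law_/_civics": 6,
    "law/politics": 2,
    "law,_politics": 2,
}
merged = merge_domain_counts(sample)
```

Whether such merging is appropriate depends on the downstream use; keeping the raw `domain` value alongside the normalized key preserves the original labels.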
## Difficulty Distribution
| Difficulty | Count |
|---|---:|
| medium | 772,876 |
| easy | 205,956 |
| hard | 152,409 |
## Label Distribution
| Label | Count |
|---|---:|
| 0.20 | 566,913 |
| 0.35 | 152,409 |
| 0.80 | 205,963 |
| 1.00 | 205,956 |
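The counts in the relation, difficulty, and label tables are mutually consistent, and they suggest (without this card stating it) that each relation type carries one fixed label and one fixed difficulty. The sketch below checks that arithmetic. The `INFERRED` mapping is an assumption derived only from matching counts and should be verified against the actual records.

```python
# Counts copied from the tables in this card.
RELATION_COUNTS = {
    "same_domain_wrong_entity": 566_913,
    "style_variant": 205_963,
    "true_match": 205_956,
    "near_fact_confusion": 137_232,
    "sibling_concept": 15_177,
}
LABEL_COUNTS = {0.20: 566_913, 0.35: 152_409, 0.80: 205_963, 1.00: 205_956}
DIFFICULTY_COUNTS = {"medium": 772_876, "easy": 205_956, "hard": 152_409}
TOTAL_RECORDS = 1_131_241

# ASSUMPTION: relation -> (label, difficulty), inferred from matching counts only.
INFERRED = {
    "same_domain_wrong_entity": (0.20, "medium"),
    "style_variant": (0.80, "medium"),
    "true_match": (1.00, "easy"),
    "near_fact_confusion": (0.35, "hard"),
    "sibling_concept": (0.35, "hard"),
}


def rollup(index: int) -> dict:
    """Sum relation counts by label (index 0) or difficulty (index 1)."""
    out = {}
    for relation, count in RELATION_COUNTS.items():
        key = INFERRED[relation][index]
        out[key] = out.get(key, 0) + count
    return out


labels_ok = rollup(0) == LABEL_COUNTS
difficulty_ok = rollup(1) == DIFFICULTY_COUNTS
totals_ok = sum(RELATION_COUNTS.values()) == TOTAL_RECORDS
```

Under this inferred mapping, the 0.35 label groups `near_fact_confusion` with `sibling_concept`, and the `medium` difficulty groups `same_domain_wrong_entity` with `style_variant`.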
## Recommended Use
- calibration supervision for embedding models
- reducing over-scoring of sibling concepts and same-domain wrong entities
- improving low-label score behavior in the 0.20 to 0.35 range
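One way to use the labels as calibration targets is to regress a model's similarity score onto `label` and to track, per `relation_type`, how far scores sit above their targets. The following is a minimal sketch under that assumption. The card does not prescribe a loss, and `cosine`, `calibration_mse`, and `overscore_by_relation` are hypothetical helpers that operate on plain Python lists rather than a real embedding model.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def calibration_mse(scores, labels):
    """Mean squared error between predicted scores and target labels."""
    return sum((s - y) ** 2 for s, y in zip(scores, labels)) / len(labels)


def overscore_by_relation(records, scores):
    """Mean (score - label) per relation_type; positive means over-scoring."""
    gaps = defaultdict(list)
    for rec, s in zip(records, scores):
        gaps[rec["relation_type"]].append(s - rec["label"])
    return {rel: sum(g) / len(g) for rel, g in gaps.items()}


# Toy records; the 2-d vectors are made-up stand-ins for embeddings
# of text_a and text_b.
records = [
    {"relation_type": "true_match", "label": 1.00},
    {"relation_type": "same_domain_wrong_entity", "label": 0.20},
]
pairs = [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [0.6, 0.8])]
scores = [cosine(a, b) for a, b in pairs]
report = overscore_by_relation(records, scores)
loss = calibration_mse(scores, [r["label"] for r in records])
```

In this toy case the wrong-entity pair scores well above its 0.20 target, which is the kind of gap the low-label pairs are meant to expose.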
## Record Schema
Each record includes:
- `id`
- `text_a`
- `text_b`
- `label`
- `relation_type`
- `domain`
- `difficulty`
- `anchor_lexeme`
- `candidate_lexeme`
- `lexeme_id_a`
- `lexeme_id_b`
- `source` (optional)
- `notes` (optional)
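The field list above can be mirrored as a typed record for validation when reading the data. This is a sketch: the card names the fields but not their types, so every type annotation below is an assumption, the allowed-value sets are taken from the distribution tables in this card, and the sample row is invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Allowed values taken from the distribution tables in this card.
RELATION_TYPES = {
    "same_domain_wrong_entity", "style_variant", "true_match",
    "near_fact_confusion", "sibling_concept",
}
DIFFICULTIES = {"easy", "medium", "hard"}
LABELS = {0.20, 0.35, 0.80, 1.00}


@dataclass(frozen=True)
class PairRecord:
    # ASSUMPTION: all field types are guesses; the card lists names only.
    id: str
    text_a: str
    text_b: str
    label: float
    relation_type: str
    domain: str
    difficulty: str
    anchor_lexeme: str
    candidate_lexeme: str
    lexeme_id_a: str
    lexeme_id_b: str
    source: Optional[str] = None
    notes: Optional[str] = None

    def problems(self) -> list[str]:
        """Return human-readable issues; an empty list means none found."""
        issues = []
        if self.relation_type not in RELATION_TYPES:
            issues.append(f"unknown relation_type: {self.relation_type!r}")
        if self.difficulty not in DIFFICULTIES:
            issues.append(f"unknown difficulty: {self.difficulty!r}")
        if round(self.label, 2) not in LABELS:
            issues.append(f"unexpected label: {self.label!r}")
        return issues


def parse_record(row: dict) -> PairRecord:
    """Build a PairRecord; raises TypeError on missing or unknown fields."""
    return PairRecord(**row)


# Hypothetical row; none of these values come from the dataset.
row = {
    "id": "example-0", "text_a": "a river in Europe", "text_b": "a river in Asia",
    "label": 0.20, "relation_type": "same_domain_wrong_entity",
    "domain": "geography", "difficulty": "medium",
    "anchor_lexeme": "rhine", "candidate_lexeme": "mekong",
    "lexeme_id_a": "lex-a", "lexeme_id_b": "lex-b",
}
record = parse_record(row)
bad = parse_record({**row, "label": 0.5, "difficulty": "extreme"})
```

Rows obtained through the Hugging Face `datasets` library, for example with `load_dataset("mjbommar/opengloss-v1.3-hard-negative-pairs")`, could be passed through `parse_record` in the same way; the available split names are not stated in this card.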
## License
This dataset is released under **Creative Commons Attribution 4.0 International (CC-BY 4.0)**.
## Related Datasets
- [OpenGloss v1.3 Dictionary](https://huggingface.co/datasets/mjbommar/opengloss-v1.3-dictionary)
- [OpenGloss v1.3 Definitions](https://huggingface.co/datasets/mjbommar/opengloss-v1.3-definitions)
- [OpenGloss v1.3 Query Examples](https://huggingface.co/datasets/mjbommar/opengloss-v1.3-query-examples)
- [OpenGloss v1.3 Contrastive Examples](https://huggingface.co/datasets/mjbommar/opengloss-v1.3-contrastive-examples)
---
*Generated from the OpenGloss v1.3 lexicon.*
Provider: mjbommar



