
mjbommar/opengloss-v1.3-hard-negative-pairs

Hugging Face · Updated 2026-04-12 · Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/mjbommar/opengloss-v1.3-hard-negative-pairs
Resource description:
---
license: cc-by-4.0
task_categories:
- sentence-similarity
- feature-extraction
- text-classification
language:
- en
tags:
- opengloss
- embeddings
- hard-negatives
- calibration
- retrieval
- semantic-similarity
size_categories:
- 10K<n<100K
---

# OpenGloss Hard Negative Pairs v1.3

This dataset contains calibration-oriented positive and low-label similarity pairs for embedding training. It is designed to reduce over-scoring of related-but-wrong matches and improve score separation in weak domains.

## Dataset Summary

- Total records: **1,131,241**
- Unique lexemes: **205,967**

## Relation Distribution

| Relation Type | Count |
|---|---:|
| same_domain_wrong_entity | 566,913 |
| style_variant | 205,963 |
| true_match | 205,956 |
| near_fact_confusion | 137,232 |
| sibling_concept | 15,177 |

## Domain Distribution

| Domain | Count |
|---|---:|
| general | 737,516 |
| history | 145,655 |
| geography | 144,860 |
| linguistics | 35,517 |
| art | 10,134 |
| religion | 9,736 |
| science | 7,550 |
| language | 7,265 |
| education | 6,258 |
| technology | 4,830 |
| civics | 3,885 |
| anthropology | 2,790 |
| life-sciences | 2,350 |
| society | 2,065 |
| mathematics | 1,590 |
| law | 1,440 |
| economics | 1,124 |
| arts | 1,025 |
| biology | 782 |
| literature | 690 |
| philosophy | 610 |
| food | 445 |
| sports | 420 |
| medicine | 375 |
| chemistry | 310 |
| zoology | 290 |
| pharmacology | 170 |
| politics | 160 |
| botany | 110 |
| anatomy | 105 |
| physics | 65 |
| music | 60 |
| mineralogy | 55 |
| entomology | 50 |
| biochemistry | 45 |
| taxonomy | 45 |
| astronomy | 40 |
| materials_science | 40 |
| psychology | 35 |
| genetics | 30 |
| geology | 30 |
| molecular_biology | 30 |
| architecture | 25 |
| ophthalmology | 25 |
| ornithology | 25 |
| computing | 20 |
| construction | 20 |
| ichthyology | 20 |
| life_sciences | 20 |
| mycology | 20 |
| neuroscience | 20 |
| textiles | 20 |
| clothing | 12 |
| agriculture | 6 |
| archaeology | 6 |
| biography | 6 |
| bullfighting | 6 |
| construction_engineering | 6 |
| ecology | 6 |
| economics_and_accounting | 6 |
| endocrinology | 6 |
| fantasy | 6 |
| food_and_drink | 6 |
| history_and_literature | 6 |
| history_and_politics | 6 |
| informal_speech | 6 |
| law_/_civics | 6 |
| legal_and_financial | 6 |
| meteorology | 6 |
| military | 6 |
| mythology | 6 |
| onomastics | 6 |
| politics_and_law | 6 |
| publishing | 6 |
| rhetoric | 6 |
| transport | 6 |
| transportation | 6 |
| typography | 6 |
| psychiatry | 4 |
| accounting_and_documentation | 2 |
| acoustics | 2 |
| administrative_communication | 2 |
| aeronautics | 2 |
| animal_husbandry | 2 |
| architecture,_housing | 2 |
| art_and_literature | 2 |
| art_and_media | 2 |
| art_history | 2 |
| aviation | 2 |
| aviation_security | 2 |
| biology_and_medicine | 2 |
| biology_genetics | 2 |
| business | 2 |
| business,_economics,_management | 2 |
| cardiology | 2 |
| cartography | 2 |
| ceramics | 2 |
| chronology | 2 |
| classical_studies | 2 |
| color | 2 |
| comics | 2 |
| communication | 2 |
| crafts | 2 |
| crafts_and_restoration | 2 |
| cryptography | 2 |
| culinary | 2 |
| dance | 2 |
| domestic_service | 2 |
| earth_science | 2 |
| ecology_and_zoology | 2 |
| economics,_operations_management | 2 |
| economics_and_law | 2 |
| electrical_engineering | 2 |
| electronics | 2 |
| embryology | 2 |
| environmental_science | 2 |
| equine | 2 |
| fashion | 2 |
| film_and_media | 2 |
| finance | 2 |
| finance_technology | 2 |
| food_and_language | 2 |
| food_preparation | 2 |
| food_science | 2 |
| forestry | 2 |
| funerary | 2 |
| furniture_and_restoration | 2 |
| government | 2 |
| graph_theory | 2 |
| heraldry | 2 |
| historical_administration | 2 |
| historical_and_honorific_usage | 2 |
| historical_and_institutional | 2 |
| historical_architecture | 2 |
| historical_social_welfare | 2 |
| historical_transport | 2 |
| historical_weapons | 2 |
| history_and_classical_mythology | 2 |
| history_and_geography | 2 |
| history_and_philosophy | 2 |
| human-computer_interaction,_interface_design | 2 |
| kinship | 2 |
| language_and_general | 2 |
| language_and_printing | 2 |
| language_and_visual_perception | 2 |
| law,_finance,_administration | 2 |
| law,_politics | 2 |
| law/politics | 2 |
| law_and_civics | 2 |
| law_and_politics | 2 |
| law_enforcement | 2 |
| legal_and_general | 2 |
| legal_and_general_use | 2 |
| legal_and_political | 2 |
| limnology | 2 |
| literary | 2 |
| literary_studies | 2 |
| literature_and_film | 2 |
| manufacturing | 2 |
| marketing,_business | 2 |
| mathematics_and_computer_science | 2 |
| mathematics_and_physics | 2 |
| mathematics_and_statistics | 2 |
| medical_imaging | 2 |
| metallurgy | 2 |
| military_drill | 2 |
| music_and_speech | 2 |
| music_history | 2 |
| music_theory | 2 |
| mythology_and_literature | 2 |
| nautical | 2 |
| nautical_engineering | 2 |
| neuroanatomy | 2 |
| neuropsychology | 2 |
| neuroscience_and_ophthalmology | 2 |
| nonequilibrium_thermodynamics | 2 |
| nutrition,_behavioral_science | 2 |
| occult | 2 |
| occult_studies | 2 |
| occupation | 2 |
| oenology | 2 |
| onoma | 2 |
| organic_chemistry | 2 |
| otolaryngology | 2 |
| ottoman_studies | 2 |
| packaging | 2 |
| paleontology | 2 |
| particle_physics | 2 |
| petroleum_refining | 2 |
| phonetics | 2 |
| photometry | 2 |
| physical_geography | 2 |
| physics_and_materials_science | 2 |
| physics_chemistry | 2 |
| physiology | 2 |
| poetry | 2 |
| political_geography | 2 |
| political_vocabulary | 2 |
| postal_services | 2 |
| prosody | 2 |
| religion_and_historical_language | 2 |
| science_and_engineering | 2 |
| sleep_disorders | 2 |
| sleep_medicine | 2 |
| sports_betting | 2 |
| statistics | 2 |
| symbolism | 2 |
| technology_and_media | 2 |
| technology_and_military | 2 |
| technology_and_neuroscience | 2 |
| textile | 2 |
| theatre_theory | 2 |
| transportation_engineering | 2 |
| travel | 2 |
| urban_planning | 2 |
| video_game_history | 2 |
| video_games | 2 |
| woodworking | 2 |

## Difficulty Distribution

| Difficulty | Count |
|---|---:|
| medium | 772,876 |
| easy | 205,956 |
| hard | 152,409 |

## Label Distribution

| Label | Count |
|---|---:|
| 0.20 | 566,913 |
| 0.35 | 152,409 |
| 0.80 | 205,963 |
| 1.00 | 205,956 |

## Recommended Use

- calibration supervision for embedding models
- reducing over-scoring of sibling concepts and same-domain wrong entities
- improving low-label score behavior in the 0.20 to 0.35 range

## Record Schema

Each record includes:

- `id`
- `text_a`
- `text_b`
- `label`
- `relation_type`
- `domain`
- `difficulty`
- `anchor_lexeme`
- `candidate_lexeme`
- `lexeme_id_a`
- `lexeme_id_b`
- optional `source`
- optional `notes`

## License

This dataset is released under **Creative Commons Attribution 4.0 International (CC-BY 4.0)**.

## Related Datasets

- [OpenGloss v1.3 Dictionary](https://huggingface.co/datasets/mjbommar/opengloss-v1.3-dictionary)
- [OpenGloss v1.3 Definitions](https://huggingface.co/datasets/mjbommar/opengloss-v1.3-definitions)
- [OpenGloss v1.3 Query Examples](https://huggingface.co/datasets/mjbommar/opengloss-v1.3-query-examples)
- [OpenGloss v1.3 Contrastive Examples](https://huggingface.co/datasets/mjbommar/opengloss-v1.3-contrastive-examples)

---

*Generated from the OpenGloss v1.3 lexicon.*
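
The record schema and the relation and label counts in the card can be turned into a small sanity check. What follows is a minimal sketch, not part of the dataset's own tooling. The helper names (`validate_record`, `label_counts_from_relations`) and the sample record are hypothetical. The relation-to-label mapping is an assumption inferred only from matching counts in the card (for example, 566,913 records appear both for `same_domain_wrong_entity` and for label 0.20); the card does not state the mapping explicitly.

```python
from typing import Any, Dict, List

# Required fields, as listed in the card's Record Schema section
# (`source` and `notes` are optional and therefore not checked).
REQUIRED_FIELDS = (
    "id", "text_a", "text_b", "label", "relation_type", "domain",
    "difficulty", "anchor_lexeme", "candidate_lexeme",
    "lexeme_id_a", "lexeme_id_b",
)

# Counts copied from the card's Relation and Label Distribution tables.
RELATION_COUNTS = {
    "same_domain_wrong_entity": 566_913,
    "style_variant": 205_963,
    "true_match": 205_956,
    "near_fact_confusion": 137_232,
    "sibling_concept": 15_177,
}
LABEL_COUNTS = {0.20: 566_913, 0.35: 152_409, 0.80: 205_963, 1.00: 205_956}
TOTAL_RECORDS = 1_131_241

# ASSUMPTION: inferred from matching counts, not documented in the card.
ASSUMED_LABELS = {
    "same_domain_wrong_entity": 0.20,
    "near_fact_confusion": 0.35,
    "sibling_concept": 0.35,
    "style_variant": 0.80,
    "true_match": 1.00,
}


def validate_record(record: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in one record (empty if none)."""
    problems = [f"missing field: {name}"
                for name in REQUIRED_FIELDS if name not in record]
    relation = record.get("relation_type")
    label = record.get("label")
    if relation is not None and relation not in ASSUMED_LABELS:
        problems.append(f"unknown relation_type: {relation}")
    elif relation is not None and isinstance(label, (int, float)):
        expected = ASSUMED_LABELS[relation]
        if abs(float(label) - expected) > 1e-6:
            problems.append(
                f"label {label} differs from assumed {expected} for {relation}")
    return problems


def label_counts_from_relations() -> Dict[float, int]:
    """Aggregate relation counts into label counts via the assumed mapping."""
    totals: Dict[float, int] = {}
    for relation, count in RELATION_COUNTS.items():
        label = ASSUMED_LABELS[relation]
        totals[label] = totals.get(label, 0) + count
    return totals


# Hypothetical record for illustration only; values are not from the dataset.
SAMPLE_RECORD = {
    "id": "example-0001",
    "text_a": "A river in western Europe.",
    "text_b": "A mountain range in western Europe.",
    "label": 0.20,
    "relation_type": "same_domain_wrong_entity",
    "domain": "geography",
    "difficulty": "medium",
    "anchor_lexeme": "river",
    "candidate_lexeme": "mountain range",
    "lexeme_id_a": "lex-a",
    "lexeme_id_b": "lex-b",
}

if __name__ == "__main__":
    print(validate_record(SAMPLE_RECORD))
    print(label_counts_from_relations() == LABEL_COUNTS)
```

Under the assumed mapping, the relation counts aggregate exactly to the label counts and both sum to the stated total of 1,131,241 records, which is some evidence for the mapping but does not confirm it.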
Provider:
mjbommar