
bobox/STS_retrieval_dataset_HN_scored

Source: Hugging Face. Updated 2026-03-30. Indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/bobox/STS_retrieval_dataset_HN_scored
Resource description:
---
language:
- en
tags:
- sentence-transformers
- bge-m3
- retrieval
- hard-negative-mining
---

# Processed Dataset with Multi-Vector BGE-M3 Scores

This dataset was built using a custom pipeline to format data for Sentence Transformers training.

## Source Dataset

- **Original Dataset**: `custom-dataset`
- **Split processed**: `train, eval`
- **Anchor Column**: `anchor`
- **Positive Column**: `positive`
- **Negative Column**: `negative`

## Multi-similarity scoring (BAAI/bge-m3)

The dataset includes similarity scores computed using `BAAI/bge-m3`. These scores cover the **Dense**, **Sparse (Lexical)**, and **Multi-vector (ColBERT-style, late interaction)** representations, plus an aggregate score weighted as `0.4 * dense + 0.2 * sparse + 0.4 * colbert`, following the BGE-M3 paper.

- **Scoring Model**: `BAAI/bge-m3`
- **Max Passage Length**: `4000`

### New Features Added:

- `anchor_pos_aggregate_sim_m3`, `anchor_pos_list_sim_m3`

## Hard Negative Mining

- **Bi-Encoder (Mining Model)**: `Snowflake/snowflake-arctic-embed-m-v2.0`
- **Negatives per positive**: `3` (in addition to existing negatives)
- **Relative Margin applied**: `0.07`
- **Absolute Margin applied**: `0.04`
- **range min/max**: `2 - 22`
- **negative score min/max**: `0.15 - 0.8`
- **Sampling strategy**: `random`
- `anchor_neg_aggregate_sim_m3`, `anchor_neg_list_sim_m3`
- `scores`: Bi-Encoder rescoring predictions from the hard-negative mining phase.

### Source Dataset subsets from: [bobox/training-dataset-temp](https://huggingface.co/datasets/bobox/training-dataset-temp)

| Subset / Config Name | Train Samples | Eval Samples |
|:---|---:|---:|
| `eli5-1HN` | 35,000 | 320 |
| `global-dataset-1HN` | 105,000 | 512 |
| `natural-questions-1HN` | 35,000 | 320 |
| `npr-1HN` | 35,000 | 320 |
| `paws-1HN` | 21,829 | 3,539 |
| `qnli-1HN` | 35,000 | 1,850 |
| `sentence_compression-1HN` | 35,000 | 320 |
| `vitaminc-1HN` | 35,000 | 387 |
| `xsum-1HN` | 35,000 | 320 |
| **TOTAL** | **371,829** | **7,888** |
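The aggregate weighting and the mining thresholds listed above can be illustrated with a small sketch. This is not the pipeline that produced the dataset: the function names `aggregate_m3` and `keep_negative` are hypothetical, and the way the margins are combined (a candidate must fall inside the score bounds and sit below the positive by both the absolute and the relative margin) is an assumed interpretation of the card's parameters. The card also does not state how the `*_list_sim_m3` columns are laid out, so the sketch takes the three scores as plain floats.

```python
# Weights quoted in the dataset card for the aggregate BGE-M3 score.
W_DENSE, W_SPARSE, W_COLBERT = 0.4, 0.2, 0.4


def aggregate_m3(dense: float, sparse: float, colbert: float) -> float:
    """Weighted sum of the dense, sparse and ColBERT-style similarities."""
    return W_DENSE * dense + W_SPARSE * sparse + W_COLBERT * colbert


def keep_negative(
    pos_score: float,
    neg_score: float,
    rel_margin: float = 0.07,  # "Relative Margin applied" on the card
    abs_margin: float = 0.04,  # "Absolute Margin applied" on the card
    min_score: float = 0.15,   # "negative score min" on the card
    max_score: float = 0.8,    # "negative score max" on the card
) -> bool:
    """Assumed filter: decide whether a mined candidate is kept as a negative."""
    # Candidate similarity must lie inside the allowed score window.
    if not (min_score <= neg_score <= max_score):
        return False
    # Absolute margin: candidate must trail the positive by a fixed gap.
    if neg_score > pos_score - abs_margin:
        return False
    # Relative margin: candidate must trail the positive proportionally.
    if neg_score > pos_score * (1.0 - rel_margin):
        return False
    return True
```

For example, `aggregate_m3(0.8, 0.5, 0.7)` combines the three scores into a single value of about 0.70, and `keep_negative(0.6, 0.58)` rejects a candidate that scores too close to its positive. The rank window (`2 - 22`) and the random sampling of 3 negatives are left out here because they operate on a ranked candidate list rather than on a single pair.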
Provider:
bobox