bobox/STS_retrieval_dataset_HN_scored
Hugging Face. Updated: 2026-03-30. Indexed: 2026-04-12.
Download link:
https://hf-mirror.com/datasets/bobox/STS_retrieval_dataset_HN_scored
Resource description:
---
language:
- en
tags:
- sentence-transformers
- bge-m3
- retrieval
- hard-negative-mining
---
# Processed Dataset with Multi-Vector BGE-M3 Scores
This dataset was built using a custom pipeline to format data for Sentence Transformers training.
## Source Dataset
- **Original Dataset**: `custom-dataset`
- **Split processed**: `train, eval`
- **Anchor Column**: `anchor`
- **Positive Column**: `positive`
- **Negative Column**: `negative`
## Multi-similarity scoring (BAAI/bge-m3)
The dataset includes similarity scores computed using `BAAI/bge-m3`.
Scores are computed for the **Dense**, **Sparse (Lexical)**, and **Multi-vector (ColBERT-style, late interaction)** representations, plus an aggregate score weighted as `0.4 * dense + 0.2 * sparse + 0.4 * colbert`, following the BGE-M3 paper.
- **Scoring Model**: `BAAI/bge-m3`
- **Max Passage Length**: `4000`
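As a minimal sketch of the weighting stated above, the aggregate can be recomputed from the three component scores. The function name and the example values are illustrative assumptions, not part of the dataset pipeline, and the card does not specify how the component scores are laid out inside the `*_list_sim_m3` columns.

```python
def aggregate_m3_score(dense: float, sparse: float, colbert: float) -> float:
    """Weighted sum from the card: 0.4 * dense + 0.2 * sparse + 0.4 * colbert."""
    return 0.4 * dense + 0.2 * sparse + 0.4 * colbert


# Made-up component scores for one (anchor, passage) pair.
example = aggregate_m3_score(dense=0.80, sparse=0.30, colbert=0.75)
print(round(example, 4))
```

Because the dense and multi-vector terms each carry twice the weight of the sparse term, a pair with little lexical overlap can still receive a high aggregate score.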
### New Features Added:
- `anchor_pos_aggregate_sim_m3`, `anchor_pos_list_sim_m3`
## Hard Negative Mining
- **Bi-Encoder (Mining Model)**: `Snowflake/snowflake-arctic-embed-m-v2.0`
- **Negatives per positive**: `3` (in addition to existing negatives)
- **Relative Margin applied**: `0.07`
- **Absolute Margin applied**: `0.04`
- **Range min/max**: `2 - 22`
- **Negative score min/max**: `0.15 - 0.8`
- **Sampling strategy**: `random`
### New Features Added:
- `anchor_neg_aggregate_sim_m3`, `anchor_neg_list_sim_m3`
- `scores`: Bi-Encoder rescoring predictions from the hard-negative mining phase.
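The mining settings listed above can be read as a filter over a ranked candidate list. The sketch below is one plausible interpretation, not the pipeline's actual code: the parameter names echo those of `mine_hard_negatives` in Sentence Transformers, but the card does not say which implementation was used. The function name `select_hard_negatives`, the half-open rank window, and the way the two margins are combined are all assumptions.

```python
import random


def select_hard_negatives(
    positive_score,
    ranked_candidates,
    num_negatives=3,
    relative_margin=0.07,
    absolute_margin=0.04,
    range_min=2,
    range_max=22,
    min_score=0.15,
    max_score=0.8,
    seed=0,
):
    """Pick negatives from (text, score) pairs sorted by descending similarity.

    Assumed semantics (not confirmed by the card):
    - only ranks range_min..range_max-1 of the candidate list are considered;
    - a negative must score below the positive by both margins;
    - a negative's score must lie within [min_score, max_score];
    - the final negatives are drawn at random from the eligible pool.
    """
    # Skip the top-ranked candidates, which are the most likely false negatives.
    window = ranked_candidates[range_min:range_max]

    # The strictest of the three upper bounds decides eligibility.
    ceiling = min(
        positive_score - absolute_margin,
        positive_score * (1.0 - relative_margin),
        max_score,
    )
    eligible = [(text, score) for text, score in window if min_score <= score <= ceiling]

    # "random" sampling strategy: draw without replacement from the eligible pool.
    rng = random.Random(seed)
    return rng.sample(eligible, min(num_negatives, len(eligible)))


# Made-up candidates: 30 passages with linearly decreasing similarity scores.
candidates = [(f"doc{i}", 0.9 - 0.03 * i) for i in range(30)]
picked = select_hard_negatives(positive_score=0.85, ranked_candidates=candidates)
print(picked)
```

With a positive score of 0.85, the relative margin gives the tightest bound here (0.85 × 0.93 ≈ 0.79), so candidates scoring above that are discarded even though they fall under the 0.8 maximum.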
### Source Dataset
Subsets from: [bobox/training-dataset-temp](https://huggingface.co/datasets/bobox/training-dataset-temp)
| Subset / Config Name | Train Samples | Eval Samples |
|:---|---:|---:|
| `eli5-1HN` | 35,000 | 320 |
| `global-dataset-1HN` | 105,000 | 512 |
| `natural-questions-1HN` | 35,000 | 320 |
| `npr-1HN` | 35,000 | 320 |
| `paws-1HN` | 21,829 | 3,539 |
| `qnli-1HN` | 35,000 | 1,850 |
| `sentence_compression-1HN` | 35,000 | 320 |
| `vitaminc-1HN` | 35,000 | 387 |
| `xsum-1HN` | 35,000 | 320 |
| **TOTAL** | **371,829** | **7,888** |
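A subset could be loaded along the following lines. This is a sketch that assumes the config names on the Hub match the table exactly and that the `datasets` library is installed; the helper `load_subset` and the `SUBSET_SIZES` mapping are illustrative names, not part of the dataset.

```python
# (train, eval) sample counts copied from the table above.
SUBSET_SIZES = {
    "eli5-1HN": (35_000, 320),
    "global-dataset-1HN": (105_000, 512),
    "natural-questions-1HN": (35_000, 320),
    "npr-1HN": (35_000, 320),
    "paws-1HN": (21_829, 3_539),
    "qnli-1HN": (35_000, 1_850),
    "sentence_compression-1HN": (35_000, 320),
    "vitaminc-1HN": (35_000, 387),
    "xsum-1HN": (35_000, 320),
}


def load_subset(config_name: str):
    """Load one subset from the Hub (assumes config names match the table)."""
    if config_name not in SUBSET_SIZES:
        raise KeyError(f"Unknown subset: {config_name}")
    # Imported lazily so the size check below runs without the dependency.
    from datasets import load_dataset

    return load_dataset("bobox/STS_retrieval_dataset_HN_scored", name=config_name)


# Cross-check the TOTAL row of the table.
total_train = sum(train for train, _ in SUBSET_SIZES.values())
total_eval = sum(ev for _, ev in SUBSET_SIZES.values())
print(total_train, total_eval)
```

Calling `load_subset("eli5-1HN")` requires network access; the totals check runs offline.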
Provider:
bobox



