lightonai/embeddings-fine-tuning
Hugging Face · Updated 2026-04-17 · Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/lightonai/embeddings-fine-tuning
Resource description:
---
dataset_info:
- config_name: documents
  features:
  - name: document_id
    dtype: int64
  - name: document
    dtype: string
  splits:
  - name: fiqa
    num_bytes: 44966890
    num_examples: 57599
  - name: hotpotqa
    num_bytes: 1474468794
    num_examples: 5220635
  - name: msmarco
    num_bytes: 3089144932
    num_examples: 8841661
  - name: nq
    num_bytes: 3105999594
    num_examples: 10120660
  - name: fever
    num_bytes: 2880306808
    num_examples: 5384865
  - name: squadv2
    num_bytes: 14541224
    num_examples: 19029
  - name: trivia
    num_bytes: 13228661481
    num_examples: 20970784
  download_size: 22328019465
  dataset_size: 23838089723
- config_name: queries
  features:
  - name: query_id
    dtype: int64
  - name: query
    dtype: string
  splits:
  - name: fiqa
    num_bytes: 405464
    num_examples: 5500
  - name: hotpotqa
    num_bytes: 9999569
    num_examples: 85000
  - name: msmarco
    num_bytes: 22742749
    num_examples: 502939
  - name: nq
    num_bytes: 18663008
    num_examples: 307373
  - name: fever
    num_bytes: 6541435
    num_examples: 109810
  - name: squadv2
    num_bytes: 9184156
    num_examples: 130217
  - name: trivia
    num_bytes: 7297884
    num_examples: 78785
  download_size: 64492382
  dataset_size: 74834265
- config_name: scores
  features:
  - name: query_id
    dtype: int64
  - name: document_ids
    list: int64
  - name: scores
    list: float64
  splits:
  - name: fiqa
    num_bytes: 464644800
    num_examples: 14166
  - name: hotpotqa
    num_bytes: 5576000000
    num_examples: 170000
  - name: nq
    num_bytes: 4990356000
    num_examples: 152145
  - name: msmarco
    num_bytes: 17474232800
    num_examples: 532751
  - name: fever
    num_bytes: 4594689600
    num_examples: 140082
  - name: squadv2
    num_bytes: 4272364000
    num_examples: 130255
  - name: trivia
    num_bytes: 24319100800
    num_examples: 741436
  download_size: 26855796621
  dataset_size: 61691388000
configs:
- config_name: documents
  data_files:
  - split: fiqa
    path: documents/fiqa-*
  - split: nq
    path: documents/nq-*
  - split: hotpotqa
    path: documents/hotpotqa-*
  - split: msmarco
    path: documents/msmarco-*
  - split: fever
    path: documents/fever-*
  - split: squadv2
    path: documents/squadv2-*
  - split: trivia
    path: documents/trivia-*
- config_name: queries
  data_files:
  - split: fiqa
    path: queries/fiqa-*
  - split: nq
    path: queries/nq-*
  - split: hotpotqa
    path: queries/hotpotqa-*
  - split: msmarco
    path: queries/msmarco-*
  - split: fever
    path: queries/fever-*
  - split: squadv2
    path: queries/squadv2-*
  - split: trivia
    path: queries/trivia-*
- config_name: scores
  data_files:
  - split: fiqa
    path: scores/fiqa-*
  - split: hotpotqa
    path: scores/hotpotqa-*
  - split: nq
    path: scores/nq-*
  - split: msmarco
    path: scores/msmarco-*
  - split: fever
    path: scores/fever-*
  - split: squadv2
    path: scores/squadv2-*
  - split: trivia
    path: scores/trivia-*
---
## Overview
This dataset is composed of high-quality data sources with mined hard negatives. It can be used on its own to train a strong retrieval model, but it works best after large-scale contrastive pre-training, for example using this [dataset](https://huggingface.co/datasets/lightonai/embeddings-pre-training) or its [curated version](https://huggingface.co/datasets/lightonai/mgte-en).

This dataset was originally created to follow the nv-retrieve setup, which mines the negatives closest to the query in a dataset and filters out likely false negatives: a candidate is discarded if its bi-encoder similarity to the query is higher than a given percentage of the query-positive similarity score.

To allow the exploration of various thresholds and sampling methods, we decided, as for our pre-training datasets, to be as non-destructive as possible. Instead of releasing only the final samples filtered with one method and threshold, we share all of the data, including all 2048 mined negatives and their scores, so anyone can easily apply their own strategy before training.

The mined datasets are FiQA, Natural Questions, HotpotQA, MSMARCO, FEVER, SQuAD v2 and TriviaQA, for a total of 1.88M rows in the `scores` subset, each holding a query, its positive and 2048 mined negatives with their scores. The model used for mining is [gte-modernbert](https://huggingface.co/Alibaba-NLP/gte-modernbert-base).

For more information, please refer to our [blog post](https://huggingface.co/blog/lightonai/lateon).
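
The filtering rule described above can be illustrated on a single list of scores. The snippet below is a minimal sketch, not the authors' implementation: the function name `keep_negatives` and the toy scores are made up for illustration, and it assumes the layout used by the `scores` subset, where the first score belongs to the positive.

```python
def keep_negatives(scores: list[float], threshold: float = 0.95) -> list[int]:
    """Return the positions (>= 1) of candidates kept as negatives.

    scores[0] is the query-positive score; the rest are mined candidates.
    A candidate is treated as a likely false negative and dropped when its
    score is at least `threshold * scores[0]`.
    """
    cutoff = threshold * scores[0]
    return [i for i, s in enumerate(scores[1:], start=1) if s < cutoff]


# Toy scores (hypothetical values): positive first, then four mined candidates.
toy_scores = [0.80, 0.79, 0.77, 0.70, 0.55]
kept = keep_negatives(toy_scores, threshold=0.95)
print(kept)  # [3, 4]: the cutoff is 0.95 * 0.80 = 0.76
```

Lowering the threshold removes more candidates that score close to the positive, at the cost of keeping fewer hard negatives.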
## How to use
If you want to use the data directly as contrastive data with nv-retrieve filtering in either [sentence-transformers](https://www.sbert.net) or [PyLate](https://lightonai.github.io/pylate/), you can map it to the `(query, positive, negative_0, negative_1, ..., negative_{n-1})` format like so:
<details>
<summary>
Python code to cast to contrastive format
</summary>
```python
import os

import datasets

# Repository id of this dataset card; the three subsets are loaded from it.
DATASET_ID = "lightonai/embeddings-fine-tuning"
# Number of worker processes used by `load_dataset`; adjust to your machine.
NUM_PROC = 144


class KDToContrastive:
    """Dataset processing class for converting a KD dataset into a contrastive one.

    Parameters
    ----------
    queries
        Queries dataset.
    documents
        Documents dataset.
    split
        Split to use for the queries and documents datasets. Used only if the
        queries and documents are of type `datasets.DatasetDict`.
    num_negatives
        Number of negatives to keep.
    nv_threshold
        Threshold for the nv-retrieve filtering: a candidate is kept as a
        negative only if its score is below `nv_threshold * positive_score`.
    """

    def __init__(
        self,
        queries: datasets.Dataset | datasets.DatasetDict,
        documents: datasets.Dataset | datasets.DatasetDict,
        split: str = "train",
        num_negatives: int = 32,
        nv_threshold: float = 0.95,
    ) -> None:
        if isinstance(queries, datasets.DatasetDict):
            self.queries = queries[split]
        else:
            self.queries = queries

        if isinstance(documents, datasets.DatasetDict):
            self.documents = documents[split]
        else:
            self.documents = documents

        self.num_negatives = num_negatives
        self.nv_threshold = nv_threshold

        # Map ids to row positions so texts can be looked up by id.
        self.queries_index = {
            query_id: i for i, query_id in enumerate(self.queries["query_id"])
        }
        self.documents_index = {
            document_id: i
            for i, document_id in enumerate(self.documents["document_id"])
        }

    def has_enough_negatives(self, example):
        """Check if the example has at least `num_negatives` valid negatives."""
        scores = example["scores"]
        positive_score = scores[0]
        count = sum(
            1 for score in scores[1:] if score < self.nv_threshold * positive_score
        )
        return count >= self.num_negatives

    def map_to_query_positive_negatives(self, example):
        """Map a scores example to the desired format:
        query, positive, negative_0, negative_1, ..., negative_{num_negatives - 1}
        """
        query_id = example["query_id"]
        document_ids = example["document_ids"]
        scores = example["scores"]

        # Get the query text (indexing a row returns a dict, so select the column).
        query_text = self.queries[self.queries_index[query_id]]["query"]

        # The first document_id is the positive.
        positive_id = document_ids[0]
        positive_text = self.documents[self.documents_index[positive_id]]["document"]
        positive_score = scores[0]

        # Create the row.
        row = {"query": query_text, "positive": positive_text}

        # Add negatives (starting from index 1), skipping likely false negatives.
        total_negatives = 0
        for i in range(1, len(document_ids)):
            if scores[i] < self.nv_threshold * positive_score:
                negative_id = document_ids[i]
                row[f"negative_{total_negatives}"] = self.documents[
                    self.documents_index[negative_id]
                ]["document"]
                total_negatives += 1
                if total_negatives >= self.num_negatives:
                    break
        return row


def load_train_datasets():
    """Load and convert all available splits, caching the result on disk."""
    cache_dir = "nv_retrieve_99_50_cached"
    os.makedirs(cache_dir, exist_ok=True)

    train_dataset = datasets.DatasetDict()
    splits = ["trivia", "hotpotqa", "nq", "msmarco", "fever", "squadv2", "fiqa"]

    for split in splits:
        try:
            dataset = datasets.Dataset.load_from_disk(f"{cache_dir}/{split}")
            print("Loaded dataset from disk")
        except FileNotFoundError:
            print("Creating dataset")
            dataset = datasets.load_dataset(
                DATASET_ID, name="scores", num_proc=NUM_PROC, split=split
            )
            queries = datasets.load_dataset(
                DATASET_ID, name="queries", num_proc=NUM_PROC, split=split
            )
            documents = datasets.load_dataset(
                DATASET_ID, name="documents", num_proc=NUM_PROC, split=split
            )

            processor = KDToContrastive(
                queries, documents, num_negatives=50, nv_threshold=0.99
            )

            dataset = dataset.filter(
                processor.has_enough_negatives,
                desc="Filtering examples with <50 negatives",
            ).map(
                processor.map_to_query_positive_negatives,
                remove_columns=dataset.column_names,
                desc="Creating query-positive-negatives dataset",
            )
            dataset.save_to_disk(f"{cache_dir}/{split}")

        train_dataset[split] = dataset

    return train_dataset
```
</details>
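
The class above keeps the first `num_negatives` candidates, in stored order, that pass the threshold. Since all 2048 candidates and their scores are shipped, other selection strategies are possible. As one illustration, the sketch below draws negatives uniformly at random among the candidates that pass the threshold. The function name `sample_negatives`, its `seed` parameter and the toy values are assumptions made for this example and are not part of the dataset tooling.

```python
import random


def sample_negatives(document_ids, scores, num_negatives, threshold=0.95, seed=0):
    """Randomly pick `num_negatives` document ids among the valid candidates.

    Position 0 of both lists is the positive. A candidate is valid when its
    score is below `threshold * scores[0]`. Returns None when fewer than
    `num_negatives` candidates are valid.
    """
    cutoff = threshold * scores[0]
    candidates = [d for d, s in zip(document_ids[1:], scores[1:]) if s < cutoff]
    if len(candidates) < num_negatives:
        return None
    # A dedicated Random instance keeps the draw reproducible for a given seed.
    return random.Random(seed).sample(candidates, num_negatives)


# Toy row (hypothetical ids and scores): id 11 is too close to the positive.
toy_ids = [10, 11, 12, 13, 14, 15]
toy_scores = [0.90, 0.89, 0.60, 0.50, 0.40, 0.30]
print(sample_negatives(toy_ids, toy_scores, num_negatives=2))
```

Random sampling trades some hardness for diversity; which strategy works best is an empirical question for your setup.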
## Dataset structure
The dataset is composed of 7 high-quality datasets, each exposed as a `split`.
Each split is available in 3 subsets (configs): `queries` contains the queries, `documents` contains the documents, and `scores` is a join table that also contains the corresponding pairwise query-document scores.
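
To make the relation between the three subsets concrete, here is a minimal sketch of resolving one `scores` row back to text. The helper names `load_subset` and `resolve_row` are hypothetical. `load_subset` assumes the `datasets` library is installed and that the repository id is the one in this card's title, while `resolve_row` only needs plain dictionaries.

```python
def load_subset(name: str, split: str):
    """Load one subset ("queries", "documents" or "scores") for a given split.

    Requires the `datasets` library and network access.
    """
    import datasets  # imported lazily so resolve_row works without it

    return datasets.load_dataset(
        "lightonai/embeddings-fine-tuning", name=name, split=split
    )


def resolve_row(score_row, query_text_by_id, document_text_by_id):
    """Join one scores row with the query and document texts."""
    pairs = [
        (document_text_by_id[doc_id], score)
        for doc_id, score in zip(score_row["document_ids"], score_row["scores"])
    ]
    return {
        "query": query_text_by_id[score_row["query_id"]],
        "positive": pairs[0],      # the first entry is the positive
        "candidates": pairs[1:],   # the remaining entries are mined candidates
    }


# Lookup tables could be built like this (small splits such as fiqa fit in memory):
# queries = load_subset("queries", "fiqa")
# query_text_by_id = dict(zip(queries["query_id"], queries["query"]))
```

Identifiers are only unique within a split, so queries, documents and scores should always be joined within the same split.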
### Documents
| Column        | Type   | Description                                                  |
|---------------|--------|--------------------------------------------------------------|
| `document_id` | int64  | Unique identifier of the document within the split.          |
| `document`    | string | Raw text of the document/passage.                            |

| Split    | Rows   |
|----------|-------:|
| fiqa     | 57.6k  |
| nq       | 10.1M  |
| hotpotqa | 5.22M  |
| msmarco  | 8.84M  |
| fever    | 5.38M  |
| squadv2  | 19k    |
| trivia   | 21M    |
| **Total**| **50.62M** |
### Queries
| Column     | Type   | Description                                          |
|------------|--------|------------------------------------------------------|
| `query_id` | int64  | Unique identifier of the query within the split.     |
| `query`    | string | Raw text of the query.                               |

| Split    | Rows   |
|----------|-------:|
| fiqa     | 5.5k   |
| nq       | 307k   |
| hotpotqa | 85k    |
| msmarco  | 503k   |
| fever    | 110k   |
| squadv2  | 130k   |
| trivia   | 78.8k  |
| **Total**| **1.22M** |
### Scores
| Column         | Type          | Description |
|----------------|---------------|-------------|
| `query_id`     | int64         | Identifier joining back to the corresponding row in `queries`. |
| `document_ids` | list[int64]   | List of document IDs (joining back to `documents`). The first element is the positive document, followed by the top-2048 mined negatives for the query. |
| `scores`       | list[float64] | Relevance score of each document w.r.t. the query, in the same order as `document_ids`: the positive first, followed by the top-2048 mined negatives. Can be used for nv-retrieve filtering or knowledge distillation. |

| Split    | Rows   |
|----------|-------:|
| fiqa     | 14.2k  |
| hotpotqa | 170k   |
| nq       | 152k   |
| msmarco  | 533k   |
| fever    | 140k   |
| squadv2  | 130k   |
| trivia   | 741k   |
| **Total**| **1.88M** |
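
For knowledge distillation, the stored scores can serve as teacher signals. The sketch below shows one common formulation, which this card does not prescribe: the teacher scores of a row are turned into a softmax distribution, and a student distribution is compared to it with a KL divergence. The function names, the `temperature` parameter and the toy values are assumptions made for illustration.

```python
import math


def teacher_distribution(scores, temperature=1.0):
    """Softmax over the teacher scores of one row (positive + candidates)."""
    scaled = [s / temperature for s in scores]
    highest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of the same length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy teacher scores (hypothetical) and a student that ranks documents differently.
teacher = teacher_distribution([0.9, 0.5, 0.1], temperature=0.5)
student = teacher_distribution([0.2, 0.6, 0.4], temperature=0.5)
print(kl_divergence(teacher, student))  # positive: the distributions differ
```

In practice this loss would be computed on tensors inside the training framework; the pure-Python version only shows the arithmetic.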
## Citation
If you use this dataset, please consider citing our work:
```bibtex
@misc{sourty2025denseonlateon,
title={DenseOn with LateOn: Open State-of-the-Art Single and Multi-Vector Models},
author={Sourty, Raphael and Chaffin, Antoine and Weller, Orion and Demoura, Paulo and Chatelain, Amelie},
year={2026},
howpublished={\url{https://huggingface.co/blog/lightonai/denseon-lateon}},
}
```
Provider:
lightonai



