
lightonai/embeddings-fine-tuning

Source: Hugging Face · Updated 2026-04-17 · Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/lightonai/embeddings-fine-tuning
Description:
---
dataset_info:
- config_name: documents
  features:
  - name: document_id
    dtype: int64
  - name: document
    dtype: string
  splits:
  - name: fiqa
    num_bytes: 44966890
    num_examples: 57599
  - name: hotpotqa
    num_bytes: 1474468794
    num_examples: 5220635
  - name: msmarco
    num_bytes: 3089144932
    num_examples: 8841661
  - name: nq
    num_bytes: 3105999594
    num_examples: 10120660
  - name: fever
    num_bytes: 2880306808
    num_examples: 5384865
  - name: squadv2
    num_bytes: 14541224
    num_examples: 19029
  - name: trivia
    num_bytes: 13228661481
    num_examples: 20970784
  download_size: 22328019465
  dataset_size: 23838089723
- config_name: queries
  features:
  - name: query_id
    dtype: int64
  - name: query
    dtype: string
  splits:
  - name: fiqa
    num_bytes: 405464
    num_examples: 5500
  - name: hotpotqa
    num_bytes: 9999569
    num_examples: 85000
  - name: msmarco
    num_bytes: 22742749
    num_examples: 502939
  - name: nq
    num_bytes: 18663008
    num_examples: 307373
  - name: fever
    num_bytes: 6541435
    num_examples: 109810
  - name: squadv2
    num_bytes: 9184156
    num_examples: 130217
  - name: trivia
    num_bytes: 7297884
    num_examples: 78785
  download_size: 64492382
  dataset_size: 74834265
- config_name: scores
  features:
  - name: query_id
    dtype: int64
  - name: document_ids
    list: int64
  - name: scores
    list: float64
  splits:
  - name: fiqa
    num_bytes: 464644800
    num_examples: 14166
  - name: hotpotqa
    num_bytes: 5576000000
    num_examples: 170000
  - name: nq
    num_bytes: 4990356000
    num_examples: 152145
  - name: msmarco
    num_bytes: 17474232800
    num_examples: 532751
  - name: fever
    num_bytes: 4594689600
    num_examples: 140082
  - name: squadv2
    num_bytes: 4272364000
    num_examples: 130255
  - name: trivia
    num_bytes: 24319100800
    num_examples: 741436
  download_size: 26855796621
  dataset_size: 61691388000
configs:
- config_name: documents
  data_files:
  - split: fiqa
    path: documents/fiqa-*
  - split: nq
    path: documents/nq-*
  - split: hotpotqa
    path: documents/hotpotqa-*
  - split: msmarco
    path: documents/msmarco-*
  - split: fever
    path: documents/fever-*
  - split: squadv2
    path: documents/squadv2-*
  - split: trivia
    path: documents/trivia-*
- config_name: queries
  data_files:
  - split: fiqa
    path: queries/fiqa-*
  - split: nq
    path: queries/nq-*
  - split: hotpotqa
    path: queries/hotpotqa-*
  - split: msmarco
    path: queries/msmarco-*
  - split: fever
    path: queries/fever-*
  - split: squadv2
    path: queries/squadv2-*
  - split: trivia
    path: queries/trivia-*
- config_name: scores
  data_files:
  - split: fiqa
    path: scores/fiqa-*
  - split: hotpotqa
    path: scores/hotpotqa-*
  - split: nq
    path: scores/nq-*
  - split: msmarco
    path: scores/msmarco-*
  - split: fever
    path: scores/fever-*
  - split: squadv2
    path: scores/squadv2-*
  - split: trivia
    path: scores/trivia-*
---

## Overview

This dataset is composed of high-quality data sources with mined hard negatives. It can be used on its own to train a strong retrieval model, but it works best after large-scale contrastive pre-training, for example on this [dataset](https://huggingface.co/datasets/lightonai/embeddings-pre-training) or its [curated version](https://huggingface.co/datasets/lightonai/mgte-en).

The dataset was originally created to follow the nv-retrieve setup, which mines the negatives closest to the query in a dataset and filters out false negatives, i.e. candidates whose bi-encoder similarity exceeds a given percentage of the query-positive similarity score. To allow exploration of various thresholds and sampling methods, we decided, as with our pre-training datasets, to be as non-destructive as possible. Instead of releasing only the samples that survive one particular method and threshold, we share all of the data, including all 2048 mined negatives and their scores, so anyone can easily apply their own strategy before training.

The mined datasets are FiQA, Natural Questions, HotpotQA, MS MARCO, FEVER, SQuAD v2 and TriviaQA, for a total of 1.88M rows in `scores`, each holding a query, its positive, and 2048 mined negatives with their scores.

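
As a rough illustration of the filtering rule described above, the sketch below keeps a mined negative only when its score is below a fraction of the positive's score. The function name `filter_negatives`, the `max_negatives` argument, and the toy row are hypothetical and not part of the dataset or of any library; the sketch also assumes that the first entry of each list is the positive and that scores are positive similarities, so that multiplying by the threshold lowers the cutoff.

```python
def filter_negatives(document_ids, scores, threshold=0.95, max_negatives=None):
    """Keep mined negatives whose score is below `threshold * positive_score`.

    Assumes index 0 holds the positive document and its score, and that the
    remaining entries are the mined negatives in their stored order.
    """
    cutoff = threshold * scores[0]
    kept = [
        (doc_id, score)
        for doc_id, score in zip(document_ids[1:], scores[1:])
        if score < cutoff  # anything at or above the cutoff is a suspected false negative
    ]
    if max_negatives is not None:
        kept = kept[:max_negatives]
    return kept


# Toy row mimicking the layout of the `scores` subset (values are made up).
row = {
    "query_id": 0,
    "document_ids": [10, 11, 12, 13, 14],
    "scores": [0.80, 0.79, 0.70, 0.77, 0.50],
}

negatives = filter_negatives(row["document_ids"], row["scores"], threshold=0.95)
print(negatives)  # [(12, 0.7), (14, 0.5)]
```

With a threshold of 0.95 the cutoff is about 0.76, so the candidates scoring 0.79 and 0.77 are dropped as likely false negatives. Lowering the threshold discards more near-positive candidates at the cost of easier negatives.
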
The model used for mining is [gte-modernbert](https://huggingface.co/Alibaba-NLP/gte-modernbert-base).

For more information, please refer to our [blog post](https://huggingface.co/blog/lightonai/lateon).

## How to use

If you want to use the data directly as contrastive data with nv-retrieve filtering in either [sentence-transformers](https://www.sbert.net) or [PyLate](https://lightonai.github.io/pylate/), you can map it to the `(query, positive, negative_0, negative_1, ..., negative_{n-1})` format like so:

<details>
<summary> Python code to cast to contrastive format </summary>

```python
from __future__ import annotations

import os

import datasets

# Repository id of this dataset card. The original snippet pointed at
# "lightonai/nv-embed-supervised-distill-dedup"; switch back if that is the
# repository you have access to.
DATASET_ID = "lightonai/embeddings-fine-tuning"

# Number of processes used for loading; adjust to your machine.
NUM_PROC = 144


class KDToContrastive:
    """Dataset processing class for converting a KD dataset into a contrastive one.

    Parameters
    ----------
    queries
        Queries dataset.
    documents
        Documents dataset.
    split
        Split to use for the queries and documents datasets. Used only if the
        queries and documents are of type `datasets.DatasetDict`.
    num_negatives
        Number of negatives to keep.
    nv_threshold
        Threshold for the nv-retrieve filtering.
    """

    def __init__(
        self,
        queries: datasets.Dataset | datasets.DatasetDict,
        documents: datasets.Dataset | datasets.DatasetDict,
        split: str = "train",
        num_negatives: int = 32,
        nv_threshold: float = 0.95,
    ) -> None:
        if isinstance(queries, datasets.DatasetDict):
            self.queries = queries[split]
        else:
            self.queries = queries

        if isinstance(documents, datasets.DatasetDict):
            self.documents = documents[split]
        else:
            self.documents = documents

        self.num_negatives = num_negatives
        self.nv_threshold = nv_threshold

        # Map ids to row positions so that lookups do not scan the datasets.
        self.queries_index = {
            query_id: i for i, query_id in enumerate(self.queries["query_id"])
        }
        self.documents_index = {
            document_id: i
            for i, document_id in enumerate(self.documents["document_id"])
        }

    def has_enough_negatives(self, example):
        """Check if the example has at least `num_negatives` valid negatives."""
        scores = example["scores"]
        positive_score = scores[0]
        count = sum(
            1 for score in scores[1:] if score < self.nv_threshold * positive_score
        )
        return count >= self.num_negatives

    def map_to_query_positive_negatives(self, example):
        """Map a scores example to the desired format:
        query, positive, negative_0, negative_1, ..., negative_{num_negatives - 1}
        """
        query_id = example["query_id"]
        document_ids = example["document_ids"]
        scores = example["scores"]

        # Get the query text (indexing a row returns a dict of columns).
        query_text = self.queries[self.queries_index[query_id]]["query"]

        # The first document_id is the positive.
        positive_id = document_ids[0]
        positive_text = self.documents[self.documents_index[positive_id]]["document"]
        positive_score = scores[0]

        # Create the row.
        row = {"query": query_text, "positive": positive_text}

        # Add negatives (starting from index 1) that pass the threshold.
        total_negatives = 0
        for negative_id, score in zip(document_ids[1:], scores[1:]):
            if score < self.nv_threshold * positive_score:
                row[f"negative_{total_negatives}"] = self.documents[
                    self.documents_index[negative_id]
                ]["document"]
                total_negatives += 1
                if total_negatives >= self.num_negatives:
                    break

        return row


def load_train_datasets():
    """Load all available splits of the dataset, with on-disk caching."""
    cache_dir = "nv_retrieve_99_50_cached"
    os.makedirs(cache_dir, exist_ok=True)

    train_dataset = datasets.DatasetDict()
    splits = ["trivia", "hotpotqa", "nq", "msmarco", "fever", "squadv2", "fiqa"]

    for split in splits:
        try:
            dataset = datasets.Dataset.load_from_disk(f"{cache_dir}/{split}")
            print("Loaded dataset from disk")
        except FileNotFoundError:
            print("Creating dataset")
            dataset = datasets.load_dataset(
                DATASET_ID, name="scores", num_proc=NUM_PROC, split=split
            )
            queries = datasets.load_dataset(
                DATASET_ID, name="queries", num_proc=NUM_PROC, split=split
            )
            documents = datasets.load_dataset(
                DATASET_ID, name="documents", num_proc=NUM_PROC, split=split
            )

            processor = KDToContrastive(
                queries, documents, num_negatives=50, nv_threshold=0.99
            )

            dataset = dataset.filter(
                processor.has_enough_negatives,
                desc="Filtering examples with <50 negatives",
            ).map(
                processor.map_to_query_positive_negatives,
                remove_columns=dataset.column_names,
                desc="Creating query-positive-negatives dataset",
            )
            dataset.save_to_disk(f"{cache_dir}/{split}")

        train_dataset[split] = dataset

    return train_dataset
```

</details>

## Dataset structure

The dataset combines 7 high-quality source datasets, each exposed as a split. Every split is available in 3 subsets: `queries`, `documents`, and `scores`, the last being a join table that also stores the corresponding pairwise query-document scores.

### Documents

| Column        | Type   | Description                                          |
|---------------|--------|------------------------------------------------------|
| `document_id` | int64  | Unique identifier of the document within the split.  |
| `document`    | string | Raw text of the document/passage.                    |

| Split     |       Rows |
|-----------|-----------:|
| fiqa      |      57.6k |
| nq        |      10.1M |
| hotpotqa  |      5.22M |
| msmarco   |      8.84M |
| fever     |      5.38M |
| squadv2   |        19k |
| trivia    |        21M |
| **Total** | **50.62M** |

### Queries

| Column     | Type   | Description                                      |
|------------|--------|--------------------------------------------------|
| `query_id` | int64  | Unique identifier of the query within the split. |
| `query`    | string | Raw text of the query.                           |

| Split     |      Rows |
|-----------|----------:|
| fiqa      |      5.5k |
| nq        |      307k |
| hotpotqa  |       85k |
| msmarco   |      503k |
| fever     |      110k |
| squadv2   |      130k |
| trivia    |     78.8k |
| **Total** | **1.22M** |

### Scores

| Column         | Type          | Description |
|----------------|---------------|-------------|
| `query_id`     | int64         | Identifier joining back to the corresponding row in `queries`. |
| `document_ids` | list[int64]   | List of document IDs (joining back to `documents`). The first element is the positive document, followed by the top-2048 mined negatives for the query. |
| `scores`       | list[float64] | Relevance score of each document with respect to the query, in the same order as `document_ids`: the first score belongs to the positive document, followed by the scores of the top-2048 mined negatives. Can be used for nv-retrieve filtering or knowledge distillation. |

| Split     |      Rows |
|-----------|----------:|
| fiqa      |     14.2k |
| hotpotqa  |      170k |
| nq        |      152k |
| msmarco   |      533k |
| fever     |      140k |
| squadv2   |      130k |
| trivia    |      741k |
| **Total** | **1.88M** |

## Citation

If you use this dataset, please consider citing our work:

```bibtex
@misc{sourty2025denseonlateon,
  title={DenseOn with LateOn: Open State-of-the-Art Single and Multi-Vector Models},
  author={Sourty, Raphael and Chaffin, Antoine and Weller, Orion and Demoura, Paulo and Chatelain, Amelie},
  year={2026},
  howpublished={\url{https://huggingface.co/blog/lightonai/denseon-lateon}},
}
```
Provided by:
lightonai