stefan-jo/fiqa-train-mined-reranker-scores

Name: stefan-jo/fiqa-train-mined-reranker-scores
Creator: stefan-jo
Published: 2026-04-01 19:24:36
License: No description provided

Source: Hugging Face. Updated 2026-04-01. Indexed 2026-04-12.

Download link:

https://hf-mirror.com/datasets/stefan-jo/fiqa-train-mined-reranker-scores


Resource description:

---
license: other
license_name: fiqa-2018-non-commercial
pretty_name: FiQA Train Mined Reranker Scores
task_categories:
- text-retrieval
tags:
- retrieval
- distillation
- hard-negative-mining
- reranker-scores
language:
- en
size_categories:
- 1K<n<10K
---

# FiQA Train Mined Reranker Scores

This repository contains a derived FiQA training dataset for retrieval and distillation experiments. It was constructed from the **FiQA training split** by mining candidate negatives and then assigning teacher scores with a reranker.

## Dataset Summary

- Source dataset: FiQA-2018 train split in BEIR-style preprocessing
- Released split: `train`
- Number of rows: `5500`
- Row schema:
  - `query_id`
  - `document_ids`
  - `scores`
  - `labels`

Each final row contains a ranked candidate set with:

- `1` positive document
- `15` mined negative documents
- `16` scored documents total

## Data Construction Process

The mining and scoring pipeline follows the scripts in `examples/pooling/train/data/` and the paper *Learn to Pool: Lightweight Fine-Tuning for Flexible Multi-Vector Compression*.

For each query, positive documents are taken from the FiQA train qrels. A pool of 100 hard negative candidates is then mined with [`BAAI/bge-small-en-v1.5`](https://huggingface.co/BAAI/bge-small-en-v1.5), and an additional pool of 30 random negatives is sampled from the remaining corpus.

All candidate query-document pairs are then scored with [`BAAI/bge-reranker-v2-gemma`](https://huggingface.co/BAAI/bge-reranker-v2-gemma). Following the filtering strategy described in the paper, negatives with a reranker score higher than `95%` of the positive anchor score for that query are removed to reduce the chance of keeping false negatives.

The final per-query candidate set is built from the remaining documents by keeping the top positive, the top-scoring hard negatives, additional medium negatives sampled from the remaining hard-negative pool, and random negatives sampled from the random pool.
This yields a compact scored tuple with exactly `16` documents per query: `1` positive and `15` mined negatives spanning hard, medium, and random examples.

## Released Features

The released `train` split stores:

- `query_id`: query identifier
- `document_ids`: ranked list of document identifiers
- `scores`: reranker scores aligned with `document_ids`
- `labels`: document roles aligned with `document_ids` and `scores`, one of:
  - `positive`
  - `hard_negative`
  - `medium_negative`
  - `random_negative`

The `labels` column indicates how each document entered the final candidate set. This is useful if you want to distinguish stronger mined negatives from easier random negatives during analysis or downstream training.

This release contains **ID-based training annotations**, not the full query and document texts. The `query_id` and `document_ids` fields are meant to be resolved against the corresponding FiQA train split in BEIR format. In the training pipeline, this is done by loading the BEIR train split, building:

- a `query_id -> text` mapping from the query set
- a `document_id -> text` mapping from the corpus

and then applying those mappings to the scored tuples during training. See `examples/pooling/train/train_beir_colbert_distillation.py` for a concrete example using `load_beir_train_split(...)` together with `utils.KDProcessing(...)`.

## Intended Use

This dataset is intended for:

- retrieval training
- knowledge distillation / teacher-student training
- studying mined and reranker-scored candidate sets
- reproducing the paper's FiQA experiments

## Provenance

The released dataset was built using:

- source data: FiQA-2018 train split
- mining model: `BAAI/bge-small-en-v1.5`
- reranker: `BAAI/bge-reranker-v2-gemma`

## License and Data Sources

This repository is released under a **custom non-commercial notice** because it is derived from FiQA-2018 training data.
The official FiQA-2018 source states that the relevant Opinion-based QA data are available only for non-commercial use.

Relevant upstream components:

- source dataset: FiQA-2018 / BEIR-style preprocessing
- mining model: `BAAI/bge-small-en-v1.5` (`MIT`)
- reranker: `BAAI/bge-reranker-v2-gemma` (`Apache-2.0`)

Users should review and comply with the upstream FiQA terms in addition to this repository notice.

## References

- Paper: [Learn to Pool: Lightweight Fine-Tuning for Flexible Multi-Vector Compression](https://stefan-jo.github.io/learn-to-pool/downloads/paper.pdf)
- Code: [stefan-jo/pylate](https://github.com/stefan-jo/pylate)
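
## Usage Sketch

The ID resolution and the false-negative filter described in this card can be illustrated with a minimal, standard-library-only sketch. It is not the repository's pipeline: the toy `queries` and `corpus` dictionaries, the example row, and the helper names `resolve_row` and `keep_negative` are all hypothetical, and the filter assumes positive scores with a plain `0.95 * positive_score` threshold, which may differ from the exact rule in the paper. The actual training code uses `load_beir_train_split(...)` and `utils.KDProcessing(...)`, which are not reproduced here.

```python
from typing import Dict

# Hypothetical toy lookups. In practice these would be built from the
# FiQA train split in BEIR format (query set and corpus).
queries: Dict[str, str] = {"q1": "What is an index fund?"}
corpus: Dict[str, str] = {
    "d1": "An index fund tracks a market index.",
    "d2": "Index funds usually have low fees.",
    "d3": "Bananas are rich in potassium.",
}

# One row shaped like the released schema (all values are made up).
row = {
    "query_id": "q1",
    "document_ids": ["d1", "d2", "d3"],
    "scores": [9.0, 4.0, 0.5],
    "labels": ["positive", "hard_negative", "random_negative"],
}


def resolve_row(row: dict, queries: Dict[str, str], corpus: Dict[str, str]) -> dict:
    """Replace IDs with text while keeping scores and labels aligned."""
    return {
        "query": queries[row["query_id"]],
        "documents": [corpus[doc_id] for doc_id in row["document_ids"]],
        "scores": list(row["scores"]),
        "labels": list(row["labels"]),
    }


def keep_negative(neg_score: float, pos_score: float, ratio: float = 0.95) -> bool:
    """Assumed form of the filter: drop negatives scoring above ratio * pos_score."""
    return neg_score <= ratio * pos_score


resolved = resolve_row(row, queries, corpus)
print(resolved["query"])
print(resolved["documents"])
```

Because `document_ids`, `scores`, and `labels` are aligned lists, resolving IDs position by position preserves the pairing between each document text, its teacher score, and its role.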

Provider:

stefan-jo
