
macavaney/d2q-msmarco-passage-scores-tct

Source: Hugging Face | Updated: 2022-12-18 | Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/macavaney/d2q-msmarco-passage-scores-tct
Resource description:
---
annotations_creators:
- no-annotation
language: []
language_creators:
- machine-generated
license: []
pretty_name: Doc2Query TCT Relevance Scores for `msmarco-passage`
source_datasets: [msmarco-passage]
tags:
- document-expansion
- doc2query--
task_categories:
- text-retrieval
task_ids:
- document-retrieval
viewer: false
---

# Doc2Query TCT Relevance Scores for `msmarco-passage`

This dataset provides the pre-computed query relevance scores for the [`msmarco-passage`](https://ir-datasets.com/msmarco-passage) dataset, for use with Doc2Query--.

The generated queries come from [`macavaney/d2q-msmarco-passage`](https://huggingface.co/datasets/macavaney/d2q-msmarco-passage) and were scored with [`castorini/tct_colbert-v2-hnp-msmarco`](https://huggingface.co/castorini/tct_colbert-v2-hnp-msmarco).

## Getting started

This artefact is meant to be used with the [`pyterrier_doc2query`](https://github.com/terrierteam/pyterrier_doc2query) package. It can be installed as:

```bash
pip install git+https://github.com/terrierteam/pyterrier_doc2query
```

Depending on what you are using this artefact for, you may also need the following additional packages:

```bash
pip install git+https://github.com/terrierteam/pyterrier_pisa # for indexing / retrieval
pip install git+https://github.com/terrierteam/pyterrier_dr # for reproducing this artefact
```

## Using this artefact

The main use case is to use this artefact in a Doc2Query-- indexing pipeline:

```python
import pyterrier as pt ; pt.init()
from pyterrier_pisa import PisaIndex
from pyterrier_doc2query import QueryScoreStore, QueryFilter

store = QueryScoreStore.from_repo('https://huggingface.co/datasets/macavaney/d2q-msmarco-passage-scores-tct')
index = PisaIndex('path/to/index')
pipeline = store.query_scorer(limit_k=40) >> QueryFilter(t=store.percentile(70)) >> index

dataset = pt.get_dataset('irds:msmarco-passage')
pipeline.index(dataset.get_corpus_iter())
```

You can also use the store directly as a dataset to look up or iterate over the data:

```python
store.lookup('100')
# {'querygen': ..., 'querygen_store': ...}
for record in store:
    pass
```

## Reproducing this artefact

This artefact can be reproduced using the following pipeline:

```python
import pyterrier as pt ; pt.init()
from pyterrier_dr import TctColBert
from pyterrier_doc2query import Doc2QueryStore, QueryScoreStore, QueryScorer

doc2query_generator = Doc2QueryStore.from_repo('https://huggingface.co/datasets/macavaney/d2q-msmarco-passage').generator()
store = QueryScoreStore('path/to/store')
pipeline = doc2query_generator >> QueryScorer(TctColBert('castorini/tct_colbert-v2-hnp-msmarco')) >> store

dataset = pt.get_dataset('irds:msmarco-passage')
pipeline.index(dataset.get_corpus_iter())
```

Note that this process will take quite some time; it computes the relevance score for 80 generated queries for every document in the dataset.
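To give a rough sense of why reproduction is slow, the sketch below counts the (query, passage) pairs that have to be scored. The figure of 80 queries per passage comes from the note above; the passage count of 8,841,823 for `msmarco-passage` is not stated in this README and is supplied here as an assumption, and `estimated_hours` is a hypothetical helper whose throughput argument you would need to measure on your own hardware.

```python
# Back-of-the-envelope cost of reproducing the score store.
# NUM_PASSAGES is an assumption (size of the msmarco-passage corpus);
# QUERIES_PER_PASSAGE is taken from the README note.
NUM_PASSAGES = 8_841_823
QUERIES_PER_PASSAGE = 80


def total_pairs(num_passages=NUM_PASSAGES, queries_per_passage=QUERIES_PER_PASSAGE):
    """Number of (generated query, passage) pairs to score."""
    return num_passages * queries_per_passage


def estimated_hours(pairs_per_second):
    """Hypothetical helper: wall-clock hours at a given scoring throughput."""
    if pairs_per_second <= 0:
        raise ValueError("pairs_per_second must be positive")
    return total_pairs() / pairs_per_second / 3600


if __name__ == "__main__":
    print(f"pairs to score: {total_pairs():,}")
```

Calling `estimated_hours` with a measured throughput turns the pair count into a time estimate; no throughput figure is given here because it depends entirely on the hardware and batch size used.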
Provider:
macavaney
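Summary of original information

Dataset overview

Dataset name

  • Name: Doc2Query TCT Relevance Scores for msmarco-passage

Dataset description

  • Description: This dataset provides pre-computed query relevance scores for the msmarco-passage dataset, for use with Doc2Query--.

Data source

  • Source dataset: msmarco-passage

Source of generated queries

  • Query generation: from macavaney/d2q-msmarco-passage
  • Scoring model: scored with castorini/tct_colbert-v2-hnp-msmarco

Use cases

  • Primary use: a Doc2Query-- indexing pipeline
  • Example code: uses the pyterrier_doc2query package for dataset processing and index construction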
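The filtering idea behind a Doc2Query-- indexing pipeline can be pictured with a small, library-free sketch: consider up to `limit_k` generated queries per document, keep only those whose relevance score reaches a global percentile threshold, and append the survivors to the document text before indexing. This is an illustration under stated assumptions, not the `pyterrier_doc2query` implementation; the function names, the nearest-rank percentile rule, the `>=` comparison, and the toy scores are all hypothetical.

```python
import math


def percentile_threshold(scores, pct):
    """Nearest-rank percentile (an assumed rule, not necessarily the library's)."""
    ordered = sorted(scores)
    if not ordered:
        raise ValueError("scores must be non-empty")
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def filter_queries(doc_queries, threshold, limit_k=None):
    """Keep queries scoring at least `threshold`, looking at the first `limit_k` per document.

    doc_queries maps a document id to a list of (query_text, score) pairs.
    """
    kept = {}
    for doc_id, scored in doc_queries.items():
        candidates = scored if limit_k is None else scored[:limit_k]
        kept[doc_id] = [query for query, score in candidates if score >= threshold]
    return kept


def expand(text, queries):
    """Append the surviving queries to the document text."""
    return "\n".join([text, *queries])


# Toy data with made-up scores, for illustration only.
doc_queries = {
    "d1": [("what is pyterrier", 0.9), ("weather today", 0.1), ("python ir toolkit", 0.7)],
    "d2": [("msmarco passages", 0.8), ("cooking pasta", 0.2)],
}
all_scores = [score for scored in doc_queries.values() for _, score in scored]
threshold = percentile_threshold(all_scores, 70)
kept = filter_queries(doc_queries, threshold, limit_k=40)
```

With a threshold at the 70th percentile, only the highest-scoring queries are kept, so low-relevance generated queries are dropped before they can add misleading terms to the index.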

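Dataset operations

  • Lookup: the data can be looked up directly or iterated over
  • Reproduction: the dataset can be reproduced with a dedicated pipeline

Notes

  • Processing time: generating the dataset takes a long time, because a relevance score must be computed for 80 generated queries for every document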