macavaney/d2q-msmarco-passage-scores-tct
Source: Hugging Face · Updated: 2022-12-18 · Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/macavaney/d2q-msmarco-passage-scores-tct
Description:
---
annotations_creators:
- no-annotation
language: []
language_creators:
- machine-generated
license: []
pretty_name: Doc2Query TCT Relevance Scores for `msmarco-passage`
source_datasets: [msmarco-passage]
tags:
- document-expansion
- doc2query--
task_categories:
- text-retrieval
task_ids:
- document-retrieval
viewer: false
---
# Doc2Query TCT Relevance Scores for `msmarco-passage`
This dataset provides the pre-computed query relevance scores for the [`msmarco-passage`](https://ir-datasets.com/msmarco-passage) dataset,
for use with Doc2Query--.
The generated queries come from [`macavaney/d2q-msmarco-passage`](https://huggingface.co/datasets/macavaney/d2q-msmarco-passage) and
were scored with [`castorini/tct_colbert-v2-hnp-msmarco`](https://huggingface.co/castorini/tct_colbert-v2-hnp-msmarco).
## Getting started
This artefact is meant to be used with the [`pyterrier_doc2query`](https://github.com/terrierteam/pyterrier_doc2query) package, which can
be installed with:
```bash
pip install git+https://github.com/terrierteam/pyterrier_doc2query
```
Depending on what you are using this artefact for, you may also need the following additional packages:
```bash
pip install git+https://github.com/terrierteam/pyterrier_pisa # for indexing / retrieval
pip install git+https://github.com/terrierteam/pyterrier_dr # for reproducing this artefact
```
## Using this artefact
The main use case is to plug this artefact into a Doc2Query-- indexing pipeline:
```python
import pyterrier as pt ; pt.init()
from pyterrier_pisa import PisaIndex
from pyterrier_doc2query import QueryScoreStore, QueryFilter
store = QueryScoreStore.from_repo('https://huggingface.co/datasets/macavaney/d2q-msmarco-passage-scores-tct')
index = PisaIndex('path/to/index')
pipeline = store.query_scorer(limit_k=40) >> QueryFilter(t=store.percentile(70)) >> index
dataset = pt.get_dataset('irds:msmarco-passage')
pipeline.index(dataset.get_corpus_iter())
```
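To make the roles of `limit_k` and the percentile threshold more concrete, here is a minimal NumPy sketch of the filtering idea. It is not the `pyterrier_doc2query` implementation: the helper name `filter_queries`, the toy queries and scores, and the assumption that queries scoring at or above the threshold are kept are all illustrative. In the pipeline above the threshold comes from `store.percentile(70)`, whereas this toy example simply takes the percentile over one passage's own scores.

```python
import numpy as np

def filter_queries(queries, scores, threshold, limit_k=None):
    """Keep the queries whose score is at or above `threshold`.

    Hypothetical helper for illustration only; `limit_k` mimics considering
    just the first k generated queries of a passage.
    """
    if limit_k is not None:
        queries, scores = queries[:limit_k], scores[:limit_k]
    return [q for q, s in zip(queries, scores) if s >= threshold]

# Toy data: four generated queries for one passage, with made-up scores.
queries = ["what is pisa", "pisa index python", "weather today", "how to index msmarco"]
scores = np.array([0.91, 0.85, 0.12, 0.78])

# Assumption: the threshold is taken over this passage's scores only.
threshold = np.percentile(scores, 70)
kept = filter_queries(queries, scores, threshold)
```

A higher percentile discards more of the generated queries, which shrinks the expanded passages and the resulting index.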
You can also use the store directly as a dataset to look up or iterate over the data:
```python
store.lookup('100')
# {'querygen': ..., 'querygen_score': ...}
for record in store:
    pass
```
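A looked-up record can then be post-processed in plain Python. The sketch below ranks the generated queries of one passage by score. It uses a hand-made record rather than real data, and it assumes that `querygen` holds the queries as a newline-separated string and that the scores sit in a parallel sequence under a key named `querygen_score`. Both assumptions should be checked against an actual record before relying on them.

```python
def rank_queries(record, score_key="querygen_score"):
    """Return (query, score) pairs sorted from most to least relevant.

    Hypothetical helper; the record layout is an assumption (see above).
    """
    queries = record["querygen"].split("\n")
    scores = [float(s) for s in record[score_key]]
    if len(queries) != len(scores):
        raise ValueError("queries and scores are not aligned")
    return sorted(zip(queries, scores), key=lambda pair: pair[1], reverse=True)

# Hand-made stand-in for the result of store.lookup(...); values are invented.
record = {
    "querygen": "what is a passage\nwho wrote msmarco\nmsmarco passage size",
    "querygen_score": [0.42, 0.17, 0.88],
}
ranked = rank_queries(record)
```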
## Reproducing this artefact
This artefact can be reproduced using the following pipeline:
```python
import pyterrier as pt ; pt.init()
from pyterrier_dr import TctColBert
from pyterrier_doc2query import Doc2QueryStore, QueryScoreStore, QueryScorer
doc2query_generator = Doc2QueryStore.from_repo('https://huggingface.co/datasets/macavaney/d2q-msmarco-passage').generator()
store = QueryScoreStore('path/to/store')
pipeline = doc2query_generator >> QueryScorer(TctColBert('castorini/tct_colbert-v2-hnp-msmarco')) >> store
dataset = pt.get_dataset('irds:msmarco-passage')
pipeline.index(dataset.get_corpus_iter())
```
Note that this process takes a long time: it computes a relevance score for each of the 80 generated queries
of every document in the dataset.
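As a rough back-of-the-envelope check on the cost, the number of query-passage pairs to score can be computed directly. The corpus size of 8,841,823 passages is the published size of the MS MARCO passage collection. The throughput figure is purely a placeholder assumption, not a measurement, so substitute a rate observed on your own hardware.

```python
NUM_PASSAGES = 8_841_823      # size of the MS MARCO passage corpus
QUERIES_PER_PASSAGE = 80      # generated queries scored per passage

total_pairs = NUM_PASSAGES * QUERIES_PER_PASSAGE

def estimated_hours(pairs, pairs_per_second):
    """Convert a pair count and an assumed scoring rate into hours."""
    return pairs / pairs_per_second / 3600

# Placeholder rate (assumption): 5,000 query-passage pairs scored per second.
hours = estimated_hours(total_pairs, pairs_per_second=5_000)
print(f"{total_pairs:,} pairs, about {hours:.1f} hours at the assumed rate")
```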
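Provider:
macavaney
Summary of the original information
Dataset overview
Dataset name
- Name: Doc2Query TCT Relevance Scores for msmarco-passage
Dataset description
- Description: This dataset provides pre-computed query relevance scores for the msmarco-passage dataset, for use with Doc2Query--.
Data source
- Source dataset: msmarco-passage
Generated query source
- Query generation: from macavaney/d2q-msmarco-passage
- Scoring model: scored with castorini/tct_colbert-v2-hnp-msmarco
Usage
- Main use: Doc2Query-- indexing pipelines
- Example code: uses the pyterrier_doc2query package for dataset processing and index construction
Dataset operations
- Lookup: the data can be looked up directly or iterated over
- Reproduction: the dataset can be regenerated with a dedicated pipeline
Notes
- Processing time: generating the dataset takes a long time, since a relevance score must be computed for the 80 generated queries of every document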



