query2doc_msmarco
Source: ModelScope community (updated 2025-12-05, indexed 2025-02-15)
Download link: https://modelscope.cn/datasets/intfloat/query2doc_msmarco
### Dataset Summary
This dataset contains pseudo-documents generated by GPT-3.5 (`text-davinci-003`) for MS-MARCO queries.

Paper: [Query2doc: Query Expansion with Large Language Models](https://arxiv.org/pdf/2303.07678.pdf) by Liang Wang, Nan Yang, and Furu Wei.
### Data Instances
An example looks as follows.
```
{
"query_id": "1030303",
"query": "who is aziz hashim",
"pseudo_doc": "Aziz Hashim is a renowned entrepreneur, business leader, and one of the most successful restaurant franchise operators in the US. He is the founder of NRD Capital, a private equity firm focused on investments in multi-unit restaurant franchised businesses. Hashim has built a formidable track record of success in the franchise industry, with brands such as Outback Steakhouse and Jamba Juice. His accomplishments and philanthropic initiatives have earned him numerous awards, including the prestigious Ernst and Young Entrepreneur of the Year award."
}
```
### Data Fields
- `query_id`: a `string` feature.
- `query`: a `string` feature.
- `pseudo_doc`: a `string` feature.
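
As an illustration of how these fields might be combined for query expansion, the sketch below concatenates a repeated copy of `query` with its `pseudo_doc`. The helper name `expand_query`, the `query_repeats` parameter, and its default of 5 are assumptions made for this example; the exact recipe used by the authors is described in the paper and implemented in their script, not here. The `pseudo_doc` in the sample record is shortened for readability.

```python
def expand_query(record: dict, query_repeats: int = 5) -> str:
    """Build an expanded query string from one dataset record.

    Hypothetical helper: repeats the original query `query_repeats` times
    (an assumed default) and appends the LLM-generated pseudo-document.
    """
    query = record["query"].strip()
    pseudo_doc = record["pseudo_doc"].strip()
    # Repeating the query keeps its terms from being drowned out by the
    # much longer pseudo-document in term-frequency based retrieval.
    return " ".join([query] * query_repeats + [pseudo_doc])


# Shortened version of the record shown under "Data Instances".
example = {
    "query_id": "1030303",
    "query": "who is aziz hashim",
    "pseudo_doc": "Aziz Hashim is a renowned entrepreneur.",
}

expanded = expand_query(example, query_repeats=2)
print(expanded)
```

The resulting string can be passed to a sparse retriever such as BM25 in place of the original query.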
### Data Splits
| train | dev | test | trec_dl2019 | trec_dl2020 |
|--------|------:|------:|------:|------:|
| 502939 | 6980 | 6837 | 43 | 54 |
### How to use this dataset
```python
from datasets import load_dataset
dataset = load_dataset('intfloat/query2doc_msmarco')
print(dataset['trec_dl2019'][0])
```
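
After loading, a quick sanity check is to compare the split sizes with the Data Splits table. The following is a minimal sketch; the helper name `find_size_mismatches` is hypothetical, and it assumes the split names match the table's column headers (`train`, `dev`, `test`, `trec_dl2019`, `trec_dl2020`).

```python
# Split sizes as listed in the Data Splits table.
EXPECTED_SIZES = {
    "train": 502939,
    "dev": 6980,
    "test": 6837,
    "trec_dl2019": 43,
    "trec_dl2020": 54,
}


def find_size_mismatches(dataset, expected=EXPECTED_SIZES):
    """Return {split: (expected, actual)} for every split whose size differs.

    `dataset` is any mapping from split name to a sized collection.
    A missing split is reported with an actual size of None.
    """
    mismatches = {}
    for name, want in expected.items():
        got = len(dataset[name]) if name in dataset else None
        if got != want:
            mismatches[name] = (want, got)
    return mismatches


# Usage with the real dataset (requires network access):
#   from datasets import load_dataset
#   dataset = load_dataset('intfloat/query2doc_msmarco')
#   print(find_size_mismatches(dataset))  # an empty dict means all sizes match
```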
### Reproducing our results
We provide a Python script, [repro_bm25.py](https://huggingface.co/datasets/intfloat/query2doc_msmarco/blob/main/repro_bm25.py), to reproduce our results with BM25 retrieval.
First, install the Python dependencies:
```
pip install pyserini==0.15.0 pytrec_eval datasets tqdm
```
Then download and run the script:
```
python repro_bm25.py
```
This script uses the pre-built Lucene index from [Pyserini](https://github.com/castorini/pyserini/blob/pyserini-0.15.0/docs/prebuilt-indexes.md)
and might yield slightly different results from those reported in the paper.
### Citation Information
```
@article{wang2023query2doc,
title={Query2doc: Query Expansion with Large Language Models},
author={Wang, Liang and Yang, Nan and Wei, Furu},
journal={arXiv preprint arXiv:2303.07678},
year={2023}
}
```
Provider: maas
Created: 2025-02-12



