pyterrier/quora.pisa
Source: Hugging Face. Updated 2024-10-08; indexed 2025-04-26.
Download link:
https://hf-mirror.com/datasets/pyterrier/quora.pisa
Resource description:
---
# pretty_name: "" # Example: "MS MARCO Terrier Index"
tags:
- pyterrier
- pyterrier-artifact
- pyterrier-artifact.sparse_index
- pyterrier-artifact.sparse_index.pisa
task_categories:
- text-retrieval
viewer: false
---
# quora.pisa
## Description
A PISA index of the Quora duplicate question dataset (`beir/quora` in ir_datasets), built with the Porter2 stemmer.
## Usage
```python
# Load the artifact
import pyterrier as pt
index = pt.Artifact.from_hf('pyterrier/quora.pisa')
index.bm25() # returns a BM25 retriever
```
## Benchmarks
`quora/dev`
| name | nDCG@10 | R@1000 |
|:-------|----------:|---------:|
| bm25 | 0.7195 | 0.9845 |
| dph | 0.5893 | 0.9711 |
`quora/test`
| name | nDCG@10 | R@1000 |
|:-------|----------:|---------:|
| bm25 | 0.7122 | 0.9875 |
| dph | 0.5809 | 0.9729 |
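For readers unfamiliar with the two measures reported above, the following is a minimal sketch of how nDCG@10 and R@1000 can be computed for a single query. It assumes linear gain and a log2 rank discount; the card does not state which evaluation tool or nDCG variant produced the numbers, and the function names, toy judgments, and toy ranking here are purely illustrative.

```python
import math

def ndcg_at_k(ranked_docnos, qrels, k=10):
    """nDCG@k for one query; qrels maps docno -> relevance grade (0 = not relevant)."""
    # Gain of each retrieved document in the top k (unjudged documents count as 0)
    gains = [qrels.get(d, 0) for d in ranked_docnos[:k]]
    dcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))
    # Ideal ordering: judged documents sorted by decreasing relevance
    ideal = sorted(qrels.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

def recall_at_k(ranked_docnos, qrels, k=1000):
    """Fraction of the relevant documents that appear in the top k."""
    relevant = {d for d, rel in qrels.items() if rel > 0}
    if not relevant:
        return 0.0
    return len(relevant & set(ranked_docnos[:k])) / len(relevant)

# Toy example (hypothetical data): two relevant documents, retrieved at ranks 2 and 4
qrels = {"d1": 1, "d4": 1}
ranking = ["d3", "d1", "d2", "d4"]
ndcg = ndcg_at_k(ranking, qrels)
recall = recall_at_k(ranking, qrels)
print(f"nDCG@10 = {ndcg:.4f}, R@1000 = {recall:.4f}")
```

The benchmark tables report these per-query values averaged over all queries in the corresponding split.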
## Reproduction
```python
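import pyterrier as pt
from tqdm import tqdm
import ir_datasets
from pyterrier_pisa import PisaIndex

# Index is written to the quora.pisa directory, using 16 threads
index = PisaIndex("quora.pisa", threads=16)
# Load the BEIR version of the Quora collection
dataset = ir_datasets.load('beir/quora')
# Stream documents lazily as {'docno', 'text'} records, with a progress bar
docs = ({'docno': d.doc_id, 'text': d.default_text()} for d in tqdm(dataset.docs))
index.index(docs)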
```
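The generator expression in the reproduction script can be tried out without downloading the collection. The sketch below uses a hypothetical `ToyDoc` class as a stand-in for an ir_datasets document record (the real records come from `beir/quora`; the class, its fields, and the sample texts are assumptions for illustration only). It shows the record shape passed to the indexer and that the stream is single-pass.

```python
from dataclasses import dataclass

@dataclass
class ToyDoc:
    # Hypothetical stand-in for an ir_datasets document record
    doc_id: str
    text: str

    def default_text(self) -> str:
        # Assumption: the toy record's default text is just its text field
        return self.text

toy_docs = [
    ToyDoc("1", "What is information retrieval?"),
    ToyDoc("2", "How do inverted indexes work?"),
]

# Same mapping as in the reproduction script: one dict per document
docs = ({'docno': d.doc_id, 'text': d.default_text()} for d in toy_docs)

records = list(docs)   # consumes the generator
leftover = list(docs)  # a generator is single-pass, so nothing is left
print(records)
print(leftover)
```

Because the stream is consumed once, the generator would need to be re-created before indexing a second time.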
## Metadata
```json
{
"type": "sparse_index",
"format": "pisa",
"package_hint": "pyterrier-pisa",
"stemmer": "porter2"
}
```
Provider:
pyterrier



