---
pretty_name: '`wikiclir/zh`'
viewer: false
source_datasets: []
task_categories:
- text-retrieval
---
# Dataset Card for `wikiclir/zh`
The `wikiclir/zh` dataset is provided by the [ir-datasets](https://ir-datasets.com/) package.
For more information about the dataset, see the [documentation](https://ir-datasets.com/wikiclir#wikiclir/zh).
## Data
This dataset provides:
- `docs` (documents, i.e., the corpus); count=951,480
- `queries` (i.e., topics); count=463,273
- `qrels` (relevance assessments); count=926,130
## Usage
```python
from datasets import load_dataset

# Documents (the corpus)
docs = load_dataset('irds/wikiclir_zh', 'docs')
for record in docs:
    record # {'doc_id': ..., 'title': ..., 'text': ...}

# Queries (topics)
queries = load_dataset('irds/wikiclir_zh', 'queries')
for record in queries:
    record # {'query_id': ..., 'text': ...}

# Relevance assessments
qrels = load_dataset('irds/wikiclir_zh', 'qrels')
for record in qrels:
    record # {'query_id': ..., 'doc_id': ..., 'relevance': ..., 'iteration': ...}
```
Note that calling `load_dataset` will download the dataset (or provide access instructions when it's not public) and make a copy of the
data in 🤗 Dataset format.
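
The `qrels` records link queries to documents through `query_id` and `doc_id`. A minimal sketch of grouping them per query is shown here. The helper name `group_qrels_by_query` and the toy records are illustrative assumptions, not part of the dataset or of ir-datasets. Only the field names come from the record layout above.

```python
from collections import defaultdict


def group_qrels_by_query(qrels):
    """Map each query_id to a {doc_id: relevance} dict.

    `qrels` is any iterable of dict-like records with the fields
    'query_id', 'doc_id' and 'relevance' (hypothetical helper).
    """
    judged = defaultdict(dict)
    for record in qrels:
        judged[record['query_id']][record['doc_id']] = record['relevance']
    return dict(judged)


# Toy records that mimic the field layout shown above.
# The IDs and relevance values are invented for illustration.
toy_qrels = [
    {'query_id': 'q1', 'doc_id': 'd1', 'relevance': 2, 'iteration': '0'},
    {'query_id': 'q1', 'doc_id': 'd2', 'relevance': 1, 'iteration': '0'},
    {'query_id': 'q2', 'doc_id': 'd3', 'relevance': 2, 'iteration': '0'},
]

judged = group_qrels_by_query(toy_qrels)
print(judged['q1'])  # {'d1': 2, 'd2': 1}
```

The same function should accept the records yielded when iterating over the loaded `qrels` data, provided each record exposes those three fields.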
## Citation Information
```
@inproceedings{sasaki-etal-2018-cross,
title = "Cross-Lingual Learning-to-Rank with Shared Representations",
author = "Sasaki, Shota and
Sun, Shuo and
Schamoni, Shigehiko and
Duh, Kevin and
Inui, Kentaro",
booktitle = "Proceedings of the 2018 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)",
month = jun,
year = "2018",
address = "New Orleans, Louisiana",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/N18-2073",
doi = "10.18653/v1/N18-2073",
pages = "458--463"
}
```