---
license: cc-by-nc-4.0
viewer: false
---
# Baidu ULTR Dataset - Tencent BERT-12l-12h
Query-document vectors and clicks for a subset of the [Baidu Unbiased Learning to Rank](https://arxiv.org/abs/2207.03051) dataset.
This dataset uses the pretrained [BERT cross-encoder (Bert_Layer12_Head12) from Tencent](https://github.com/lixsh6/Tencent_wsdm_cup2023/tree/main/pytorch_unbias), published as part of the WSDM Cup 2023, to compute query-document vectors (768 dims).
## Setup
1. Install Hugging Face [datasets](https://huggingface.co/docs/datasets/installation): `pip install datasets`
2. Install [pandas](https://github.com/pandas-dev/pandas) and [pyarrow](https://arrow.apache.org/docs/python/index.html): `pip install pandas pyarrow`
3. If you cannot install `pyarrow >= 14.0.1`, you might need to install [pyarrow-hotfix](https://github.com/pitrou/pyarrow-hotfix) as a workaround
4. You can now use the dataset as described below.
## Load train / test click dataset
```Python
from datasets import load_dataset

dataset = load_dataset(
    "philipphager/baidu-ultr_tencent-mlm-ctr",
    name="clicks",
    split="train",  # ["train", "test"]
    cache_dir="~/.cache/huggingface",
)
dataset.set_format("torch") # [None, "numpy", "torch", "tensorflow", "pandas", "arrow"]
```
## Load expert annotations
```Python
from datasets import load_dataset

dataset = load_dataset(
    "philipphager/baidu-ultr_tencent-mlm-ctr",
    name="annotations",
    split="test",
    cache_dir="~/.cache/huggingface",
)
dataset.set_format("torch") # [None, "numpy", "torch", "tensorflow", "pandas", "arrow"]
```
## Available features
Each row of the click / annotation dataset contains the following attributes. Use a custom `collate_fn` to select specific features (see below):
### Click dataset
| name | dtype | description |
|------------------------------|----------------|-------------|
| query_id | string | Baidu query_id |
| query_md5 | string | MD5 hash of query text |
| url_md5 | List[string] | MD5 hash of document url, most reliable document identifier |
| text_md5 | List[string] | MD5 hash of document title and abstract |
| query_document_embedding | Tensor[float16]| BERT CLS token |
| click | Tensor[int32] | Click / no click on a document (1 = click, 0 = no click) |
| n | int32 | Number of documents for current query, useful for padding |
| position | Tensor[int32] | Position in ranking (does not always match original item position) |
| media_type | Tensor[int32] | Document type (label encoding recommended, as ids do not occupy a continuous integer range) |
| displayed_time | Tensor[float32]| Seconds a document was displayed on screen |
| serp_height | Tensor[int32] | Pixel height of a document on screen |
| slipoff_count_after_click | Tensor[int32] | Number of times a document was scrolled off screen after previously clicking on it |
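
The table recommends label encoding for `media_type` because its ids are not contiguous. The following is a minimal sketch of one way to do that in plain Python. The helper names `fit_label_encoder` and `encode`, the example ids, and the choice to reserve index 0 for padding and unseen ids are all assumptions made for illustration, not part of the dataset:

```python
from typing import Dict, Iterable, List


def fit_label_encoder(raw_ids: Iterable[int]) -> Dict[int, int]:
    """Map each distinct raw id to a contiguous index starting at 1.

    Index 0 is left free so it can stand for padding or unseen ids (an assumption).
    """
    return {raw: idx for idx, raw in enumerate(sorted(set(raw_ids)), start=1)}


def encode(raw_ids: List[int], encoder: Dict[int, int], unknown: int = 0) -> List[int]:
    """Replace raw ids by their contiguous index; ids not seen during fitting map to `unknown`."""
    return [encoder.get(raw, unknown) for raw in raw_ids]


# Hypothetical raw media_type ids, as they might be collected from the training split.
observed = [21, 3, 3, 57, 21]
encoder = fit_label_encoder(observed)

# 99 was never observed, so it falls back to the `unknown` index.
encoded = encode([3, 57, 21, 99], encoder)
```

Fitting the mapping on the training split only and reusing it for the test split keeps the indices consistent, for example when they feed an embedding layer.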
### Expert annotation dataset
| name | dtype | description |
|------------------------------|----------------|-------------|
| query_id | string | Baidu query_id |
| query_md5 | string | MD5 hash of query text |
| text_md5 | List[string] | MD5 hash of document title and abstract |
| query_document_embedding | Tensor[float16]| BERT CLS token |
| label | Tensor[int32] | Relevance judgment on a scale from 0 (bad) to 4 (excellent) |
| n | int32 | Number of documents for current query, useful for padding |
| frequency_bucket | int32 | Monthly frequency of query (bucket) from 0 (high frequency) to 9 (low frequency) |
## Example PyTorch collate function
Each sample in the dataset is a single query with multiple documents.
The following example demonstrates how to create a batch containing multiple queries with varying numbers of documents by applying padding:
```Python
from collections import defaultdict
from typing import List

import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader


def collate_clicks(samples: List):
    # Gather the selected features of every query in the batch.
    batch = defaultdict(list)

    for sample in samples:
        batch["query_document_embedding"].append(sample["query_document_embedding"])
        batch["position"].append(sample["position"])
        batch["click"].append(sample["click"])
        batch["n"].append(sample["n"])

    # Zero-pad each feature to the largest number of documents in the batch.
    return {
        "query_document_embedding": pad_sequence(
            batch["query_document_embedding"], batch_first=True
        ),
        "position": pad_sequence(batch["position"], batch_first=True),
        "click": pad_sequence(batch["click"], batch_first=True),
        "n": torch.tensor(batch["n"]),
    }


# Expects `dataset` to be loaded with `dataset.set_format("torch")` as shown above.
loader = DataLoader(dataset, collate_fn=collate_clicks, batch_size=16)
```
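
Because `pad_sequence` pads with zeros by default, a padded entry in `click` cannot be told apart from a real document that was not clicked. The `n` feature can be used to recover which entries are real. The sketch below shows one possible way to build such a mask. The helper name `padding_mask` and the toy tensors are assumptions for illustration:

```python
import torch


def padding_mask(n: torch.Tensor, max_docs: int) -> torch.Tensor:
    """Return a (batch, max_docs) boolean mask that is True for real documents."""
    positions = torch.arange(max_docs, device=n.device)
    # Broadcast (1, max_docs) against (batch, 1): entry [i, j] is True if j < n[i].
    return positions.unsqueeze(0) < n.unsqueeze(1)


# Toy batch of three queries with 3, 1, and 2 documents, padded to length 3.
n = torch.tensor([3, 1, 2])
click = torch.tensor([[1, 0, 0], [0, 0, 0], [1, 1, 0]])

mask = padding_mask(n, max_docs=click.shape[1])

# Count clicks per query while ignoring padded entries.
clicks_per_query = (click * mask).sum(dim=1)
```

The same mask can be applied when averaging a loss over documents, so that padded positions do not contribute.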