philipphager/baidu-ultr_tencent-mlm-ctr

Name: philipphager/baidu-ultr_tencent-mlm-ctr
Creator: philipphager
Published: 2024-01-21 14:28:19
License: not specified on this page (the dataset card's front matter declares cc-by-nc-4.0)

Source: Hugging Face. Updated 2024-01-21; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/philipphager/baidu-ultr_tencent-mlm-ctr

Resource description:

---
license: cc-by-nc-4.0
viewer: false
---

# Baidu ULTR Dataset - Tencent BERT-12l-12h

Query-document vectors and clicks for a subset of the [Baidu Unbiased Learning to Rank](https://arxiv.org/abs/2207.03051) dataset. This dataset uses the pretrained [BERT cross-encoder (Bert_Layer12_Head12) from Tencent](https://github.com/lixsh6/Tencent_wsdm_cup2023/tree/main/pytorch_unbias), published as part of the WSDM Cup 2023, to compute query-document vectors (768 dims).

## Setup

1. Install Hugging Face [datasets](https://huggingface.co/docs/datasets/installation).
2. Install [pandas](https://github.com/pandas-dev/pandas) and [pyarrow](https://arrow.apache.org/docs/python/index.html): `pip install pandas pyarrow`
3. Optionally, install [pyarrow-hotfix](https://github.com/pitrou/pyarrow-hotfix) if you cannot install `pyarrow >= 14.0.1`.
4. You can now use the dataset as described below.

## Load train / test click dataset:

```Python
from datasets import load_dataset

dataset = load_dataset(
    "philipphager/baidu-ultr_tencent-mlm-ctr",
    name="clicks",
    split="train",  # ["train", "test"]
    cache_dir="~/.cache/huggingface",
)

# Return columns as PyTorch tensors when rows are accessed.
dataset.set_format("torch")  # [None, "numpy", "torch", "tensorflow", "pandas", "arrow"]
```

## Load expert annotations:

```Python
from datasets import load_dataset

dataset = load_dataset(
    "philipphager/baidu-ultr_tencent-mlm-ctr",
    name="annotations",
    split="test",
    cache_dir="~/.cache/huggingface",
)

# Return columns as PyTorch tensors when rows are accessed.
dataset.set_format("torch")  # [None, "numpy", "torch", "tensorflow", "pandas", "arrow"]
```

## Available features

Each row of the click / annotation dataset contains the following attributes. Use a custom `collate_fn` to select specific features (see below):

### Click dataset

| name | dtype | description |
|------------------------------|----------------|-------------|
| query_id | string | Baidu query_id |
| query_md5 | string | MD5 hash of query text |
| url_md5 | List[string] | MD5 hash of document url, most reliable document identifier |
| text_md5 | List[string] | MD5 hash of document title and abstract |
| query_document_embedding | Tensor[float16] | BERT CLS token |
| click | Tensor[int32] | Click / no click on a document |
| n | int32 | Number of documents for current query, useful for padding |
| position | Tensor[int32] | Position in ranking (does not always match original item position) |
| media_type | Tensor[int32] | Document type (label encoding recommended as ids do not occupy a continuous integer range) |
| displayed_time | Tensor[float32] | Seconds a document was displayed on screen |
| serp_height | Tensor[int32] | Pixel height of a document on screen |
| slipoff_count_after_click | Tensor[int32] | Number of times a document was scrolled off screen after previously clicking on it |

### Expert annotation dataset

| name | dtype | description |
|------------------------------|----------------|-------------|
| query_id | string | Baidu query_id |
| query_md5 | string | MD5 hash of query text |
| text_md5 | List[string] | MD5 hash of document title and abstract |
| query_document_embedding | Tensor[float16] | BERT CLS token |
| label | Tensor[int32] | Relevance judgment on a scale from 0 (bad) to 4 (excellent) |
| n | int32 | Number of documents for current query, useful for padding |
| frequency_bucket | int32 | Monthly frequency of query (bucket) from 0 (high frequency) to 9 (low frequency) |

## Example PyTorch collate function

Each sample in the dataset is a single query with multiple documents. The following example demonstrates how to create a batch containing multiple queries with varying numbers of documents by applying padding:

```Python
import torch
from typing import List
from collections import defaultdict
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader


def collate_clicks(samples: List):
    # Collect the selected features of every query in the batch.
    batch = defaultdict(list)

    for sample in samples:
        batch["query_document_embedding"].append(sample["query_document_embedding"])
        batch["position"].append(sample["position"])
        batch["click"].append(sample["click"])
        batch["n"].append(sample["n"])

    # Pad per-document features to the longest query in the batch.
    return {
        "query_document_embedding": pad_sequence(
            batch["query_document_embedding"], batch_first=True
        ),
        "position": pad_sequence(batch["position"], batch_first=True),
        "click": pad_sequence(batch["click"], batch_first=True),
        "n": torch.tensor(batch["n"]),
    }


loader = DataLoader(dataset, collate_fn=collate_clicks, batch_size=16)
```

Provider:

philipphager

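Summary of original information

Baidu ULTR Dataset - Tencent BERT-12l-12h

Dataset overview

This dataset contains query-document vectors and click data for a subset of the Baidu Unbiased Learning to Rank dataset. The query-document vectors (768 dimensions) are computed with the pretrained BERT cross-encoder (Bert_Layer12_Head12) published by Tencent.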

Data loading

Load the train / test click dataset

    from datasets import load_dataset

    dataset = load_dataset(
        "philipphager/baidu-ultr_tencent-mlm-ctr",
        name="clicks",
        split="train",  # ["train", "test"]
        cache_dir="~/.cache/huggingface",
    )

    dataset.set_format("torch")  # [None, "numpy", "torch", "tensorflow", "pandas", "arrow"]

Load the expert annotation dataset

    from datasets import load_dataset

    dataset = load_dataset(
        "philipphager/baidu-ultr_tencent-mlm-ctr",
        name="annotations",
        split="test",
        cache_dir="~/.cache/huggingface",
    )

    dataset.set_format("torch")  # [None, "numpy", "torch", "tensorflow", "pandas", "arrow"]

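Available features

Click dataset

| name | dtype | description |
|------------------------------|----------------|-------------|
| query_id | string | Baidu query ID |
| query_md5 | string | MD5 hash of the query text |
| url_md5 | List[string] | MD5 hash of the document URL, the most reliable document identifier |
| text_md5 | List[string] | MD5 hash of the document title and abstract |
| query_document_embedding | Tensor[float16] | BERT CLS token |
| click | Tensor[int32] | Click / no click on a document |
| n | int32 | Number of documents for the current query, useful for padding |
| position | Tensor[int32] | Position in the ranking (does not always match the original item position) |
| media_type | Tensor[int32] | Document type (label encoding recommended, as the ids do not occupy a continuous integer range) |
| displayed_time | Tensor[float32] | Seconds a document was displayed on screen |
| serp_height | Tensor[int32] | Pixel height of a document on screen |
| slipoff_count_after_click | Tensor[int32] | Number of times a document was scrolled off screen after being clicked |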
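The `media_type` row recommends label encoding because the raw ids are not contiguous. A minimal sketch of one way to do this in plain Python is shown here. The helper names `fit_label_encoding` and `encode`, the choice to reserve index 0 for padding or unseen ids, and the example id values are all assumptions for illustration, not part of the dataset.

```python
from typing import Dict, Iterable, List


def fit_label_encoding(values: Iterable[int]) -> Dict[int, int]:
    """Map each distinct raw id to a contiguous index starting at 1.

    Index 0 is left free so it can stand for padding or unseen ids
    (an assumption of this sketch).
    """
    return {raw: idx for idx, raw in enumerate(sorted(set(values)), start=1)}


def encode(values: Iterable[int], mapping: Dict[int, int]) -> List[int]:
    """Replace raw ids with their contiguous index; unseen ids become 0."""
    return [mapping.get(value, 0) for value in values]


# Hypothetical raw media_type ids for two queries (values are made up).
raw_media_types = [[0, 21, 3], [21, 57]]

mapping = fit_label_encoding(v for query in raw_media_types for v in query)
encoded = [encode(query, mapping) for query in raw_media_types]
```

The mapping should be fitted on the training split only and reused for the test split, so that both splits share the same indices.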

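Expert annotation dataset

| name | dtype | description |
|------------------------------|----------------|-------------|
| query_id | string | Baidu query ID |
| query_md5 | string | MD5 hash of the query text |
| text_md5 | List[string] | MD5 hash of the document title and abstract |
| query_document_embedding | Tensor[float16] | BERT CLS token |
| label | Tensor[int32] | Relevance judgment on a scale from 0 (bad) to 4 (excellent) |
| n | int32 | Number of documents for the current query, useful for padding |
| frequency_bucket | int32 | Monthly frequency of the query (bucket), from 0 (high frequency) to 9 (low frequency) |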
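Graded labels such as the 0 to 4 `label` column are commonly used to score a ranking with DCG@k. The sketch below assumes the widely used exponential gain `2^label - 1` and a `log2` position discount; the dataset card itself does not prescribe a metric, and `dcg_at_k` as well as the example scores are hypothetical.

```python
import math
from typing import Sequence


def dcg_at_k(scores: Sequence[float], labels: Sequence[int], k: int = 10) -> float:
    """DCG@k with gain (2^label - 1) and a log2 position discount."""
    # Rank documents by descending model score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    return sum(
        (2 ** label - 1) / math.log2(rank + 2)
        for rank, (_, label) in enumerate(ranked[:k])
    )


# Hypothetical model scores and expert labels for one query with three documents.
scores = [0.1, 0.9, 0.5]
labels = [0, 3, 2]
dcg = dcg_at_k(scores, labels, k=10)
```

When working on padded batches, only the first `n` documents of each query should be passed in, so that padding does not count as a document.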

Example PyTorch collate function

Each sample in the dataset is a single query with multiple documents. The following example shows how to create a batch containing multiple queries with varying numbers of documents by applying padding:

    import torch
    from typing import List
    from collections import defaultdict
    from torch.nn.utils.rnn import pad_sequence
    from torch.utils.data import DataLoader


    def collate_clicks(samples: List):
        # Collect the selected features of every query in the batch.
        batch = defaultdict(list)

        for sample in samples:
            batch["query_document_embedding"].append(sample["query_document_embedding"])
            batch["position"].append(sample["position"])
            batch["click"].append(sample["click"])
            batch["n"].append(sample["n"])

        # Pad per-document features to the longest query in the batch.
        return {
            "query_document_embedding": pad_sequence(
                batch["query_document_embedding"], batch_first=True
            ),
            "position": pad_sequence(batch["position"], batch_first=True),
            "click": pad_sequence(batch["click"], batch_first=True),
            "n": torch.tensor(batch["n"]),
        }


    loader = DataLoader(dataset, collate_fn=collate_clicks, batch_size=16)
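Because `pad_sequence` pads with zeros by default, a padded slot in `click` looks the same as a real non-click, which is why `n` is useful for building a mask. The following is a minimal, dependency-free sketch of that idea using plain Python lists instead of tensors; `padding_mask` is a hypothetical helper name and the click values are made up.

```python
from typing import List


def padding_mask(n: List[int]) -> List[List[bool]]:
    """Return one row per query: True marks a real document, False marks padding."""
    max_n = max(n)
    return [[i < n_docs for i in range(max_n)] for n_docs in n]


# Hypothetical batch: three queries with 3, 1, and 2 documents.
n = [3, 1, 2]
mask = padding_mask(n)

# Zero-padded click matrix for the same batch (values are made up).
clicks = [[1, 0, 0], [1, 0, 0], [0, 1, 0]]

# Count clicks over real documents only, skipping padded slots.
real_clicks = sum(
    click
    for row, row_mask in zip(clicks, mask)
    for click, keep in zip(row, row_mask)
    if keep
)
real_docs = sum(n)
```

With tensors, the same mask can be written as `torch.arange(max_n)[None, :] < n[:, None]`, where `n` is the tensor returned by the collate function and `max_n` is the padded length.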
