philipphager/baidu-ultr_baidu-mlm-ctr

Name: philipphager/baidu-ultr_baidu-mlm-ctr
Creator: philipphager
Published: 2024-02-01 08:49:55
License: no description provided

Hugging Face · Updated 2024-02-01 · Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/philipphager/baidu-ultr_baidu-mlm-ctr

Resource description:

---
license: cc-by-nc-4.0
viewer: false
---

# Baidu ULTR Dataset - Baidu BERT-12l-12h

Query-document vectors and clicks for a subset of the [Baidu Unbiased Learning to Rank dataset](https://arxiv.org/abs/2207.03051). This dataset uses the BERT cross-encoder with 12 layers from Baidu released in the [official starter-kit](https://github.com/ChuXiaokai/baidu_ultr_dataset/) to compute query-document vectors (768 dims).

## Setup

1. Install huggingface [datasets](https://huggingface.co/docs/datasets/installation)
2. Install [pandas](https://github.com/pandas-dev/pandas) and [pyarrow](https://arrow.apache.org/docs/python/index.html): `pip install pandas pyarrow`
3. Optionally, you might need to install a [pyarrow-hotfix](https://github.com/pitrou/pyarrow-hotfix) if you cannot install `pyarrow >= 14.0.1`
4. You can now use the dataset as described below.

## Load train / test click dataset:

```Python
from datasets import load_dataset

dataset = load_dataset(
    "philipphager/baidu-ultr_baidu-mlm-ctr",
    name="clicks",
    split="train",  # ["train", "test"]
    cache_dir="~/.cache/huggingface",
)

dataset.set_format("torch")  # [None, "numpy", "torch", "tensorflow", "pandas", "arrow"]
```

## Load expert annotations:

```Python
from datasets import load_dataset

dataset = load_dataset(
    "philipphager/baidu-ultr_baidu-mlm-ctr",
    name="annotations",
    split="test",
    cache_dir="~/.cache/huggingface",
)

dataset.set_format("torch")  # [None, "numpy", "torch", "tensorflow", "pandas", "arrow"]
```

## Available features

Each row of the click / annotation dataset contains the following attributes. Use a custom `collate_fn` to select specific features (see below):

### Click dataset

| name | dtype | description |
|------------------------------|----------------|-------------|
| query_id | string | Baidu query_id |
| query_md5 | string | MD5 hash of query text |
| query | List[int32] | List of query tokens |
| query_length | int32 | Number of query tokens |
| n | int32 | Number of documents for current query, useful for padding |
| url_md5 | List[string] | MD5 hash of document URL, most reliable document identifier |
| text_md5 | List[string] | MD5 hash of document title and abstract |
| title | List[List[int32]] | List of tokens for document titles |
| abstract | List[List[int32]] | List of tokens for document abstracts |
| query_document_embedding | Tensor[Tensor[float16]] | BERT CLS token |
| click | Tensor[int32] | Click / no click on a document |
| position | Tensor[int32] | Position in ranking (does not always match original item position) |
| media_type | Tensor[int32] | Document type (label encoding recommended as IDs do not occupy a continuous integer range) |
| displayed_time | Tensor[float32] | Seconds a document was displayed on the screen |
| serp_height | Tensor[int32] | Pixel height of a document on the screen |
| slipoff_count_after_click | Tensor[int32] | Number of times a document was scrolled off the screen after previously clicking on it |
| bm25 | Tensor[float32] | BM25 score for documents |
| bm25_title | Tensor[float32] | BM25 score for document titles |
| bm25_abstract | Tensor[float32] | BM25 score for document abstracts |
| tf_idf | Tensor[float32] | TF-IDF score for documents |
| tf | Tensor[float32] | Term frequency for documents |
| idf | Tensor[float32] | Inverse document frequency for documents |
| ql_jelinek_mercer_short | Tensor[float32] | Query likelihood score for documents using Jelinek-Mercer smoothing (alpha = 0.1) |
| ql_jelinek_mercer_long | Tensor[float32] | Query likelihood score for documents using Jelinek-Mercer smoothing (alpha = 0.7) |
| ql_dirichlet | Tensor[float32] | Query likelihood score for documents using Dirichlet smoothing (lambda = 128) |
| document_length | Tensor[int32] | Length of documents |
| title_length | Tensor[int32] | Length of document titles |
| abstract_length | Tensor[int32] | Length of document abstracts |

### Expert annotation dataset

| name | dtype | description |
|------------------------------|----------------|-------------|
| query_id | string | Baidu query_id |
| query_md5 | string | MD5 hash of query text |
| query | List[int32] | List of query tokens |
| query_length | int32 | Number of query tokens |
| frequency_bucket | int32 | Monthly frequency of query (bucket) from 0 (high frequency) to 9 (low frequency) |
| n | int32 | Number of documents for current query, useful for padding |
| url_md5 | List[string] | MD5 hash of document URL, most reliable document identifier |
| text_md5 | List[string] | MD5 hash of document title and abstract |
| title | List[List[int32]] | List of tokens for document titles |
| abstract | List[List[int32]] | List of tokens for document abstracts |
| query_document_embedding | Tensor[Tensor[float16]] | BERT CLS token |
| label | Tensor[int32] | Relevance judgments on a scale from 0 (bad) to 4 (excellent) |
| bm25 | Tensor[float32] | BM25 score for documents |
| bm25_title | Tensor[float32] | BM25 score for document titles |
| bm25_abstract | Tensor[float32] | BM25 score for document abstracts |
| tf_idf | Tensor[float32] | TF-IDF score for documents |
| tf | Tensor[float32] | Term frequency for documents |
| idf | Tensor[float32] | Inverse document frequency for documents |
| ql_jelinek_mercer_short | Tensor[float32] | Query likelihood score for documents using Jelinek-Mercer smoothing (alpha = 0.1) |
| ql_jelinek_mercer_long | Tensor[float32] | Query likelihood score for documents using Jelinek-Mercer smoothing (alpha = 0.7) |
| ql_dirichlet | Tensor[float32] | Query likelihood score for documents using Dirichlet smoothing (lambda = 128) |
| document_length | Tensor[int32] | Length of documents |
| title_length | Tensor[int32] | Length of document titles |
| abstract_length | Tensor[int32] | Length of document abstracts |

## Example PyTorch collate function

Each sample in the dataset is a single query with multiple documents. The following example demonstrates how to create a batch containing multiple queries with varying numbers of documents by applying padding:

```Python
import torch
from typing import List
from collections import defaultdict
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader


def collate_clicks(samples: List):
    batch = defaultdict(lambda: [])

    for sample in samples:
        batch["query_document_embedding"].append(sample["query_document_embedding"])
        batch["position"].append(sample["position"])
        batch["click"].append(sample["click"])
        batch["n"].append(sample["n"])

    return {
        "query_document_embedding": pad_sequence(
            batch["query_document_embedding"], batch_first=True
        ),
        "position": pad_sequence(batch["position"], batch_first=True),
        "click": pad_sequence(batch["click"], batch_first=True),
        "n": torch.tensor(batch["n"]),
    }


loader = DataLoader(dataset, collate_fn=collate_clicks, batch_size=16)
```
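
Because padded positions are filled with zeros, a padded click looks the same as a real "no click". The feature table describes `n` as useful for padding, so one possible way to tell real documents from padding is a boolean mask built from `n`. The sketch below is only an illustration on a toy batch; the helper name `padding_mask` and the toy tensors are assumptions, not part of the dataset or its starter-kit.

```python
import torch


def padding_mask(n: torch.Tensor, max_docs: int) -> torch.Tensor:
    """Boolean mask of shape (batch, max_docs): True marks a real document."""
    positions = torch.arange(max_docs, device=n.device)
    return positions.unsqueeze(0) < n.unsqueeze(1)


# Toy batch (hypothetical values): two queries with 3 and 1 documents,
# padded to a common length of 3 as a padding collate function would do.
n = torch.tensor([3, 1])
click = torch.tensor([[1, 0, 0], [1, 0, 0]])

mask = padding_mask(n, click.shape[1])

# Average click rate over real documents only; padded slots are excluded.
ctr = click[mask].float().mean()
```

The same mask can be passed to a loss function so that padded documents do not contribute to the gradient.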

Provider:

philipphager

Summary of original information

Baidu ULTR Dataset - Baidu BERT-12l-12h

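Dataset overview

This dataset contains query-document vectors and click data for a subset of the Baidu Unbiased Learning to Rank dataset. The query-document vectors (768 dimensions) are computed with the 12-layer BERT cross-encoder released by Baidu.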

Data loading

Load the train / test click dataset

```Python
from datasets import load_dataset

dataset = load_dataset(
    "philipphager/baidu-ultr_baidu-mlm-ctr",
    name="clicks",
    split="train",  # ["train", "test"]
    cache_dir="~/.cache/huggingface",
)

dataset.set_format("torch")  # [None, "numpy", "torch", "tensorflow", "pandas", "arrow"]
```

Load the expert annotation dataset

```Python
from datasets import load_dataset

dataset = load_dataset(
    "philipphager/baidu-ultr_baidu-mlm-ctr",
    name="annotations",
    split="test",
    cache_dir="~/.cache/huggingface",
)

dataset.set_format("torch")  # [None, "numpy", "torch", "tensorflow", "pandas", "arrow"]
```

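Available features

Click dataset

| name | dtype | description |
|------------------------------|----------------|-------------|
| query_id | string | Baidu query_id |
| query_md5 | string | MD5 hash of query text |
| query | List[int32] | List of query tokens |
| query_length | int32 | Number of query tokens |
| n | int32 | Number of documents for current query, useful for padding |
| url_md5 | List[string] | MD5 hash of document URL, most reliable document identifier |
| text_md5 | List[string] | MD5 hash of document title and abstract |
| title | List[List[int32]] | List of tokens for document titles |
| abstract | List[List[int32]] | List of tokens for document abstracts |
| query_document_embedding | Tensor[Tensor[float16]] | BERT CLS token |
| click | Tensor[int32] | Click / no click on a document |
| position | Tensor[int32] | Position in ranking (does not always match original item position) |
| media_type | Tensor[int32] | Document type (label encoding recommended as IDs are not contiguous) |
| displayed_time | Tensor[float32] | Seconds a document was displayed on the screen |
| serp_height | Tensor[int32] | Pixel height of a document on the screen |
| slipoff_count_after_click | Tensor[int32] | Number of times a document was scrolled off the screen after being clicked |
| bm25 | Tensor[float32] | BM25 score for documents |
| bm25_title | Tensor[float32] | BM25 score for document titles |
| bm25_abstract | Tensor[float32] | BM25 score for document abstracts |
| tf_idf | Tensor[float32] | TF-IDF score for documents |
| tf | Tensor[float32] | Term frequency for documents |
| idf | Tensor[float32] | Inverse document frequency for documents |
| ql_jelinek_mercer_short | Tensor[float32] | Query likelihood score using Jelinek-Mercer smoothing (alpha = 0.1) |
| ql_jelinek_mercer_long | Tensor[float32] | Query likelihood score using Jelinek-Mercer smoothing (alpha = 0.7) |
| ql_dirichlet | Tensor[float32] | Query likelihood score using Dirichlet smoothing (lambda = 128) |
| document_length | Tensor[int32] | Length of documents |
| title_length | Tensor[int32] | Length of document titles |
| abstract_length | Tensor[int32] | Length of document abstracts |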
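
The table recommends label encoding for `media_type` because its IDs are not contiguous. A minimal sketch of one way to do this in PyTorch follows; the helper names `fit_vocab` and `encode` and the toy ID values are assumptions for illustration. It also assumes the vocabulary is built once (for example over the training split) and that every ID seen later is already in it, since unseen IDs are not handled here.

```python
import torch


def fit_vocab(values: torch.Tensor) -> torch.Tensor:
    """Collect the sorted set of distinct IDs observed in `values`."""
    return torch.unique(values)


def encode(values: torch.Tensor, vocab: torch.Tensor) -> torch.Tensor:
    """Map each ID to its index in the sorted vocabulary (0 .. len(vocab) - 1).

    Assumption: every ID in `values` is present in `vocab`. IDs that are
    missing from the vocabulary would silently receive a wrong index.
    """
    return torch.searchsorted(vocab, values)


# Toy media_type IDs (hypothetical values) for two queries with three documents each.
media_type = torch.tensor([[7, 120, 7], [33, 120, 0]])

vocab = fit_vocab(media_type)
encoded = encode(media_type, vocab)
```

The encoded indices can then feed an embedding layer with `len(vocab)` entries.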

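Expert annotation dataset

| name | dtype | description |
|------------------------------|----------------|-------------|
| query_id | string | Baidu query_id |
| query_md5 | string | MD5 hash of query text |
| query | List[int32] | List of query tokens |
| query_length | int32 | Number of query tokens |
| frequency_bucket | int32 | Monthly frequency of query (bucket) from 0 (high frequency) to 9 (low frequency) |
| n | int32 | Number of documents for current query, useful for padding |
| url_md5 | List[string] | MD5 hash of document URL, most reliable document identifier |
| text_md5 | List[string] | MD5 hash of document title and abstract |
| title | List[List[int32]] | List of tokens for document titles |
| abstract | List[List[int32]] | List of tokens for document abstracts |
| query_document_embedding | Tensor[Tensor[float16]] | BERT CLS token |
| label | Tensor[int32] | Relevance judgments on a scale from 0 (bad) to 4 (excellent) |
| bm25 | Tensor[float32] | BM25 score for documents |
| bm25_title | Tensor[float32] | BM25 score for document titles |
| bm25_abstract | Tensor[float32] | BM25 score for document abstracts |
| tf_idf | Tensor[float32] | TF-IDF score for documents |
| tf | Tensor[float32] | Term frequency for documents |
| idf | Tensor[float32] | Inverse document frequency for documents |
| ql_jelinek_mercer_short | Tensor[float32] | Query likelihood score using Jelinek-Mercer smoothing (alpha = 0.1) |
| ql_jelinek_mercer_long | Tensor[float32] | Query likelihood score using Jelinek-Mercer smoothing (alpha = 0.7) |
| ql_dirichlet | Tensor[float32] | Query likelihood score using Dirichlet smoothing (lambda = 128) |
| document_length | Tensor[int32] | Length of documents |
| title_length | Tensor[int32] | Length of document titles |
| abstract_length | Tensor[int32] | Length of document abstracts |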
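
The graded `label` column (0 to 4) can be used to score a ranking produced by any per-document feature, such as `bm25`. The source does not prescribe an evaluation metric, so the sketch below assumes nDCG@k with the common exponential gain `2^label - 1`; the function names and toy values are illustrative only.

```python
import torch


def dcg_at_k(labels_in_rank_order: torch.Tensor, k: int) -> torch.Tensor:
    """Discounted cumulative gain over the top-k labels (exponential gain)."""
    labels = labels_in_rank_order[:k].float()
    gains = 2.0 ** labels - 1.0
    # Rank r (1-based) is discounted by log2(r + 1).
    discounts = torch.log2(torch.arange(len(labels), dtype=torch.float32) + 2.0)
    return (gains / discounts).sum()


def ndcg_at_k(scores: torch.Tensor, labels: torch.Tensor, k: int = 10) -> torch.Tensor:
    """nDCG@k for a single query: rank documents by descending score."""
    order = torch.argsort(scores, descending=True)
    ideal = torch.sort(labels, descending=True).values
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0:
        # No relevant documents for this query: define nDCG as 0.
        return torch.tensor(0.0)
    return dcg_at_k(labels[order], k) / ideal_dcg


# Toy query with three documents (hypothetical values).
labels = torch.tensor([0, 4, 2])
good_scores = torch.tensor([1.0, 3.0, 2.0])  # ranks the label-4 document first
bad_scores = torch.tensor([3.0, 1.0, 2.0])   # ranks the label-0 document first

ndcg_good = ndcg_at_k(good_scores, labels)
ndcg_bad = ndcg_at_k(bad_scores, labels)
```

For a padded batch, each query should be sliced to its first `n` documents before calling `ndcg_at_k`, so that padding is not ranked.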

Example PyTorch collate function

Each sample is a single query with multiple documents. The following example shows how to create a batch containing multiple queries with varying numbers of documents by applying padding:

```Python
import torch
from typing import List
from collections import defaultdict
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader


def collate_clicks(samples: List):
    batch = defaultdict(list)

    # Collect the selected features of every query in the batch.
    for sample in samples:
        batch["query_document_embedding"].append(sample["query_document_embedding"])
        batch["position"].append(sample["position"])
        batch["click"].append(sample["click"])
        batch["n"].append(sample["n"])

    # Zero-pad per-document features to the longest query in the batch.
    return {
        "query_document_embedding": pad_sequence(
            batch["query_document_embedding"], batch_first=True
        ),
        "position": pad_sequence(batch["position"], batch_first=True),
        "click": pad_sequence(batch["click"], batch_first=True),
        "n": torch.tensor(batch["n"]),
    }


loader = DataLoader(dataset, collate_fn=collate_clicks, batch_size=16)
```
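
The same padding pattern can be applied to the expert annotation dataset by selecting `label` instead of `click` and `position`. The function below is a sketch under that assumption, not code from the dataset card; `collate_annotations` and the toy samples are hypothetical. Note that zero-padding makes a padded label indistinguishable from a real label of 0 (bad), so `n` is kept in the batch to identify real documents.

```python
from collections import defaultdict
from typing import Dict, List

import torch
from torch.nn.utils.rnn import pad_sequence


def collate_annotations(samples: List[Dict]) -> Dict[str, torch.Tensor]:
    batch = defaultdict(list)

    # Select only the features needed for evaluation.
    for sample in samples:
        batch["query_document_embedding"].append(sample["query_document_embedding"])
        batch["label"].append(sample["label"])
        batch["n"].append(sample["n"])

    # Zero-pad per-document features to the longest query in the batch.
    return {
        "query_document_embedding": pad_sequence(
            batch["query_document_embedding"], batch_first=True
        ),
        "label": pad_sequence(batch["label"], batch_first=True),
        "n": torch.tensor(batch["n"]),
    }


# Toy samples (hypothetical values) standing in for rows of the annotation split.
samples = [
    {
        "query_document_embedding": torch.zeros(3, 768, dtype=torch.float16),
        "label": torch.tensor([0, 4, 2]),
        "n": 3,
    },
    {
        "query_document_embedding": torch.zeros(1, 768, dtype=torch.float16),
        "label": torch.tensor([1]),
        "n": 1,
    },
]

out = collate_annotations(samples)
```

With the real dataset, the function would be passed to a `DataLoader` through its `collate_fn` argument, in the same way as `collate_clicks`.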
