
philipphager/baidu-ultr_baidu-mlm-ctr

Source: Hugging Face. Updated 2024-02-01, indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/philipphager/baidu-ultr_baidu-mlm-ctr
Description:
---
license: cc-by-nc-4.0
viewer: false
---

# Baidu ULTR Dataset - Baidu BERT-12l-12h

Query-document vectors and clicks for a subset of the [Baidu Unbiased Learning to Rank dataset](https://arxiv.org/abs/2207.03051). This dataset uses the BERT cross-encoder with 12 layers from Baidu released in the [official starter-kit](https://github.com/ChuXiaokai/baidu_ultr_dataset/) to compute query-document vectors (768 dims).

## Setup

1. Install Hugging Face [datasets](https://huggingface.co/docs/datasets/installation)
2. Install [pandas](https://github.com/pandas-dev/pandas) and [pyarrow](https://arrow.apache.org/docs/python/index.html): `pip install pandas pyarrow`
3. Optionally, you might need to install a [pyarrow-hotfix](https://github.com/pitrou/pyarrow-hotfix) if you cannot install `pyarrow >= 14.0.1`
4. You can now use the dataset as described below.

## Load train / test click dataset:

```Python
from datasets import load_dataset

dataset = load_dataset(
    "philipphager/baidu-ultr_baidu-mlm-ctr",
    name="clicks",
    split="train",  # ["train", "test"]
    cache_dir="~/.cache/huggingface",
)

# Choose the output format of each row.
dataset.set_format("torch")  # [None, "numpy", "torch", "tensorflow", "pandas", "arrow"]
```

## Load expert annotations:

```Python
from datasets import load_dataset

dataset = load_dataset(
    "philipphager/baidu-ultr_baidu-mlm-ctr",
    name="annotations",
    split="test",  # the annotations are loaded from the test split
    cache_dir="~/.cache/huggingface",
)

# Choose the output format of each row.
dataset.set_format("torch")  # [None, "numpy", "torch", "tensorflow", "pandas", "arrow"]
```

## Available features

Each row of the click / annotation dataset contains the following attributes. Use a custom `collate_fn` to select specific features (see below):

### Click dataset

| name | dtype | description |
|------------------------------|----------------|-------------|
| query_id | string | Baidu query_id |
| query_md5 | string | MD5 hash of query text |
| query | List[int32] | List of query tokens |
| query_length | int32 | Number of query tokens |
| n | int32 | Number of documents for current query, useful for padding |
| url_md5 | List[string] | MD5 hash of document URL, most reliable document identifier |
| text_md5 | List[string] | MD5 hash of document title and abstract |
| title | List[List[int32]] | List of tokens for document titles |
| abstract | List[List[int32]] | List of tokens for document abstracts |
| query_document_embedding | Tensor[Tensor[float16]] | BERT CLS token |
| click | Tensor[int32] | Click / no click on a document |
| position | Tensor[int32] | Position in ranking (does not always match original item position) |
| media_type | Tensor[int32] | Document type (label encoding recommended as IDs do not occupy a continuous integer range) |
| displayed_time | Tensor[float32] | Seconds a document was displayed on the screen |
| serp_height | Tensor[int32] | Pixel height of a document on the screen |
| slipoff_count_after_click | Tensor[int32] | Number of times a document was scrolled off the screen after previously clicking on it |
| bm25 | Tensor[float32] | BM25 score for documents |
| bm25_title | Tensor[float32] | BM25 score for document titles |
| bm25_abstract | Tensor[float32] | BM25 score for document abstracts |
| tf_idf | Tensor[float32] | TF-IDF score for documents |
| tf | Tensor[float32] | Term frequency for documents |
| idf | Tensor[float32] | Inverse document frequency for documents |
| ql_jelinek_mercer_short | Tensor[float32] | Query likelihood score for documents using Jelinek-Mercer smoothing (alpha = 0.1) |
| ql_jelinek_mercer_long | Tensor[float32] | Query likelihood score for documents using Jelinek-Mercer smoothing (alpha = 0.7) |
| ql_dirichlet | Tensor[float32] | Query likelihood score for documents using Dirichlet smoothing (lambda = 128) |
| document_length | Tensor[int32] | Length of documents |
| title_length | Tensor[int32] | Length of document titles |
| abstract_length | Tensor[int32] | Length of document abstracts |

### Expert annotation dataset

| name | dtype | description |
|------------------------------|----------------|-------------|
| query_id | string | Baidu query_id |
| query_md5 | string | MD5 hash of query text |
| query | List[int32] | List of query tokens |
| query_length | int32 | Number of query tokens |
| frequency_bucket | int32 | Monthly frequency of query (bucket) from 0 (high frequency) to 9 (low frequency) |
| n | int32 | Number of documents for current query, useful for padding |
| url_md5 | List[string] | MD5 hash of document URL, most reliable document identifier |
| text_md5 | List[string] | MD5 hash of document title and abstract |
| title | List[List[int32]] | List of tokens for document titles |
| abstract | List[List[int32]] | List of tokens for document abstracts |
| query_document_embedding | Tensor[Tensor[float16]] | BERT CLS token |
| label | Tensor[int32] | Relevance judgments on a scale from 0 (bad) to 4 (excellent) |
| bm25 | Tensor[float32] | BM25 score for documents |
| bm25_title | Tensor[float32] | BM25 score for document titles |
| bm25_abstract | Tensor[float32] | BM25 score for document abstracts |
| tf_idf | Tensor[float32] | TF-IDF score for documents |
| tf | Tensor[float32] | Term frequency for documents |
| idf | Tensor[float32] | Inverse document frequency for documents |
| ql_jelinek_mercer_short | Tensor[float32] | Query likelihood score for documents using Jelinek-Mercer smoothing (alpha = 0.1) |
| ql_jelinek_mercer_long | Tensor[float32] | Query likelihood score for documents using Jelinek-Mercer smoothing (alpha = 0.7) |
| ql_dirichlet | Tensor[float32] | Query likelihood score for documents using Dirichlet smoothing (lambda = 128) |
| document_length | Tensor[int32] | Length of documents |
| title_length | Tensor[int32] | Length of document titles |
| abstract_length | Tensor[int32] | Length of document abstracts |

## Example PyTorch collate function

Each sample in the dataset is a single query with multiple documents. The following example demonstrates how to create a batch containing multiple queries with varying numbers of documents by applying padding:

```Python
import torch
from typing import List
from collections import defaultdict
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader


def collate_clicks(samples: List):
    batch = defaultdict(list)

    # Collect the selected features of every query in the batch.
    for sample in samples:
        batch["query_document_embedding"].append(sample["query_document_embedding"])
        batch["position"].append(sample["position"])
        batch["click"].append(sample["click"])
        batch["n"].append(sample["n"])

    # Pad per-document features to the largest number of documents in the batch.
    return {
        "query_document_embedding": pad_sequence(
            batch["query_document_embedding"], batch_first=True
        ),
        "position": pad_sequence(batch["position"], batch_first=True),
        "click": pad_sequence(batch["click"], batch_first=True),
        "n": torch.tensor(batch["n"]),
    }


loader = DataLoader(dataset, collate_fn=collate_clicks, batch_size=16)
```
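To make the padding step concrete, the following dependency-free sketch mimics what `pad_sequence(..., batch_first=True)` does to per-document features and also derives a boolean mask marking the real documents of each query. It is an illustration under assumptions, not part of the dataset's tooling: `pad_and_mask` is a hypothetical helper, plain Python lists stand in for tensors, the click values are made up, and the padding value of 0 mirrors the default of `pad_sequence`.

```python
from typing import List, Sequence, Tuple


def pad_and_mask(
    rows: Sequence[Sequence[int]], pad_value: int = 0
) -> Tuple[List[List[int]], List[List[bool]]]:
    """Pad variable-length rows to a common length and mark the real entries."""
    max_len = max((len(row) for row in rows), default=0)
    padded, mask = [], []
    for row in rows:
        n = len(row)  # number of documents for this query (the `n` feature)
        padded.append(list(row) + [pad_value] * (max_len - n))
        mask.append([True] * n + [False] * (max_len - n))
    return padded, mask


# Two toy queries with 3 and 1 documents (made-up click values).
clicks = [[0, 1, 0], [1]]
padded_clicks, mask = pad_and_mask(clicks)
print(padded_clicks)  # [[0, 1, 0], [1, 0, 0]]
print(mask)           # [[True, True, True], [True, False, False]]
```

Padded clicks look the same as real non-clicks (both are 0), so a mask like this one, which can equally be derived from `n`, is typically needed before computing a loss or metric over a batch.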
Provider:
philipphager
Summary of original information

Baidu ULTR Dataset - Baidu BERT-12l-12h

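Dataset overview

This dataset contains query-document vectors and click data for a subset of the Baidu Unbiased Learning to Rank dataset. The query-document vectors (768 dims) are computed with the 12-layer BERT cross-encoder released by Baidu.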

Data loading

Load the train / test click dataset

```Python
from datasets import load_dataset

dataset = load_dataset(
    "philipphager/baidu-ultr_baidu-mlm-ctr",
    name="clicks",
    split="train",  # ["train", "test"]
    cache_dir="~/.cache/huggingface",
)

dataset.set_format("torch")  # [None, "numpy", "torch", "tensorflow", "pandas", "arrow"]
```

Load the expert annotation dataset

```Python
from datasets import load_dataset

dataset = load_dataset(
    "philipphager/baidu-ultr_baidu-mlm-ctr",
    name="annotations",
    split="test",
    cache_dir="~/.cache/huggingface",
)

dataset.set_format("torch")  # [None, "numpy", "torch", "tensorflow", "pandas", "arrow"]
```

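Available features

Click dataset

| name | dtype | description |
|------------------------------|----------------|-------------|
| query_id | string | Baidu query ID |
| query_md5 | string | MD5 hash of query text |
| query | List[int32] | List of query tokens |
| query_length | int32 | Number of query tokens |
| n | int32 | Number of documents for current query, useful for padding |
| url_md5 | List[string] | MD5 hash of document URL, most reliable document identifier |
| text_md5 | List[string] | MD5 hash of document title and abstract |
| title | List[List[int32]] | List of tokens for document titles |
| abstract | List[List[int32]] | List of tokens for document abstracts |
| query_document_embedding | Tensor[Tensor[float16]] | BERT CLS token |
| click | Tensor[int32] | Click / no click on a document |
| position | Tensor[int32] | Position in ranking (does not always match original item position) |
| media_type | Tensor[int32] | Document type (label encoding recommended, as the IDs are not contiguous) |
| displayed_time | Tensor[float32] | Seconds a document was displayed on the screen |
| serp_height | Tensor[int32] | Pixel height of a document on the screen |
| slipoff_count_after_click | Tensor[int32] | Number of times a document was scrolled off the screen after being clicked |
| bm25 | Tensor[float32] | BM25 score for documents |
| bm25_title | Tensor[float32] | BM25 score for document titles |
| bm25_abstract | Tensor[float32] | BM25 score for document abstracts |
| tf_idf | Tensor[float32] | TF-IDF score for documents |
| tf | Tensor[float32] | Term frequency for documents |
| idf | Tensor[float32] | Inverse document frequency for documents |
| ql_jelinek_mercer_short | Tensor[float32] | Query likelihood score using Jelinek-Mercer smoothing (alpha = 0.1) |
| ql_jelinek_mercer_long | Tensor[float32] | Query likelihood score using Jelinek-Mercer smoothing (alpha = 0.7) |
| ql_dirichlet | Tensor[float32] | Query likelihood score using Dirichlet smoothing (lambda = 128) |
| document_length | Tensor[int32] | Length of documents |
| title_length | Tensor[int32] | Length of document titles |
| abstract_length | Tensor[int32] | Length of document abstracts |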
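The recommendation to label-encode `media_type` can be sketched as follows: raw IDs are mapped to a dense range so they can index an embedding table or a one-hot vector. This is a minimal sketch under assumptions. The helper names `fit_label_encoder` and `encode` are hypothetical, the raw ID values are invented for illustration, and reserving index 0 for padding and unseen IDs is a design choice of this example rather than something the dataset prescribes.

```python
from typing import Dict, Iterable, List


def fit_label_encoder(raw_ids: Iterable[int]) -> Dict[int, int]:
    """Map each distinct raw ID to a contiguous index starting at 1.

    Index 0 is reserved for padding and for IDs not seen during fitting.
    """
    return {raw: idx for idx, raw in enumerate(sorted(set(raw_ids)), start=1)}


def encode(raw_ids: Iterable[int], mapping: Dict[int, int]) -> List[int]:
    """Replace raw IDs by their dense index, falling back to 0 if unknown."""
    return [mapping.get(raw, 0) for raw in raw_ids]


# Made-up raw media_type values, for illustration only.
train_media_types = [21, 0, 1043, 21, 7]
mapping = fit_label_encoder(train_media_types)
print(mapping)                          # {0: 1, 7: 2, 21: 3, 1043: 4}
print(encode([1043, 7, 999], mapping))  # [4, 2, 0]
```

In practice the mapping would be fitted once on the training split and reused unchanged for the test split, so that the same raw ID always receives the same index.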

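Expert annotation dataset

| name | dtype | description |
|------------------------------|----------------|-------------|
| query_id | string | Baidu query ID |
| query_md5 | string | MD5 hash of query text |
| query | List[int32] | List of query tokens |
| query_length | int32 | Number of query tokens |
| frequency_bucket | int32 | Monthly frequency of query (bucket) from 0 (high frequency) to 9 (low frequency) |
| n | int32 | Number of documents for current query, useful for padding |
| url_md5 | List[string] | MD5 hash of document URL, most reliable document identifier |
| text_md5 | List[string] | MD5 hash of document title and abstract |
| title | List[List[int32]] | List of tokens for document titles |
| abstract | List[List[int32]] | List of tokens for document abstracts |
| query_document_embedding | Tensor[Tensor[float16]] | BERT CLS token |
| label | Tensor[int32] | Relevance judgments on a scale from 0 (bad) to 4 (excellent) |
| bm25 | Tensor[float32] | BM25 score for documents |
| bm25_title | Tensor[float32] | BM25 score for document titles |
| bm25_abstract | Tensor[float32] | BM25 score for document abstracts |
| tf_idf | Tensor[float32] | TF-IDF score for documents |
| tf | Tensor[float32] | Term frequency for documents |
| idf | Tensor[float32] | Inverse document frequency for documents |
| ql_jelinek_mercer_short | Tensor[float32] | Query likelihood score using Jelinek-Mercer smoothing (alpha = 0.1) |
| ql_jelinek_mercer_long | Tensor[float32] | Query likelihood score using Jelinek-Mercer smoothing (alpha = 0.7) |
| ql_dirichlet | Tensor[float32] | Query likelihood score using Dirichlet smoothing (lambda = 128) |
| document_length | Tensor[int32] | Length of documents |
| title_length | Tensor[int32] | Length of document titles |
| abstract_length | Tensor[int32] | Length of document abstracts |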
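The feature tables list three query likelihood scores but do not say how they were computed. As a rough guide to what such features usually mean, the sketch below implements the textbook formulas for Jelinek-Mercer and Dirichlet smoothing over integer tokens. Everything specific here is an assumption: the function names are hypothetical, `alpha` is taken to weight the collection model, the table's Dirichlet "lambda = 128" is taken to be the pseudo-count often written as mu, and tokens that never occur in the collection are skipped. Scores from this sketch will not necessarily match the values stored in the dataset.

```python
import math
from collections import Counter
from typing import Sequence


def ql_jelinek_mercer(
    query: Sequence[int], doc: Sequence[int], collection: Sequence[int], alpha: float
) -> float:
    """Sum over query tokens of log((1 - alpha) * p(w|doc) + alpha * p(w|collection))."""
    doc_tf, coll_tf = Counter(doc), Counter(collection)
    score = 0.0
    for w in query:
        p_doc = doc_tf[w] / len(doc) if doc else 0.0
        p_coll = coll_tf[w] / len(collection)
        p = (1 - alpha) * p_doc + alpha * p_coll
        if p > 0:  # assumption: skip tokens unseen in the whole collection
            score += math.log(p)
    return score


def ql_dirichlet(
    query: Sequence[int], doc: Sequence[int], collection: Sequence[int], mu: float
) -> float:
    """Sum over query tokens of log((tf(w, doc) + mu * p(w|collection)) / (|doc| + mu))."""
    doc_tf, coll_tf = Counter(doc), Counter(collection)
    score = 0.0
    for w in query:
        p_coll = coll_tf[w] / len(collection)
        p = (doc_tf[w] + mu * p_coll) / (len(doc) + mu)
        if p > 0:  # assumption: skip tokens unseen in the whole collection
            score += math.log(p)
    return score


# Toy token IDs, for illustration only.
query = [1, 2]
doc = [1, 1, 3]
collection = [1, 1, 3, 2, 4, 4]

jm_short = ql_jelinek_mercer(query, doc, collection, alpha=0.1)
jm_long = ql_jelinek_mercer(query, doc, collection, alpha=0.7)
dirichlet = ql_dirichlet(query, doc, collection, mu=128)
print(jm_short, jm_long, dirichlet)
```

Under these assumptions, a larger `alpha` moves the score toward the collection statistics, which makes the score less sensitive to whether an individual query token appears in the document.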

Example PyTorch collate function

Each sample is a single query with multiple documents. The following example shows how to use padding to build a batch of multiple queries with varying numbers of documents:

```Python
import torch
from typing import List
from collections import defaultdict
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader


def collate_clicks(samples: List):
    batch = defaultdict(lambda: [])

    for sample in samples:
        batch["query_document_embedding"].append(sample["query_document_embedding"])
        batch["position"].append(sample["position"])
        batch["click"].append(sample["click"])
        batch["n"].append(sample["n"])

    return {
        "query_document_embedding": pad_sequence(
            batch["query_document_embedding"], batch_first=True
        ),
        "position": pad_sequence(batch["position"], batch_first=True),
        "click": pad_sequence(batch["click"], batch_first=True),
        "n": torch.tensor(batch["n"]),
    }


loader = DataLoader(dataset, collate_fn=collate_clicks, batch_size=16)
```
