Mozilla/history-search-retrieval
Source: Hugging Face. Updated 2026-04-13; indexed 2026-04-26.
Download link: https://hf-mirror.com/datasets/Mozilla/history-search-retrieval
---
viewer: true
configs:
- config_name: docs
  data_files: data/v1/docs.parquet
- config_name: queries
  data_files: data/v1/queries.parquet
- config_name: qrels
  data_files: data/v1/qrels.parquet
license: mit
language:
- en
---
# Semantic History (Synthetic)
**Semantic History** is a synthetic dataset for **semantic history search** research.
It ships as three **normalized** Parquet tables:
| Config    | Description                                                                                                          |
|-----------|----------------------------------------------------------------------------------------------------------------------|
| `docs`    | Browsing-history records with `url`, `title`, `description`, `frecency`, `last_visit_date`, and profile/variant tags. |
| `queries` | One row per query, tagged by profile and by temporal/multi-label flags.                                              |
| `qrels`   | Relevance pairs linking `queries` ↔ `docs` with `rank` and `relevance`.                                              |
All content is **synthetic** (no real browsing logs).
---
## Temporal Variant
The dataset includes a **temporal** slice designed to test retrieval with time-aware queries (e.g., *“yesterday”*, *“last week”*, *“on Black Friday”*).
Temporal queries in `queries` are tagged with `variant="temporal"` and carry a per-query reference time `ref_datetime_iso`, which is the anchor used to resolve relative phrases (e.g., “yesterday” → specific date range).
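To illustrate how a reference time anchors a relative phrase, the sketch below resolves "yesterday" and "last week" into half-open datetime ranges. The helper name `resolve_relative_phrase`, the two supported phrases, the Monday-start week, and the example timestamp are assumptions made for illustration; the dataset's actual locale rules live in `temporal_context.json` and may differ.

```python
from datetime import datetime, timedelta


def resolve_relative_phrase(phrase: str, ref_datetime_iso: str) -> tuple[datetime, datetime]:
    """Return a half-open [start, end) range for a relative phrase.

    Hypothetical helper: handles only two phrases and assumes that
    weeks start on Monday.
    """
    ref = datetime.fromisoformat(ref_datetime_iso)
    # Midnight at the start of the reference day (timezone info, if any, is kept).
    today = ref.replace(hour=0, minute=0, second=0, microsecond=0)

    if phrase == "yesterday":
        return today - timedelta(days=1), today
    if phrase == "last week":
        # Monday of the reference week, then step back one full week.
        this_monday = today - timedelta(days=today.weekday())
        return this_monday - timedelta(days=7), this_monday
    raise ValueError(f"unsupported phrase: {phrase!r}")


# Example with a made-up reference time.
start, end = resolve_relative_phrase("yesterday", "2025-03-12T15:30:00")
print(start.isoformat(), end.isoformat())
```

A half-open range avoids double-counting a visit that falls exactly on midnight when adjacent ranges are compared.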
For reproducibility, we also publish the **raw temporal profiles** (one folder per profile) that were used to construct the Parquet release. Each temporal profile contains:
- `history.csv` — history rows with `last_visit_date` (µs epoch) and `frecency`
- `query.csv` — temporal queries and expected matches
- `temporal_context.json` — reference time and locale rules (e.g., timezone, weekend, date format)
> See the detailed temporal documentation and per-locale notes here:
> https://huggingface.co/datasets/Mozilla/history-search-retrieval/blob/main/raw/profiles/temporal/README.md
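Because `last_visit_date` is stored as a microsecond epoch value, it has to be converted before it can be compared with a date range. The following is a minimal sketch, assuming the value counts microseconds since the Unix epoch in UTC; the helper names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def visit_time(last_visit_date_us: int) -> datetime:
    """Convert a microsecond epoch value to a timezone-aware UTC datetime."""
    # Integer microseconds via timedelta avoid float rounding on large values.
    return UNIX_EPOCH + timedelta(microseconds=last_visit_date_us)


def visited_within(last_visit_date_us: int, start: datetime, end: datetime) -> bool:
    """True if the visit falls inside the half-open range [start, end)."""
    return start <= visit_time(last_visit_date_us) < end


# Example with a made-up timestamp.
print(visit_time(1_700_000_000_000_000).isoformat())
```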
---
## Motivation
- **User-oriented IR** (per-profile history retrieval)
- **Temporal-aware retrieval** (e.g., profile histories with a reference time)
- **Embedding & ranking evaluation** on synthetic history traces
---
## Data Layout
```
data/v1/
├── docs.parquet
├── queries.parquet
└── qrels.parquet
```
**Columns**
- **docs**:
`doc_id, url, title, description, frecency, last_visit_date, profile, profile_id, variant`
- **queries**:
`query_id, search_query, profile, profile_id, is_temporal, is_multi, variant, ref_datetime_iso`
- **qrels**:
`query_id, doc_id, profile_id, relevance, rank, variant`
> `profile_id` is a stable, hashed identifier per profile folder; `variant` can be `temporal` or other configured variants; `is_multi` indicates multi-label queries.
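Since the three tables are normalized, a quick referential check can confirm that every `qrels` row points at an existing query and document. The sketch below works on any iterable of dict-like rows (iterating an HF `Dataset` yields dicts, as does `DataFrame.to_dict("records")`). The function name is hypothetical, and it assumes `doc_id` and `query_id` are unique across profiles, which the card does not state.

```python
from typing import Iterable, Mapping


def find_dangling_qrels(docs: Iterable[Mapping],
                        queries: Iterable[Mapping],
                        qrels: Iterable[Mapping]) -> list:
    """Return the qrels rows whose query_id or doc_id has no matching row."""
    doc_ids = {row["doc_id"] for row in docs}
    query_ids = {row["query_id"] for row in queries}
    return [row for row in qrels
            if row["doc_id"] not in doc_ids or row["query_id"] not in query_ids]


# Toy rows for illustration only; these ids are not from the dataset.
toy_docs = [{"doc_id": "d1"}, {"doc_id": "d2"}]
toy_queries = [{"query_id": "q1"}]
toy_qrels = [
    {"query_id": "q1", "doc_id": "d1"},
    {"query_id": "q1", "doc_id": "d9"},  # refers to a doc that does not exist
]
print(find_dangling_qrels(toy_docs, toy_queries, toy_qrels))
```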
---
## Firefox Places Background (schema context)
The `docs` table mimics the schema of Firefox’s Places database:
- `moz_places` table schema:
https://searchfox.org/firefox-main/source/toolkit/components/places/nsPlacesTables.h
- Length limits referenced in Places utils:
`title` ≤ **4096** chars, `description` ≤ **256**
https://searchfox.org/firefox-main/source/toolkit/components/places/PlacesUtils.sys.mjs#162
- `title` from DOM `<title>`:
https://searchfox.org/firefox-main/source/dom/svg/SVGTitleElement.cpp
- `description` from prioritized page metadata:
https://searchfox.org/firefox-main/source/toolkit/actors/ContentMetaChild.sys.mjs#12
> **Embedding text**: we use **`title + description`**.
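One way to build the embedding text while respecting the length limits above is sketched below. The single-space separator, the handling of missing values, and the function name are assumptions; the card only states that `title + description` is used.

```python
from typing import Optional

TITLE_MAX_CHARS = 4096        # Places limit quoted above
DESCRIPTION_MAX_CHARS = 256   # Places limit quoted above


def build_embedding_text(title: Optional[str],
                         description: Optional[str],
                         sep: str = " ") -> str:
    """Join a truncated title and description, skipping empty parts."""
    title = (title or "").strip()[:TITLE_MAX_CHARS]
    description = (description or "").strip()[:DESCRIPTION_MAX_CHARS]
    # Assumed behavior: a missing field contributes nothing, not even a separator.
    return sep.join(part for part in (title, description) if part)


print(build_embedding_text("Sourdough starter guide", "How to feed and store a starter."))
```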
---
## How to Load (HF Datasets)
```python
from datasets import load_dataset
dataset_id = "Mozilla/history-search-retrieval"
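# Each config exposes a single "train" split; these are HF Dataset objects,
# named to match the snippets under "Common Operations".
docs = load_dataset(dataset_id, name="docs")["train"]
queries = load_dataset(dataset_id, name="queries")["train"]
qrels = load_dataset(dataset_id, name="qrels")["train"]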
```
---
## Common Operations
### HF Datasets Version
#### List available profiles
```python
profiles = (docs.to_pandas()[["profile_id","profile","variant"]]
            .drop_duplicates()
            .sort_values(["profile","variant"]))
```
#### Filter by profile
```python
pid = "262e49ec20c32c41"
p_docs = docs.filter(lambda x: x["profile_id"] == pid)
p_queries = queries.filter(lambda x: x["profile_id"] == pid)
p_qrels = qrels.filter(lambda x: x["profile_id"] == pid)
```
#### Temporal / multi-label slices
```python
q_temporal = queries.filter(lambda x: x["variant"] == "temporal")
q_multi = queries.filter(lambda x: x["is_multi"])
```
---
### Pandas Version
```python
import pandas as pd
from datasets import load_dataset
dataset_id = "Mozilla/history-search-retrieval"
docs_pd = load_dataset(dataset_id, name="docs")["train"].to_pandas()
queries_pd = load_dataset(dataset_id, name="queries")["train"].to_pandas()
qrels_pd = load_dataset(dataset_id, name="qrels")["train"].to_pandas()
q = queries_pd[["query_id","search_query","profile_id","variant","is_multi"]].set_index("query_id")
r = qrels_pd[["query_id","doc_id","rank","relevance"]].set_index("query_id")
d = docs_pd[["doc_id","url","title"]].set_index("doc_id")
# Reconstruct (query <-> doc/url) pairs
qr = r.join(q, how="inner").reset_index()
query_pairs = (qr.join(d, on="doc_id", how="left")
               .sort_values(["query_id","rank"])
               .reset_index(drop=True))
```
## Evaluation
This dataset is intended for retrieval evaluation.
Please see the [repository](https://github.com/mozilla/smart_search) for the evaluation scripts/notebooks and metric implementations:
- Precision@k, Recall@k, nDCG@k
- Reciprocal Rank (RR), Average Precision (AP)
- On-Topic Rate@k
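For orientation, the standard binary-relevance definitions of most of these metrics can be sketched in a few lines. This is not the repository's implementation, which may differ in details such as tie handling or gain functions, and On-Topic Rate@k is omitted because the card does not define it. The document ids in the example are made up.

```python
import math
from typing import Sequence, Set


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant docs that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for doc in ranked[:k] if doc in relevant) / len(relevant)


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Mean of precision values taken at each relevant hit."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """nDCG@k with binary gains and a log2 position discount."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc in enumerate(ranked[:k], start=1) if doc in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


ranked = ["d3", "d1", "d7", "d2"]   # system output, best first
relevant = {"d1", "d2"}             # doc_ids with relevance=1 for this query
print(precision_at_k(ranked, relevant, 2), recall_at_k(ranked, relevant, 2))
print(reciprocal_rank(ranked, relevant), average_precision(ranked, relevant))
print(round(ndcg_at_k(ranked, relevant, 4), 4))
```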
## Synthetic Data Generation
All profiles, queries, and qrels are **synthetic**.
The pipeline creates per-profile histories and LLM-judged relevance pairs from public English documents.
**Overview**
1. **Source**: [MS MARCO documents](https://microsoft.github.io/msmarco/Datasets.html) (`msmarco-docs.tsv`); keep the top 500k rows with `docid,url,title,body` (English).
2. **Normalize**: build a unified table with `url,title,description,topic,lang,domain,combined_text`:
   - `description` = first 300 characters of `body`
   - keep only titles that are 5 to 200 characters long
   - `combined_text = title + " " + description`
3. **Sample**: draw ~50k examples (English-only for this HF release).
4. **Profiles**: create 25 synthetic profiles; for each, sample 1k-5k items aligned with the profile's themes, assign a random `frecency` (100-5000), and assign incrementally increasing `last_visit_date` values.
5. **Queries & qrels**: generate profile-specific queries with an LLM; judge relevance over the profile history; export `qrels` with `rank` and `relevance=1`, plus a concise `query.csv` per profile.
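Steps 2 and 4 above can be sketched as follows. This is an illustration under stated assumptions rather than the pipeline's code (which is linked below): the function names, the use of the URL host as `domain`, and the random gap between visits are hypothetical, and the `topic` and `lang` fields are left out because the card does not say how they are derived.

```python
import random
from typing import Optional
from urllib.parse import urlparse


def normalize_row(row: dict) -> Optional[dict]:
    """Apply the title-length filter and build description/combined_text."""
    title = (row.get("title") or "").strip()
    if not 5 <= len(title) <= 200:
        return None  # title outside the 5-200 character window
    description = (row.get("body") or "")[:300]  # first 300 characters of body
    return {
        "url": row["url"],
        "title": title,
        "description": description,
        "domain": urlparse(row["url"]).netloc,  # assumption: domain = URL host
        "combined_text": title + " " + description,
    }


def assign_visit_fields(items: list, start_us: int, rng: random.Random,
                        max_gap_us: int = 3_600_000_000) -> list:
    """Attach a random frecency and a strictly increasing last_visit_date."""
    visited, current = [], start_us
    for item in items:
        # Assumption: each visit follows the previous one by a random gap
        # of at most one hour (expressed in microseconds).
        current += rng.randint(1, max_gap_us)
        visited.append({**item,
                        "frecency": rng.randint(100, 5000),
                        "last_visit_date": current})
    return visited


raw = {"url": "https://example.com/bread", "title": "Baking bread at home", "body": "Flour, water, salt. " * 30}
row = normalize_row(raw)
history = assign_visit_fields([row, row], start_us=1_700_000_000_000_000, rng=random.Random(0))
print(len(row["description"]), [h["last_visit_date"] for h in history])
```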
**Code** (full scripts & notebooks):
- https://github.com/mozilla/smart_search/tree/temporal_awareness/preprocessing/generate_profiles
- https://github.com/mozilla/smart_search/blob/temporal_awareness/notebooks/generate_history.ipynb
### Additional Dataset
We also include a second dataset built from publicly available synthetic histories:
- Source repo: https://github.com/komosny/synthetic-browsing-history
- Details/paper: https://pmc.ncbi.nlm.nih.gov/articles/PMC11754914/
Countries (English-focused): Australia, Canada, United Kingdom, United States.
**Preprocessing**
- Deduplicate by URL.
- Fetch title and description with priorities:
  - title: `<title>` → `meta[property=og:title]` → `meta[name=twitter:title]` → `<h1>`
  - description: `meta[name=description]` → `meta[property=og:description]` → `meta[name=twitter:description]` → `summary`
- Enforce Firefox-style limits: title ≤ 4096, description ≤ 256.
- Drop records with no title and no description.
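The priority-based fallback described above can be sketched with the standard-library HTML parser. The class and function names are hypothetical, whitespace collapsing is an added assumption, and the final `summary` fallback for descriptions is left out because the card does not define how it is computed. A record for which both values come back as `None` would be dropped.

```python
from html.parser import HTMLParser

# Lookup order taken from the priorities listed above.
TITLE_PRIORITY = ["<title>", "og:title", "twitter:title", "<h1>"]
DESCRIPTION_PRIORITY = ["description", "og:description", "twitter:description"]


class MetaCollector(HTMLParser):
    """Collect the first <title>, first <h1>, and meta name/property values."""

    def __init__(self):
        super().__init__()
        self.found = {}
        self._open_tag = None  # "title" or "h1" while inside that element

    def handle_starttag(self, tag, attrs):
        key = f"<{tag}>"
        if tag in ("title", "h1") and key not in self.found:
            self._open_tag = tag
            self.found[key] = ""
        elif tag == "meta":
            attr = dict(attrs)
            name = attr.get("property") or attr.get("name")
            content = attr.get("content")
            if name and content:
                # Keep the first occurrence of each meta key.
                self.found.setdefault(name.lower(), content)

    def handle_data(self, data):
        if self._open_tag:
            self.found[f"<{self._open_tag}>"] += data

    def handle_endtag(self, tag):
        if tag == self._open_tag:
            self._open_tag = None


def extract_title_description(html: str):
    """Return (title, description); each is None when no source has text."""
    parser = MetaCollector()
    parser.feed(html)
    parser.close()

    def first_non_empty(keys, limit):
        for key in keys:
            value = " ".join(parser.found.get(key, "").split())
            if value:
                return value[:limit]  # Firefox-style length limit
        return None

    return first_non_empty(TITLE_PRIORITY, 4096), first_non_empty(DESCRIPTION_PRIORITY, 256)


page = ('<html><head><meta property="og:title" content="OG title">'
        '<meta property="og:description" content="OG desc"></head>'
        '<body><h1>Heading</h1></body></html>')
print(extract_title_description(page))
```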
**Query Construction**
- For each profile, randomly sample 50 URLs.
- Use an LLM (gpt-5-mini) to generate 50 semantic search queries that should retrieve the given URL, conditioning on its title and description.
## Raw Layer
We maintain a reference `raw/` tree for reproducibility, but the canonical interface is the Parquet layer:
```
raw/profiles/<variant>/<single|multi-label>/<profile>/
├── history.csv
├── query.csv
└── temporal_context.json # only for temporal
```
> Use Parquet for all experiments.
> `raw/` is reference-only and ships no remote-execution loaders.
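For inspecting a local copy of the reference tree (not for running experiments), the layout above can be walked with `pathlib`. This is a minimal sketch: the function name and the returned field names are hypothetical, and it assumes every profile folder contains a `history.csv`.

```python
from pathlib import Path


def list_raw_profiles(raw_root) -> list:
    """List profile folders under <raw_root>/profiles/<variant>/<label-mode>/<profile>/."""
    profiles = []
    for history in sorted(Path(raw_root).glob("profiles/*/*/*/history.csv")):
        profile_dir = history.parent
        variant, label_mode, profile = profile_dir.parts[-3:]
        profiles.append({
            "variant": variant,
            "label_mode": label_mode,
            "profile": profile,
            # temporal_context.json is expected only for temporal profiles
            "has_temporal_context": (profile_dir / "temporal_context.json").is_file(),
        })
    return profiles
```

For example, `list_raw_profiles("raw")` would return one entry per profile folder found under a local `raw/` directory.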
---
Provider: Mozilla