Mozilla/history-search-retrieval

Name: Mozilla/history-search-retrieval
Creator: Mozilla
Published: 2026-04-13 16:13:13
License: not listed (the dataset card below declares MIT)

Source: Hugging Face. Updated 2026-04-13; indexed 2026-04-26.

Download link:

https://hf-mirror.com/datasets/Mozilla/history-search-retrieval


Description:

---
viewer: true
configs:
- config_name: docs
  data_files: data/v1/docs.parquet
- config_name: queries
  data_files: data/v1/queries.parquet
- config_name: qrels
  data_files: data/v1/qrels.parquet
license: mit
language:
- en
---

# Semantic History (Synthetic)

**Semantic History** is a synthetic dataset for **semantic history search** research. It ships as three **normalized** Parquet tables, each exposed as its own config:

| Config    | Description                                                                                         |
|-----------|-----------------------------------------------------------------------------------------------------|
| `docs`    | Search history records with `url`, `title`, `description`, `frecency`, `last_visit_date`, and tags. |
| `queries` | One row per query, tagged by profile and temporal/multi-label flags.                                |
| `qrels`   | Relevance pairs linking `queries` ↔ `docs` with `rank` and `relevance`.                             |

All content is **synthetic** (no real browsing logs).

---

## Temporal Variant

The dataset includes a **temporal** slice designed to test retrieval with time-aware queries (e.g., *“yesterday”*, *“last week”*, *“on Black Friday”*).

Temporal queries in `queries` are tagged with `variant="temporal"` and carry a per-query reference time `ref_datetime_iso`, which is the anchor used to resolve relative phrases (e.g., “yesterday” → specific date range).

For reproducibility, we also publish the **raw temporal profiles** (one folder per profile) that were used to construct the Parquet release.
Each temporal profile contains:

- `history.csv`: history rows with `last_visit_date` (µs epoch) and `frecency`
- `query.csv`: temporal queries and expected matches
- `temporal_context.json`: reference time and locale rules (e.g., timezone, weekend, date format)

> See the detailed temporal documentation and per-locale notes here:
> https://huggingface.co/datasets/Mozilla/history-search-retrieval/blob/main/raw/profiles/temporal/README.md

---

## Motivation

- **User-oriented IR** (per-profile history retrieval)
- **Temporal-aware retrieval** (e.g., profile histories with a reference time)
- **Embedding & ranking evaluation** on synthetic history traces

---

## Data Layout

```
data/v1/
├── docs.parquet
├── queries.parquet
└── qrels.parquet
```

**Columns**

- **docs**: `doc_id, url, title, description, frecency, last_visit_date, profile, profile_id, variant`
- **queries**: `query_id, search_query, profile, profile_id, is_temporal, is_multi, variant, ref_datetime_iso`
- **qrels**: `query_id, doc_id, profile_id, relevance, rank, variant`

> `profile_id` is a stable, hashed identifier per profile folder; `variant` can be `temporal` or other configured variants; `is_multi` indicates multi-label queries.

---

## Firefox Places Background (schema context)

The dataset `docs` mimics Firefox’s Places DB:

- `moz_places` table schema:
  https://searchfox.org/firefox-main/source/toolkit/components/places/nsPlacesTables.h
- Length limits referenced in Places utils: `title` ≤ **4096** chars, `description` ≤ **256** chars
  https://searchfox.org/firefox-main/source/toolkit/components/places/PlacesUtils.sys.mjs#162
- `title` from DOM `<title>`:
  https://searchfox.org/firefox-main/source/dom/svg/SVGTitleElement.cpp
- `description` from prioritized page metadata:
  https://searchfox.org/firefox-main/source/toolkit/actors/ContentMetaChild.sys.mjs#12

> **Embedding text**: we use **`title + description`**.
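As a minimal sketch of the embedding text described above, the helper below joins `title` and `description` after applying the Places-style length caps. The function name `build_embedding_text`, the single-space separator, and the treatment of missing values are assumptions made for illustration; the card itself only states that `title + description` is used.

```python
from typing import Optional

# Length caps quoted in the section above (Firefox Places-style limits).
MAX_TITLE_CHARS = 4096
MAX_DESCRIPTION_CHARS = 256


def build_embedding_text(title: Optional[str], description: Optional[str]) -> str:
    """Join title and description into one string for an embedding model.

    Hypothetical helper: the separator and the None handling are assumptions,
    not the dataset authors' implementation. Inputs must be str or None.
    """
    title = (title or "").strip()[:MAX_TITLE_CHARS]
    description = (description or "").strip()[:MAX_DESCRIPTION_CHARS]
    # Skip empty parts so the result never has a leading or trailing separator.
    return " ".join(part for part in (title, description) if part)


if __name__ == "__main__":
    print(build_embedding_text("Firefox Places", "How browsing history is stored"))
```

When applying this to a pandas frame, missing values usually arrive as `NaN` rather than `None`, so a `fillna("")` on both columns beforehand would be needed.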
---

## How to Load (HF Datasets)

```python
from datasets import load_dataset

dataset_id = "Mozilla/history-search-retrieval"

# Each table is a separate config; index the "train" split of each one.
docs = load_dataset(dataset_id, name="docs")["train"]
queries = load_dataset(dataset_id, name="queries")["train"]
qrels = load_dataset(dataset_id, name="qrels")["train"]
```

---

## Common Operations

### HF Datasets Version

The snippets in this subsection use the `docs`, `queries`, and `qrels` objects loaded above.

#### List available profiles

```python
# One row per (profile_id, profile, variant) combination.
profiles = (docs.to_pandas()[["profile_id", "profile", "variant"]]
            .drop_duplicates()
            .sort_values(["profile", "variant"]))
```

#### Filter by profile

```python
pid = "262e49ec20c32c41"
p_docs = docs.filter(lambda x: x["profile_id"] == pid)
p_queries = queries.filter(lambda x: x["profile_id"] == pid)
p_qrels = qrels.filter(lambda x: x["profile_id"] == pid)
```

#### Temporal / multi-label slices

```python
q_temporal = queries.filter(lambda x: x["variant"] == "temporal")
q_multi = queries.filter(lambda x: x["is_multi"])
```

---

### Pandas Version

```python
import pandas as pd
from datasets import load_dataset

dataset_id = "Mozilla/history-search-retrieval"
docs_pd = load_dataset(dataset_id, name="docs")["train"].to_pandas()
queries_pd = load_dataset(dataset_id, name="queries")["train"].to_pandas()
qrels_pd = load_dataset(dataset_id, name="qrels")["train"].to_pandas()

# Index each table by its join key.
q = queries_pd[["query_id", "search_query", "profile_id", "variant", "is_multi"]].set_index("query_id")
r = qrels_pd[["query_id", "doc_id", "rank", "relevance"]].set_index("query_id")
d = docs_pd[["doc_id", "url", "title"]].set_index("doc_id")

# Reconstruct (query <-> doc/url) pairs
qr = r.join(q, how="inner").reset_index()
query_pairs = (qr.join(d, on="doc_id", how="left")
               .sort_values(["query_id", "rank"])
               .reset_index(drop=True))
```

## Evaluation

This dataset is intended for retrieval evaluation.
Please see the [repository](https://github.com/mozilla/smart_search) for the evaluation scripts/notebooks and metric implementations:

- Precision@k, Recall@k, nDCG@k
- Reciprocal Rank (RR), Average Precision (AP)
- On-Topic Rate@k

## Synthetic Data Generation

All profiles, queries, and qrels are **synthetic**. The pipeline creates per-profile histories and LLM-judged relevance pairs from public English documents.

**Overview**

1. **Source**: [MS MARCO documents](https://microsoft.github.io/msmarco/Datasets.html) (msmarco-docs.tsv); keep top 500k rows with `docid,url,title,body` (English).
2. **Normalize**: build a unified table with `url,title,description,topic,lang,domain,combined_text`
   - description = first 300 chars of `body`
   - filter titles to length 5-200
   - `combined_text = title + " " + description`
3. **Sample**: draw ~50k examples (English-only for this HF release).
4. **Profiles**: create 25 synthetic profiles; for each, sample 1k-5k items aligned with profile themes; set random `frecency` (100-5000) and incremental `last_visit_date`.
5. **Queries & qrels**: generate profile-specific queries with an LLM; judge relevance over the profile history; export `qrels` with `rank` and `relevance=1`, plus a concise `query.csv` per profile.

**Code** (full scripts & notebooks):

- https://github.com/mozilla/smart_search/tree/temporal_awareness/preprocessing/generate_profiles
- https://github.com/mozilla/smart_search/blob/temporal_awareness/notebooks/generate_history.ipynb

### Additional Dataset

We also include a second dataset built from publicly available synthetic histories:

- Source repo: https://github.com/komosny/synthetic-browsing-history
- Details/paper: https://pmc.ncbi.nlm.nih.gov/articles/PMC11754914/

Countries (English-focused): Australia, Canada, United Kingdom, United States.

**Preprocessing**

- Deduplicate by URL.
- Fetch title and description with priorities:
  - title: `<title>` → `meta[property=og:title]` → `meta[name=twitter:title]` → `<h1>`
  - description: `meta[name=description]` → `meta[property=og:description]` → `meta[name=twitter:description]` → `summary`
- Enforce Firefox-style limits: title ≤ 4096, description ≤ 256.
- Drop records with no title and no description.

**Query Construction**

- For each profile, randomly sample 50 URLs.
- Use an LLM (gpt-5-mini) to generate 50 semantic search queries that should retrieve the given URL, conditioning on its title and description.

## Raw Layer

We maintain a reference `raw/` tree for reproducibility, but the canonical interface is the Parquet layer:

```
raw/profiles/<variant>/<single|multi-label>/<profile>/
├── history.csv
├── query.csv
└── temporal_context.json   # only for temporal
```

> Use Parquet for all experiments.
> `raw/` is reference-only; no remote execution loaders.
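To illustrate how a temporal query's `ref_datetime_iso` anchor relates to the microsecond `last_visit_date` values, here is a minimal sketch that resolves "yesterday" into a half-open time window and filters visit timestamps against it. The function names, the UTC fallback for naive timestamps, and the midnight-to-midnight definition of "yesterday" are assumptions for illustration. The locale rules in `temporal_context.json` are not applied here, and this is not the authors' resolver.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Tuple

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_us(dt: datetime) -> int:
    """Convert an aware datetime to integer microseconds since the Unix epoch."""
    return (dt - EPOCH) // timedelta(microseconds=1)


def yesterday_window_us(ref_datetime_iso: str) -> Tuple[int, int]:
    """Return the [start, end) window of the day before the reference time."""
    ref = datetime.fromisoformat(ref_datetime_iso)
    if ref.tzinfo is None:
        # Assumption: treat naive reference times as UTC.
        ref = ref.replace(tzinfo=timezone.utc)
    today_start = ref.replace(hour=0, minute=0, second=0, microsecond=0)
    yesterday_start = today_start - timedelta(days=1)
    return to_us(yesterday_start), to_us(today_start)


def visited_yesterday(visit_times_us: Iterable[int], ref_datetime_iso: str) -> List[int]:
    """Keep only the visit timestamps that fall inside yesterday's window."""
    start, end = yesterday_window_us(ref_datetime_iso)
    return [t for t in visit_times_us if start <= t < end]


if __name__ == "__main__":
    ref = "2025-11-29T10:30:00+00:00"  # made-up reference time
    visits = [
        to_us(datetime(2025, 11, 27, 23, 59, tzinfo=timezone.utc)),  # two days ago
        to_us(datetime(2025, 11, 28, 12, 0, tzinfo=timezone.utc)),   # yesterday
        to_us(datetime(2025, 11, 29, 9, 0, tzinfo=timezone.utc)),    # today
    ]
    print(visited_yesterday(visits, ref))
```

The half-open window avoids double-counting a visit that lands exactly on midnight. Applying the same filter to the Parquet `docs` table assumes its `last_visit_date` column uses the same microsecond unit as `history.csv`.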


Provider:

Mozilla
