
Mozilla/history-search-retrieval

Source: Hugging Face. Updated 2026-04-13; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/Mozilla/history-search-retrieval
Resource overview:
---
viewer: true
configs:
- config_name: docs
  data_files: data/v1/docs.parquet
- config_name: queries
  data_files: data/v1/queries.parquet
- config_name: qrels
  data_files: data/v1/qrels.parquet
license: mit
language:
- en
---

# Semantic History (Synthetic)

**Semantic History** is a synthetic dataset for **semantic history search** research. It ships as three **normalized** Parquet tables:

| Split     | Description                                                                                         |
|-----------|-----------------------------------------------------------------------------------------------------|
| `docs`    | Search history records with `url`, `title`, `description`, `frecency`, `last_visit_date`, and tags. |
| `queries` | One row per query, tagged by profile and temporal/multi-label flags.                                |
| `qrels`   | Relevance pairs linking `queries` ↔ `docs` with `rank` and `relevance`.                             |

All content is **synthetic** (no real browsing logs).

---

## Temporal Variant

The dataset includes a **temporal** slice designed to test retrieval with time-aware queries (e.g., *“yesterday”*, *“last week”*, *“on Black Friday”*). Temporal queries in `queries` are tagged with `variant="temporal"` and carry a per-query reference time `ref_datetime_iso`, which is the anchor used to resolve relative phrases (e.g., “yesterday” → specific date range).

For reproducibility, we also publish the **raw temporal profiles** (one folder per profile) that were used to construct the Parquet release.
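To make the role of `ref_datetime_iso` concrete, here is a minimal sketch of resolving “yesterday” into a half-open range of microsecond epoch timestamps. The helper names `to_epoch_us` and `yesterday_range_us` are hypothetical, and the sketch assumes that `last_visit_date` in `docs` uses the same microsecond epoch unit as the raw `history.csv`, that day boundaries follow the UTC offset carried by the reference time, and that naive timestamps are UTC. It ignores the locale rules stored in `temporal_context.json`, so it is an illustration rather than the resolution logic used to build the dataset.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_us(dt: datetime) -> int:
    """Convert an aware datetime to integer microseconds since the Unix epoch."""
    return (dt - EPOCH) // timedelta(microseconds=1)


def yesterday_range_us(ref_datetime_iso: str) -> Tuple[int, int]:
    """Return the [start, end) range of the day before the reference time."""
    # Note: before Python 3.11, fromisoformat() rejects a trailing "Z";
    # use an explicit offset such as "+00:00".
    ref = datetime.fromisoformat(ref_datetime_iso)
    if ref.tzinfo is None:
        # Assumption: treat naive reference times as UTC.
        ref = ref.replace(tzinfo=timezone.utc)
    today_start = ref.replace(hour=0, minute=0, second=0, microsecond=0)
    return to_epoch_us(today_start - timedelta(days=1)), to_epoch_us(today_start)


# Hypothetical reference time; real values come from the `queries` table.
start_us, end_us = yesterday_range_us("2025-03-10T15:30:00+00:00")


def visited_yesterday(last_visit_date_us: int) -> bool:
    """True when a visit timestamp falls inside the resolved range."""
    return start_us <= last_visit_date_us < end_us
```

A retriever could apply such a range as a filter on `last_visit_date` before or after semantic ranking; which order works better is an experimental question the dataset is meant to help answer.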
Each temporal profile contains:

- `history.csv`: history rows with `last_visit_date` (µs epoch) and `frecency`
- `query.csv`: temporal queries and expected matches
- `temporal_context.json`: reference time and locale rules (e.g., timezone, weekend, date format)

> See the detailed temporal documentation and per-locale notes here:
> https://huggingface.co/datasets/Mozilla/history-search-retrieval/blob/main/raw/profiles/temporal/README.md

---

## Motivation

- **User-oriented IR** (per-profile history retrieval)
- **Temporal-aware retrieval** (e.g., profile histories with a reference time)
- **Embedding & ranking evaluation** on synthetic history traces

---

## Data Layout

```
data/v1/
├── docs.parquet
├── queries.parquet
└── qrels.parquet
```

**Columns**

- **docs**: `doc_id, url, title, description, frecency, last_visit_date, profile, profile_id, variant`
- **queries**: `query_id, search_query, profile, profile_id, is_temporal, is_multi, variant, ref_datetime_iso`
- **qrels**: `query_id, doc_id, profile_id, relevance, rank, variant`

> `profile_id` is a stable, hashed identifier per profile folder; `variant` can be `temporal` or other configured variants; `is_multi` indicates multi-label queries.

---

## Firefox Places Background (schema context)

The dataset `docs` mimics Firefox’s Places DB:

- `moz_places` table schema:
  https://searchfox.org/firefox-main/source/toolkit/components/places/nsPlacesTables.h
- Length limits referenced in Places utils: `title` ≤ **4096** chars, `description` ≤ **256** chars
  https://searchfox.org/firefox-main/source/toolkit/components/places/PlacesUtils.sys.mjs#162
- `title` from DOM `<title>`:
  https://searchfox.org/firefox-main/source/dom/svg/SVGTitleElement.cpp
- `description` from prioritized page metadata:
  https://searchfox.org/firefox-main/source/toolkit/actors/ContentMetaChild.sys.mjs#12

> **Embedding text**: we use **`title + description`**.
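As an illustration of the embedding text described above, the following sketch joins the two fields of a `docs` row. The helper name `build_embedding_text` is hypothetical, and trimming each field to the cited Places limits before joining is an assumption made for this example. The single-space separator mirrors the `combined_text = title + " " + description` rule from the generation pipeline; whether the authors apply exactly this step at embedding time is not stated.

```python
from typing import Optional

# Length limits cited above from Firefox Places utils.
TITLE_MAX = 4096
DESCRIPTION_MAX = 256


def build_embedding_text(title: Optional[str], description: Optional[str]) -> str:
    """Join title and description into one string for an embedding model.

    Hypothetical helper: the truncation step is an assumption, not the
    dataset authors' documented procedure.
    """
    title = (title or "").strip()[:TITLE_MAX]
    description = (description or "").strip()[:DESCRIPTION_MAX]
    # Skip empty parts so a missing field does not leave stray whitespace.
    return " ".join(part for part in (title, description) if part)
```

For example, `build_embedding_text("Firefox Help", None)` yields `"Firefox Help"`, so rows with only one populated field still produce usable text.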
---

## How to Load (HF Datasets)

```python
from datasets import load_dataset

dataset_id = "Mozilla/history-search-retrieval"

# Load each config through its "train" split; these names are reused below.
docs = load_dataset(dataset_id, name="docs")["train"]
queries = load_dataset(dataset_id, name="queries")["train"]
qrels = load_dataset(dataset_id, name="qrels")["train"]
```

---

## Common Operations

### HF Datasets Version

#### List available profiles

```python
# One row per (profile_id, profile, variant) combination.
profiles = (docs.to_pandas()[["profile_id", "profile", "variant"]]
            .drop_duplicates()
            .sort_values(["profile", "variant"]))
```

#### Filter by profile

```python
pid = "262e49ec20c32c41"

p_docs = docs.filter(lambda x: x["profile_id"] == pid)
p_queries = queries.filter(lambda x: x["profile_id"] == pid)
p_qrels = qrels.filter(lambda x: x["profile_id"] == pid)
```

#### Temporal / multi-label slices

```python
q_temporal = queries.filter(lambda x: x["variant"] == "temporal")
q_multi = queries.filter(lambda x: x["is_multi"])
```

---

### Pandas Version

```python
import pandas as pd
from datasets import load_dataset

dataset_id = "Mozilla/history-search-retrieval"

docs_pd = load_dataset(dataset_id, name="docs")["train"].to_pandas()
queries_pd = load_dataset(dataset_id, name="queries")["train"].to_pandas()
qrels_pd = load_dataset(dataset_id, name="qrels")["train"].to_pandas()

# Index each table by its join key.
q = queries_pd[["query_id", "search_query", "profile_id", "variant", "is_multi"]].set_index("query_id")
r = qrels_pd[["query_id", "doc_id", "rank", "relevance"]].set_index("query_id")
d = docs_pd[["doc_id", "url", "title"]].set_index("doc_id")

# Reconstruct (query <-> doc/url) pairs
qr = r.join(q, how="inner").reset_index()
query_pairs = (qr.join(d, on="doc_id", how="left")
               .sort_values(["query_id", "rank"])
               .reset_index(drop=True))
```

## Evaluation

This dataset is intended for retrieval evaluation. Please see the [repository](https://github.com/mozilla/smart_search) for the evaluation scripts/notebooks and metric implementations:

- Precision@k, Recall@k, nDCG@k
- Reciprocal Rank (RR), Average Precision (AP)
- On-Topic Rate@k

## Synthetic Data Generation

All profiles, queries, and qrels are **synthetic**. The pipeline creates per-profile histories and LLM-judged relevance pairs from public English documents.

**Overview**

1. **Source**: [MS MARCO documents](https://microsoft.github.io/msmarco/Datasets.html) (msmarco-docs.tsv); keep top 500k rows with `docid,url,title,body` (English).
2. **Normalize**: build a unified table with `url,title,description,topic,lang,domain,combined_text`
   - description = first 300 chars of `body`
   - filter titles to length 5-200
   - `combined_text = title + " " + description`
3. **Sample**: draw ~50k examples (English-only for this HF release).
4. **Profiles**: create 25 synthetic profiles; for each, sample 1k-5k items aligned with profile themes; set random `frecency` (100-5000) and incremental `last_visit_date`.
5. **Queries & qrels**: generate profile-specific queries with an LLM; judge relevance over the profile history; export `qrels` with `rank` and `relevance=1`, plus a concise `query.csv` per profile.

**Code** (full scripts & notebooks):

- https://github.com/mozilla/smart_search/tree/temporal_awareness/preprocessing/generate_profiles
- https://github.com/mozilla/smart_search/blob/temporal_awareness/notebooks/generate_history.ipynb

### Additional Dataset

We also include a second dataset built from publicly available synthetic histories:

- Source repo: https://github.com/komosny/synthetic-browsing-history
- Details/paper: https://pmc.ncbi.nlm.nih.gov/articles/PMC11754914/

Countries (English-focused): Australia, Canada, United Kingdom, United States.

**Preprocessing**

- Deduplicate by URL.
- Fetch title and description with priorities:
  - title: `<title>` → `meta[property=og:title]` → `meta[name=twitter:title]` → `<h1>`
  - description: `meta[name=description]` → `meta[property=og:description]` → `meta[name=twitter:description]` → `summary`
- Enforce Firefox-style limits: title ≤ 4096, description ≤ 256.
- Drop records with no title and no description.

**Query Construction**

- For each profile, randomly sample 50 URLs.
- Use an LLM (gpt-5-mini) to generate 50 semantic search queries that should retrieve the given URL, conditioning on its title and description.

## Raw Layer

We maintain a reference `raw/` tree for reproducibility, but the canonical interface is the Parquet layer:

```
raw/profiles/<variant>/<single|multi-label>/<profile>/
├── history.csv
├── query.csv
└── temporal_context.json   # only for temporal
```

> Use Parquet for all experiments.
> `raw/` is reference-only; it has no remote-execution loaders.
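As a closing illustration of the metrics named in the Evaluation section, here is a minimal sketch of Reciprocal Rank, Precision@k, Recall@k, and nDCG@k under binary relevance, which matches the `relevance=1` labels in `qrels`. The function names and the toy document IDs are hypothetical, the ranked list is assumed to contain no duplicate IDs, and this is not the implementation from the smart_search repository; use the repository scripts for reported numbers.

```python
import math
from typing import List, Set


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """1 / position of the first relevant document, or 0.0 if none is retrieved."""
    for position, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / position
    return 0.0


def precision_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant (k must be positive)."""
    return sum(1 for doc_id in ranked[:k] if doc_id in relevant) / k


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    return sum(1 for doc_id in ranked[:k] if doc_id in relevant) / len(relevant)


def ndcg_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Binary-gain nDCG@k with a log2 position discount."""
    dcg = sum(1.0 / math.log2(position + 1)
              for position, doc_id in enumerate(ranked[:k], start=1)
              if doc_id in relevant)
    # The ideal ranking places every relevant document first.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(position + 1) for position in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Toy example: a system's ranking for one query and its relevant doc IDs.
ranked = ["d3", "d1", "d7"]
relevant = {"d1", "d9"}
rr = reciprocal_rank(ranked, relevant)   # first hit is at position 2
p3 = precision_at_k(ranked, relevant, 3)
r3 = recall_at_k(ranked, relevant, 3)
n3 = ndcg_at_k(ranked, relevant, 3)
```

In practice the `relevant` set for a query would be the `doc_id` values from `qrels` for that `query_id`, and the per-query scores would be averaged within each profile or variant.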

Provided by:
Mozilla