allenai/asta-user-interactions

Name: allenai/asta-user-interactions
Creator: allenai
Published: 2026-02-27 20:10:32
License: ODC-BY

Hugging Face · Updated 2026-02-27 · Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/allenai/asta-user-interactions

Description:

---
license: odc-by
configs:
- config_name: optin_queries
  data_files:
  - split: train
    path: data/optin_queries_anonymized.parquet
- config_name: section_expansions
  data_files:
  - split: train
    path: data/section_expansions_anonymized.parquet
- config_name: s2_link_clicks
  data_files:
  - split: train
    path: data/s2_link_clicks_anonymized.parquet
- config_name: report_section_titles
  data_files:
  - split: train
    path: data/report_section_titles_anonymized.parquet
- config_name: report_corpus_ids
  data_files:
  - split: train
    path: data/report_corpus_ids_anonymized.parquet
- config_name: pf_shown_results
  data_files:
  - split: train
    path: data/pf_shown_results_anonymized.parquet
---

# ScholarQA and Paper Finder User Interaction Dataset

This dataset contains anonymized user interaction data from two AI-powered research tools: **ScholarQA (SQA)** and **Paper Finder (PF)**. The data comes from users who opted in to having their queries and interaction data released.

## Usage

```python
from datasets import load_dataset

queries = load_dataset("allenai/asta-user-interactions", "optin_queries")["train"]
clicks = load_dataset("allenai/asta-user-interactions", "s2_link_clicks")["train"]
shown_results = load_dataset("allenai/asta-user-interactions", "pf_shown_results")["train"]
```

Available configs: `optin_queries`, `section_expansions`, `s2_link_clicks`, `report_section_titles`, `report_corpus_ids`, `pf_shown_results`.

## Tools

**ScholarQA** is a conversational research assistant that generates comprehensive literature review reports in response to user queries. Reports are organized into sections, with each section containing synthesized information and citations to relevant papers.

**Paper Finder** is a semantic paper search tool that returns ranked lists of relevant papers based on user queries.

## Dataset Overview

| File | Description |
|------|-------------|
| `optin_queries_anonymized.parquet` | User queries submitted to both tools |
| `section_expansions_anonymized.parquet` | Section expand clicks in SQA reports |
| `s2_link_clicks_anonymized.parquet` | Clicks on Semantic Scholar paper links |
| `report_section_titles_anonymized.parquet` | Section titles generated in SQA reports |
| `report_corpus_ids_anonymized.parquet` | Papers cited in SQA report sections |
| `pf_shown_results_anonymized.parquet` | Papers shown in Paper Finder search results |

## Summary Statistics

| Metric | SQA | PF |
|--------|-----|-----|
| Threads (queries) | 127,465 | 131,534 |
| Paper link clicks | 24,304 | 181,854 |
| Threads with clicks | 5,860 | 47,894 |
| Papers referenced | 2,066,393 | 3,858,774 |

Additional statistics:

- Section expansions: 226,004 events across 33,408 threads
- Unique section titles: 500,255

## Data Collection Period

Original collection window: January 1, 2025 to August 26, 2025

Date ranges by dataset:

| Dataset | Earliest | Latest |
|---------|----------|--------|
| optin_queries | 2025-02-25 | 2025-08-27 |
| section_expansions | 2025-06-03 | 2025-08-27 |
| s2_link_clicks | 2025-03-26 | 2025-08-27 |
| pf_shown_results | 2025-03-19 | 2025-08-27 |

Note: Actual date ranges are output to `anonymization_stats.json` when the pipeline runs.

## File Descriptions

### optin_queries_anonymized.parquet

User queries submitted to ScholarQA and Paper Finder.

| Column | Type | Description |
|--------|------|-------------|
| `query` | string | The user's query text |
| `thread_id` | string | Hashed identifier for the conversation thread |
| `query_ts` | timestamp | When the query was submitted |
| `tool` | string | Which tool was used: `sqa` or `pf` |

### section_expansions_anonymized.parquet

Records of users expanding (clicking to view) sections in ScholarQA reports.
ScholarQA reports are displayed with sections collapsed by default; this table captures when users click to expand and read a section.

| Column | Type | Description |
|--------|------|-------------|
| `thread_id` | string | Hashed identifier for the conversation thread |
| `section_expand_ts` | timestamp | When the section was expanded |
| `section_id` | int | Index of the expanded section (0-indexed) |

### s2_link_clicks_anonymized.parquet

Clicks on Semantic Scholar paper links within either tool. These represent explicit user interest in viewing a cited or retrieved paper.

| Column | Type | Description |
|--------|------|-------------|
| `thread_id` | string | Hashed identifier for the conversation thread |
| `s2_link_click_ts` | timestamp | When the link was clicked |
| `corpus_id` | int | Semantic Scholar corpus ID of the clicked paper |
| `tool` | string | Which tool the click occurred in: `sqa` or `pf` |

### report_section_titles_anonymized.parquet

Section titles from ScholarQA generated reports. Each report contains multiple sections, and this table maps section indices to their titles.

| Column | Type | Description |
|--------|------|-------------|
| `thread_id` | string | Hashed identifier for the conversation thread |
| `section_idx` | int | Index of the section within the report (0-indexed) |
| `section_title` | string | Title of the section |

### report_corpus_ids_anonymized.parquet

Papers cited in ScholarQA report sections. This represents the set of papers that ScholarQA retrieved and cited when generating a report.

| Column | Type | Description |
|--------|------|-------------|
| `thread_id` | string | Hashed identifier for the conversation thread |
| `corpus_id` | int | Semantic Scholar corpus ID of the cited paper |

### pf_shown_results_anonymized.parquet

Papers shown in Paper Finder search results. Each row represents a paper that was displayed to the user at a specific position in the results list.

| Column | Type | Description |
|--------|------|-------------|
| `thread_id` | string | Hashed identifier for the conversation thread |
| `query_ts` | timestamp | When the query was submitted |
| `thread_pf_query_num` | int | Query number within the thread (1 = first query) |
| `result_position` | int | Position in the results list (0-indexed) |
| `corpus_id` | int | Semantic Scholar corpus ID of the shown paper |

## Joining Datasets

All datasets can be joined using `thread_id` as the join key. The hashed thread IDs are consistent across all tables, enabling the same joins as the original data.

Common analyses include:

- **Clicked sections**: Join `section_expansions_anonymized` with `report_section_titles_anonymized` on `thread_id` and `section_id`/`section_idx` to see which section titles users clicked
- **Clicked citations**: Join `s2_link_clicks_anonymized` (filtered to `tool='sqa'`) with `report_corpus_ids_anonymized` on `thread_id` and `corpus_id` to identify which cited papers users clicked
- **Paper Finder click-through**: Join `pf_shown_results_anonymized` with `s2_link_clicks_anonymized` (filtered to `tool='pf'`) on `thread_id` and `corpus_id` to compute click-through rates by position

## Citation

[TODO: Add citation information]

## License

This dataset is licensed under ODC-BY. It is intended for research and educational use in accordance with Ai2's [Responsible Use Guidelines](https://allenai.org/responsible-use).
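
## Example: joining tables

Two of the joins listed under Joining Datasets can be sketched in plain Python over row dictionaries, which is the shape that iterating a `datasets` split yields. This is a minimal illustration, not part of the dataset's own tooling: the function names `ctr_by_position` and `expanded_section_titles` are hypothetical, and the rows below are synthetic stand-ins that only reuse the documented column names.

```python
from collections import defaultdict


def ctr_by_position(shown_rows, click_rows):
    """Share of shown Paper Finder results that were clicked, per result_position."""
    # Keep only Paper Finder clicks, keyed by (thread_id, corpus_id).
    clicked = {
        (row["thread_id"], row["corpus_id"])
        for row in click_rows
        if row["tool"] == "pf"
    }
    shown = defaultdict(int)
    hits = defaultdict(int)
    for row in shown_rows:
        pos = row["result_position"]
        shown[pos] += 1
        if (row["thread_id"], row["corpus_id"]) in clicked:
            hits[pos] += 1
    return {pos: hits[pos] / shown[pos] for pos in sorted(shown)}


def expanded_section_titles(expansion_rows, title_rows):
    """Title of each expanded section; section_id is matched against section_idx."""
    titles = {
        (row["thread_id"], row["section_idx"]): row["section_title"]
        for row in title_rows
    }
    # None marks an expansion with no matching title row.
    return [
        titles.get((row["thread_id"], row["section_id"]))
        for row in expansion_rows
    ]


# Synthetic rows for illustration only; none of these values come from the dataset.
shown_rows = [
    {"thread_id": "t1", "result_position": 0, "corpus_id": 10},
    {"thread_id": "t1", "result_position": 1, "corpus_id": 11},
    {"thread_id": "t1", "result_position": 2, "corpus_id": 12},
    {"thread_id": "t2", "result_position": 0, "corpus_id": 10},
    {"thread_id": "t2", "result_position": 1, "corpus_id": 13},
]
click_rows = [
    {"thread_id": "t1", "corpus_id": 11, "tool": "pf"},
    {"thread_id": "t2", "corpus_id": 10, "tool": "pf"},
    {"thread_id": "t1", "corpus_id": 10, "tool": "sqa"},  # ignored: not a PF click
]
expansion_rows = [
    {"thread_id": "t3", "section_id": 0},
    {"thread_id": "t3", "section_id": 2},
]
title_rows = [
    {"thread_id": "t3", "section_idx": 0, "section_title": "Background"},
    {"thread_id": "t3", "section_idx": 1, "section_title": "Methods"},
    {"thread_id": "t3", "section_idx": 2, "section_title": "Open problems"},
]

ctr = ctr_by_position(shown_rows, click_rows)
expanded = expanded_section_titles(expansion_rows, title_rows)
print(ctr)       # {0: 0.5, 1: 0.5, 2: 0.0}
print(expanded)  # ['Background', 'Open problems']
```

The click table has no `thread_pf_query_num` column, so in this sketch a click is credited to every row in the thread that shows the same `corpus_id`. For the full tables, a dataframe or SQL join on the same keys would likely be faster than a Python loop.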

