allenai/asta-user-interactions
Hugging Face | Updated 2026-02-27 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/allenai/asta-user-interactions
Resource description:
---
license: odc-by
configs:
- config_name: optin_queries
  data_files:
  - split: train
    path: data/optin_queries_anonymized.parquet
- config_name: section_expansions
  data_files:
  - split: train
    path: data/section_expansions_anonymized.parquet
- config_name: s2_link_clicks
  data_files:
  - split: train
    path: data/s2_link_clicks_anonymized.parquet
- config_name: report_section_titles
  data_files:
  - split: train
    path: data/report_section_titles_anonymized.parquet
- config_name: report_corpus_ids
  data_files:
  - split: train
    path: data/report_corpus_ids_anonymized.parquet
- config_name: pf_shown_results
  data_files:
  - split: train
    path: data/pf_shown_results_anonymized.parquet
---
# ScholarQA and Paper Finder User Interaction Dataset
This dataset contains anonymized user interaction data from two AI-powered research tools: **ScholarQA (SQA)** and **Paper Finder (PF)**. The data comes from users who opted in to having their queries and interaction data released.
## Usage
```python
from datasets import load_dataset
queries = load_dataset("allenai/asta-user-interactions", "optin_queries")["train"]
clicks = load_dataset("allenai/asta-user-interactions", "s2_link_clicks")["train"]
shown_results = load_dataset("allenai/asta-user-interactions", "pf_shown_results")["train"]
```
Available configs: `optin_queries`, `section_expansions`, `s2_link_clicks`, `report_section_titles`, `report_corpus_ids`, `pf_shown_results`.
## Tools
**ScholarQA** is a conversational research assistant that generates comprehensive literature review reports in response to user queries. Reports are organized into sections, with each section containing synthesized information and citations to relevant papers.
**Paper Finder** is a semantic paper search tool that returns ranked lists of relevant papers based on user queries.
## Dataset Overview
| File | Description |
|------|-------------|
| `optin_queries_anonymized.parquet` | User queries submitted to both tools |
| `section_expansions_anonymized.parquet` | Section expand clicks in SQA reports |
| `s2_link_clicks_anonymized.parquet` | Clicks on Semantic Scholar paper links |
| `report_section_titles_anonymized.parquet` | Section titles generated in SQA reports |
| `report_corpus_ids_anonymized.parquet` | Papers cited in SQA report sections |
| `pf_shown_results_anonymized.parquet` | Papers shown in Paper Finder search results |
## Summary Statistics
| Metric | SQA | PF |
|--------|-----|-----|
| Threads (queries) | 127,465 | 131,534 |
| Paper link clicks | 24,304 | 181,854 |
| Threads with clicks | 5,860 | 47,894 |
| Papers referenced | 2,066,393 | 3,858,774 |
Additional statistics:
- Section expansions: 226,004 events across 33,408 threads
- Unique section titles: 500,255
## Data Collection Period
Original collection window: January 1, 2025 to August 26, 2025
Date ranges by dataset:
| Dataset | Earliest | Latest |
|---------|----------|--------|
| optin_queries | 2025-02-25 | 2025-08-27 |
| section_expansions | 2025-06-03 | 2025-08-27 |
| s2_link_clicks | 2025-03-26 | 2025-08-27 |
| pf_shown_results | 2025-03-19 | 2025-08-27 |
Note: Actual date ranges are output to `anonymization_stats.json` when the pipeline runs.
## File Descriptions
### optin_queries_anonymized.parquet
User queries submitted to ScholarQA and Paper Finder.
| Column | Type | Description |
|--------|------|-------------|
| `query` | string | The user's query text |
| `thread_id` | string | Hashed identifier for the conversation thread |
| `query_ts` | timestamp | When the query was submitted |
| `tool` | string | Which tool was used: `sqa` or `pf` |
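As a rough illustration of how the `tool` column separates the two products, the sketch below counts distinct threads per tool. The rows are invented stand-ins shaped like the documented columns (timestamps omitted), represented as dicts in the way rows appear when iterating a loaded split. The helper name `threads_per_tool` is hypothetical and not part of the dataset or any library.

```python
from collections import defaultdict

# Toy rows shaped like optin_queries; all values are invented for illustration.
queries = [
    {"query": "graph neural networks survey", "thread_id": "a1", "tool": "sqa"},
    {"query": "protein folding benchmarks", "thread_id": "b2", "tool": "pf"},
    {"query": "protein folding benchmarks 2024", "thread_id": "b2", "tool": "pf"},
    {"query": "retrieval augmented generation", "thread_id": "c3", "tool": "pf"},
]

def threads_per_tool(rows):
    """Count distinct thread_id values for each tool (hypothetical helper)."""
    threads = defaultdict(set)
    for row in rows:
        threads[row["tool"]].add(row["thread_id"])
    return {tool: len(ids) for tool, ids in threads.items()}

counts = threads_per_tool(queries)
print(counts)
```

Counting distinct `thread_id` values rather than rows matters here because a single thread can contain more than one query.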
### section_expansions_anonymized.parquet
Records of users expanding (clicking to view) sections in ScholarQA reports. ScholarQA reports are displayed with sections collapsed by default; this table captures when users click to expand and read a section.
| Column | Type | Description |
|--------|------|-------------|
| `thread_id` | string | Hashed identifier for the conversation thread |
| `section_expand_ts` | timestamp | When the section was expanded |
| `section_id` | int | Index of the expanded section (0-indexed) |
### s2_link_clicks_anonymized.parquet
Clicks on Semantic Scholar paper links within either tool. These represent explicit user interest in viewing a cited or retrieved paper.
| Column | Type | Description |
|--------|------|-------------|
| `thread_id` | string | Hashed identifier for the conversation thread |
| `s2_link_click_ts` | timestamp | When the link was clicked |
| `corpus_id` | int | Semantic Scholar corpus ID of the clicked paper |
| `tool` | string | Which tool the click occurred in: `sqa` or `pf` |
### report_section_titles_anonymized.parquet
Section titles from ScholarQA generated reports. Each report contains multiple sections, and this table maps section indices to their titles.
| Column | Type | Description |
|--------|------|-------------|
| `thread_id` | string | Hashed identifier for the conversation thread |
| `section_idx` | int | Index of the section within the report (0-indexed) |
| `section_title` | string | Title of the section |
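To see which titles users actually expanded, the expansion events can be matched to this table on the thread and the section index. The sketch below does that on invented rows using plain Python dicts (timestamps omitted). It assumes that `section_id` in the expansions table and `section_idx` here refer to the same 0-based index, and that each `(thread_id, section_idx)` pair appears at most once in the titles table; neither assumption has been verified against the full data.

```python
from collections import Counter

# Toy rows shaped like section_expansions (timestamps omitted).
expansions = [
    {"thread_id": "t1", "section_id": 0},
    {"thread_id": "t1", "section_id": 2},
    {"thread_id": "t2", "section_id": 0},
]

# Toy rows shaped like report_section_titles.
titles = [
    {"thread_id": "t1", "section_idx": 0, "section_title": "Background"},
    {"thread_id": "t1", "section_idx": 1, "section_title": "Methods"},
    {"thread_id": "t1", "section_idx": 2, "section_title": "Open Problems"},
    {"thread_id": "t2", "section_idx": 0, "section_title": "Background"},
]

# Index titles by the composite key (thread_id, section index).
# Assumption: the key is unique; a duplicate key would silently keep the last row.
title_by_key = {
    (row["thread_id"], row["section_idx"]): row["section_title"] for row in titles
}

# Inner join: keep only expansions that have a matching title row.
expanded_titles = [
    title_by_key[(e["thread_id"], e["section_id"])]
    for e in expansions
    if (e["thread_id"], e["section_id"]) in title_by_key
]

title_counts = Counter(expanded_titles)
print(title_counts.most_common())
```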
### report_corpus_ids_anonymized.parquet
Papers cited in ScholarQA report sections. This represents the set of papers that ScholarQA retrieved and cited when generating a report.
| Column | Type | Description |
|--------|------|-------------|
| `thread_id` | string | Hashed identifier for the conversation thread |
| `corpus_id` | int | Semantic Scholar corpus ID of the cited paper |
### pf_shown_results_anonymized.parquet
Papers shown in Paper Finder search results. Each row represents a paper that was displayed to the user at a specific position in the results list.
| Column | Type | Description |
|--------|------|-------------|
| `thread_id` | string | Hashed identifier for the conversation thread |
| `query_ts` | timestamp | When the query was submitted |
| `thread_pf_query_num` | int | Query number within the thread (1 = first query) |
| `result_position` | int | Position in the results list (0-indexed) |
| `corpus_id` | int | Semantic Scholar corpus ID of the shown paper |
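One use of `result_position` is estimating click-through rate by rank. The sketch below works on invented rows and is only one possible way to attribute clicks. The click table has no query number, so a click is matched to every shown row with the same `thread_id` and `corpus_id`, and repeat clicks on the same paper in a thread are counted once. Both choices are assumptions of this example, not something the dataset prescribes.

```python
from collections import Counter

# Toy rows shaped like pf_shown_results (query_ts omitted).
shown = [
    {"thread_id": "p1", "thread_pf_query_num": 1, "result_position": 0, "corpus_id": 101},
    {"thread_id": "p1", "thread_pf_query_num": 1, "result_position": 1, "corpus_id": 102},
    {"thread_id": "p2", "thread_pf_query_num": 1, "result_position": 0, "corpus_id": 201},
    {"thread_id": "p2", "thread_pf_query_num": 1, "result_position": 1, "corpus_id": 202},
]

# Toy rows shaped like s2_link_clicks (s2_link_click_ts omitted).
clicks = [
    {"thread_id": "p1", "corpus_id": 101, "tool": "pf"},
    {"thread_id": "p1", "corpus_id": 102, "tool": "pf"},
    {"thread_id": "p2", "corpus_id": 201, "tool": "pf"},
    {"thread_id": "p2", "corpus_id": 201, "tool": "pf"},   # repeat click, counted once
    {"thread_id": "s9", "corpus_id": 101, "tool": "sqa"},  # not a Paper Finder click
]

# Deduplicated set of (thread, paper) pairs clicked in Paper Finder.
clicked = {(c["thread_id"], c["corpus_id"]) for c in clicks if c["tool"] == "pf"}

shown_at = Counter()
clicked_at = Counter()
for row in shown:
    pos = row["result_position"]
    shown_at[pos] += 1
    if (row["thread_id"], row["corpus_id"]) in clicked:
        clicked_at[pos] += 1

# Fraction of shown results at each position that received at least one click.
ctr_by_position = {pos: clicked_at[pos] / shown_at[pos] for pos in sorted(shown_at)}
print(ctr_by_position)
```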
## Joining Datasets
All tables can be joined on `thread_id`. It serves as the join key rather than a unique row identifier, since a thread can have many rows in a table. The hashed thread IDs are consistent across all tables, enabling the same joins as the original data. Common analyses include:
- **Clicked sections**: Join `section_expansions_anonymized` with `report_section_titles_anonymized` on `thread_id` and `section_id`/`section_idx` to see which section titles users clicked
- **Clicked citations**: Join `s2_link_clicks_anonymized` (filtered to `tool='sqa'`) with `report_corpus_ids_anonymized` on `thread_id` and `corpus_id` to identify which cited papers users clicked
- **Paper Finder click-through**: Join `pf_shown_results_anonymized` with `s2_link_clicks_anonymized` (filtered to `tool='pf'`) on `thread_id` and `corpus_id` to compute click-through rates by position
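The clicked-citations join from the list above can be sketched as a set intersection on `(thread_id, corpus_id)` pairs. The rows are invented and timestamps are omitted. Treating repeat clicks as a single click, and ignoring clicks on papers absent from the cited set, are assumptions made for this example.

```python
# Toy rows shaped like report_corpus_ids.
cited = [
    {"thread_id": "s1", "corpus_id": 11},
    {"thread_id": "s1", "corpus_id": 12},
    {"thread_id": "s1", "corpus_id": 13},
    {"thread_id": "s2", "corpus_id": 11},
]

# Toy rows shaped like s2_link_clicks (s2_link_click_ts omitted).
clicks = [
    {"thread_id": "s1", "corpus_id": 12, "tool": "sqa"},
    {"thread_id": "s1", "corpus_id": 99, "tool": "sqa"},  # clicked, but not in the cited set
    {"thread_id": "p7", "corpus_id": 11, "tool": "pf"},   # filtered out by tool
]

# Composite keys on both sides; sets drop duplicate rows and repeat clicks.
cited_keys = {(row["thread_id"], row["corpus_id"]) for row in cited}
sqa_click_keys = {
    (c["thread_id"], c["corpus_id"]) for c in clicks if c["tool"] == "sqa"
}

# Inner join on (thread_id, corpus_id): cited papers that were also clicked.
clicked_citations = sorted(cited_keys & sqa_click_keys)

# Share of distinct cited (thread, paper) pairs that received a click.
click_rate = len(clicked_citations) / len(cited_keys)
print(clicked_citations, click_rate)
```

The same three joins map directly onto a dataframe merge on the two key columns if a dataframe library is preferred.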
## Citation
[TODO: Add citation information]
## License
This dataset is licensed under ODC-BY. It is intended for research and educational use in accordance with Ai2's [Responsible Use Guidelines](https://allenai.org/responsible-use).
Provider: allenai



