searchsim/cognitive-traces-stackoverflow
Source: Hugging Face. Updated 2026-03-19; indexed 2026-03-29.
Download link: https://hf-mirror.com/datasets/searchsim/cognitive-traces-stackoverflow
---
license: cc-by-4.0
task_categories:
- text-classification
- token-classification
language:
- en
tags:
- information-retrieval
- user-simulation
- cognitive-modeling
- information-foraging-theory
- search-logs
pretty_name: "Cognitive Traces — Stack Overflow"
size_categories:
- 100K<n<1M
---
# Cognitive Traces — Stack Overflow
## Dataset Description
This dataset contains **cognitive trace annotations** for the Stack Overflow dataset, produced by the multi-agent annotation framework described in:
> **Beyond the Click: A Framework for Inferring Cognitive Traces in Search**
> Saber Zerhoudi, Michael Granitzer. ECIR 2026.
Each user event (question, answer, comment, edit, vote) is annotated with a cognitive label from **Information Foraging Theory (IFT)**, along with the full annotation chain (analyst, critic, judge) and confidence scores.
## Dataset Statistics
| Metric | Value |
|--------|-------|
| Sessions | 18,629 |
| Events | 175,326 |
| Action Types | COMMENT, POST_ANSWER, POST_QUESTION, EDIT_INITIAL_BODY, EDIT_INITIAL_TAGS, EDIT_INITIAL_TITLE, EDIT_BODY, EDIT_TAGS, EDIT_TITLE, VOTE_BOUNTY_START, VOTE_UP (11) |
| Cognitive Labels | 6 (FollowingScent, ApproachingSource, ForagingSuccess, DietEnrichment, PoorScent, LeavingPatch) |
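As a quick sense of scale, the two totals above imply a mean session length. The snippet below is only arithmetic on the reported figures, not a statistic stated by the authors.

```python
# Totals copied from the Dataset Statistics table.
sessions = 18_629
events = 175_326

# Mean number of events per session implied by those totals.
events_per_session = events / sessions
print(f"{events_per_session:.2f}")  # prints 9.41
```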
## Quick Start
```python
from datasets import load_dataset

ds = load_dataset("searchsim/cognitive-traces-stackoverflow")

# Access the data
print(ds["train"][0])

# Filter by cognitive label
struggling = ds["train"].filter(lambda x: x["cognitive_label"] == "PoorScent")
print(f"Events with PoorScent: {len(struggling)}")

# Get all events for a session
session = ds["train"].filter(lambda x: x["session_id"] == "so_session_1543")
for event in session:
    # Single quotes inside the f-string keep this valid on Python versions before 3.12
    print(f"  {event['action_type']}: {event['cognitive_label']}")
```
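Filtering the full split once per session rescans every row each time. When many sessions are needed, a single pass that groups rows by `session_id` may be cheaper. The sketch below is a minimal illustration under stated assumptions: `group_by_session` is a hypothetical helper, the sample rows are made up for demonstration, and the sort assumes all `event_timestamp` strings share one ISO format so that string order matches chronological order.

```python
from collections import defaultdict


def group_by_session(rows):
    """Group event rows by session_id, ordering each session by event_timestamp."""
    sessions = defaultdict(list)
    for row in rows:
        sessions[row["session_id"]].append(row)
    # Assumption: timestamps share one ISO format, so string order is time order.
    for events in sessions.values():
        events.sort(key=lambda e: e["event_timestamp"])
    return dict(sessions)


# Made-up rows that only mimic the documented columns.
sample_rows = [
    {"session_id": "s1", "event_timestamp": "2020-01-01T10:05:00", "action_type": "COMMENT"},
    {"session_id": "s2", "event_timestamp": "2020-01-02T09:00:00", "action_type": "POST_ANSWER"},
    {"session_id": "s1", "event_timestamp": "2020-01-01T10:00:00", "action_type": "POST_QUESTION"},
]

grouped = group_by_session(sample_rows)
for session_id, events in grouped.items():
    print(session_id, [e["action_type"] for e in events])
```

With the real data, passing `ds["train"]` in place of `sample_rows` should work, since iterating a `datasets.Dataset` yields one dict per row.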
## Column Schema
| Column | Type | Description |
|--------|------|-------------|
| `session_id` | string | Unique session identifier |
| `event_id` | string | Unique event identifier |
| `event_timestamp` | string | ISO timestamp |
| `action_type` | string | User action type (11 types, see above) |
| `content` | string | Event content (question body, answer text, comment, etc.) |
| `cognitive_label` | string | Final IFT cognitive label |
| `analyst_label` | string | Analyst agent's proposed label |
| `analyst_justification` | string | Analyst's reasoning |
| `critic_label` | string | Critic agent's proposed label |
| `critic_agreement` | string | Whether Critic agreed with Analyst |
| `critic_justification` | string | Critic's reasoning |
| `judge_justification` | string | Judge's final decision reasoning |
| `confidence_score` | float | Framework confidence (0–1) |
| `disagreement_score` | float | Analyst–Critic disagreement (0–1) |
| `flagged_for_review` | bool | Whether flagged for human review |
| `pipeline_mode` | string | Annotation pipeline mode |
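The quality columns in the schema can be combined to select a more reliable subset. The sketch below is one possible approach, not the authors' procedure: `high_confidence` and `label_agreement_rate` are hypothetical helpers, the 0.8 threshold is an arbitrary choice, and the sample rows are invented. Agreement is computed by comparing `analyst_label` with `critic_label` directly, because the card does not document which string values `critic_agreement` takes.

```python
def high_confidence(rows, min_confidence=0.8):
    """Keep rows at or above min_confidence that were not flagged for review."""
    return [
        row for row in rows
        if row["confidence_score"] >= min_confidence and not row["flagged_for_review"]
    ]


def label_agreement_rate(rows):
    """Fraction of rows where analyst_label equals critic_label."""
    rows = list(rows)
    if not rows:
        return 0.0
    matches = sum(1 for row in rows if row["analyst_label"] == row["critic_label"])
    return matches / len(rows)


# Invented rows that only mimic the documented columns.
sample_rows = [
    {"analyst_label": "FollowingScent", "critic_label": "FollowingScent",
     "confidence_score": 0.95, "flagged_for_review": False},
    {"analyst_label": "PoorScent", "critic_label": "LeavingPatch",
     "confidence_score": 0.55, "flagged_for_review": True},
    {"analyst_label": "ForagingSuccess", "critic_label": "ForagingSuccess",
     "confidence_score": 0.85, "flagged_for_review": False},
    {"analyst_label": "DietEnrichment", "critic_label": "DietEnrichment",
     "confidence_score": 0.90, "flagged_for_review": True},
]

kept = high_confidence(sample_rows)
print(len(kept), label_agreement_rate(sample_rows))
```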
## IFT Cognitive Labels
| Label | IFT Concept | Interpretation |
|-------|-------------|----------------|
| FollowingScent | Information scent following | User pursuing a promising trail |
| ApproachingSource | Source approaching | User converging on target information |
| ForagingSuccess | Successful foraging | User found desired information |
| DietEnrichment | Diet enrichment | User broadening information intake |
| PoorScent | Poor information scent | Trail quality deteriorating |
| LeavingPatch | Patch leaving | User abandoning current direction |
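Because each event carries one of these six labels, a session can be read as a label sequence, and counting consecutive label pairs gives a rough picture of how users move between foraging states. The sketch below assumes the labels of one session are already in time order; `label_transitions` is a hypothetical helper and the sequence is invented for illustration.

```python
from collections import Counter


def label_transitions(labels):
    """Count consecutive (previous, next) label pairs in one ordered session."""
    return Counter(zip(labels, labels[1:]))


# Invented label sequence for a single session, already in time order.
session_labels = [
    "FollowingScent",
    "ApproachingSource",
    "PoorScent",
    "FollowingScent",
    "ApproachingSource",
    "ForagingSuccess",
]

transitions = label_transitions(session_labels)
for (previous, current), count in transitions.most_common():
    print(f"{previous} -> {current}: {count}")
```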
## Source Dataset
Based on the Stack Overflow Data Dump (Stack Exchange). Contains questions, answers, comments, edits, and votes from the Stack Overflow Q&A platform.
## Citation
```bibtex
@inproceedings{zerhoudi2026beyond,
title={Beyond the Click: A Framework for Inferring Cognitive Traces in Search},
author={Zerhoudi, Saber and Granitzer, Michael},
booktitle={Proceedings of the 48th European Conference on Information Retrieval (ECIR)},
year={2026}
}
```
## License
CC-BY-4.0. The cognitive annotations are released under Creative Commons Attribution 4.0. The underlying source datasets have their own licenses — please refer to the original dataset providers.
## Links
- [Paper](https://traces.searchsim.org/)
- [GitHub Repository](https://github.com/searchsim-org/cognitive-traces)
- [Annotation Tool](https://github.com/searchsim-org/cognitive-traces)
Provider: searchsim



