searchsim/cognitive-traces-stackoverflow

Name: searchsim/cognitive-traces-stackoverflow
Creator: searchsim
Published: 2026-03-19 23:05:09
License: CC-BY-4.0

Hugging Face · Updated 2026-03-19 · Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/searchsim/cognitive-traces-stackoverflow


Description:

---
license: cc-by-4.0
task_categories:
- text-classification
- token-classification
language:
- en
tags:
- information-retrieval
- user-simulation
- cognitive-modeling
- information-foraging-theory
- search-logs
pretty_name: "Cognitive Traces — Stack Overflow"
size_categories:
- 100K<n<1M
---

# Cognitive Traces — Stack Overflow

## Dataset Description

This dataset contains **cognitive trace annotations** for the Stack Overflow dataset, produced by the multi-agent annotation framework described in:

> **Beyond the Click: A Framework for Inferring Cognitive Traces in Search**
> Saber Zerhoudi, Michael Granitzer. ECIR 2026.

Each user event (question, answer, comment, edit, vote) is annotated with a cognitive label from **Information Foraging Theory (IFT)**, along with the full annotation chain (analyst, critic, judge) and confidence scores.

## Dataset Statistics

| Metric | Value |
|--------|-------|
| Sessions | 18,629 |
| Events | 175,326 |
| Action Types | 11 (COMMENT, POST_ANSWER, POST_QUESTION, EDIT_INITIAL_BODY, EDIT_INITIAL_TAGS, EDIT_INITIAL_TITLE, EDIT_BODY, EDIT_TAGS, EDIT_TITLE, VOTE_BOUNTY_START, VOTE_UP) |
| Cognitive Labels | 6 (FollowingScent, ApproachingSource, ForagingSuccess, DietEnrichment, PoorScent, LeavingPatch) |

## Quick Start

```python
from datasets import load_dataset

ds = load_dataset("searchsim/cognitive-traces-stackoverflow")

# Access the data
print(ds["train"][0])

# Filter by cognitive label
struggling = ds["train"].filter(lambda x: x["cognitive_label"] == "PoorScent")
print(f"Events with PoorScent: {len(struggling)}")

# Get all events for a session
session = ds["train"].filter(lambda x: x["session_id"] == "so_session_1543")
for event in session:
    # Single quotes inside the f-string keep this valid on Python versions before 3.12
    print(f"  {event['action_type']}: {event['cognitive_label']}")
```

## Column Schema

| Column | Type | Description |
|--------|------|-------------|
| `session_id` | string | Unique session identifier |
| `event_id` | string | Unique event identifier |
| `event_timestamp` | string | ISO timestamp |
| `action_type` | string | User action type (11 types, see above) |
| `content` | string | Event content (question body, answer text, comment, etc.) |
| `cognitive_label` | string | Final IFT cognitive label |
| `analyst_label` | string | Analyst agent's proposed label |
| `analyst_justification` | string | Analyst's reasoning |
| `critic_label` | string | Critic agent's proposed label |
| `critic_agreement` | string | Whether Critic agreed with Analyst |
| `critic_justification` | string | Critic's reasoning |
| `judge_justification` | string | Judge's final decision reasoning |
| `confidence_score` | float | Framework confidence (0–1) |
| `disagreement_score` | float | Analyst–Critic disagreement (0–1) |
| `flagged_for_review` | bool | Whether flagged for human review |
| `pipeline_mode` | string | Annotation pipeline mode |

## IFT Cognitive Labels

| Label | IFT Concept | Interpretation |
|-------|-------------|----------------|
| FollowingScent | Information scent following | User pursuing a promising trail |
| ApproachingSource | Source approaching | User converging on target information |
| ForagingSuccess | Successful foraging | User found desired information |
| DietEnrichment | Diet enrichment | User broadening information intake |
| PoorScent | Poor information scent | Trail quality deteriorating |
| LeavingPatch | Patch leaving | User abandoning current direction |

## Source Dataset

Based on the Stack Overflow Data Dump (Stack Exchange). Contains questions, answers, comments, edits, and votes from the Stack Overflow Q&A platform.

## Citation

```bibtex
@inproceedings{zerhoudi2026beyond,
  title={Beyond the Click: A Framework for Inferring Cognitive Traces in Search},
  author={Zerhoudi, Saber and Granitzer, Michael},
  booktitle={Proceedings of the 48th European Conference on Information Retrieval (ECIR)},
  year={2026}
}
```

## License

CC-BY-4.0. The cognitive annotations are released under Creative Commons Attribution 4.0. The underlying source datasets have their own licenses; please refer to the original dataset providers.

## Links

- [Paper](https://traces.searchsim.org/)
- [GitHub Repository](https://github.com/searchsim-org/cognitive-traces)
- [Annotation Tool](https://github.com/searchsim-org/cognitive-traces)
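
The column schema lists `confidence_score`, `disagreement_score`, and `flagged_for_review` for every event, so one plausible use is to keep only annotations that look reliable. The sketch below is a minimal illustration of that idea, not part of the dataset's tooling: the helper name `reliable_events`, both threshold defaults, and the sample rows are assumptions made for this example. It works on any iterable of row dictionaries, such as `ds["train"]`.

```python
from typing import Any, Iterable, Iterator, Mapping


def reliable_events(
    rows: Iterable[Mapping[str, Any]],
    min_confidence: float = 0.8,    # assumed threshold, not from the dataset card
    max_disagreement: float = 0.2,  # assumed threshold, not from the dataset card
) -> Iterator[Mapping[str, Any]]:
    """Yield rows that are not flagged for review and pass both thresholds."""
    for row in rows:
        if row["flagged_for_review"]:
            continue
        if row["confidence_score"] < min_confidence:
            continue
        if row["disagreement_score"] > max_disagreement:
            continue
        yield row


# Hypothetical rows that mimic the documented columns; real rows come from ds["train"].
sample_rows = [
    {"event_id": "e1", "cognitive_label": "PoorScent",
     "confidence_score": 0.95, "disagreement_score": 0.0, "flagged_for_review": False},
    {"event_id": "e2", "cognitive_label": "LeavingPatch",
     "confidence_score": 0.55, "disagreement_score": 0.6, "flagged_for_review": True},
    {"event_id": "e3", "cognitive_label": "FollowingScent",
     "confidence_score": 0.90, "disagreement_score": 0.5, "flagged_for_review": False},
]

kept = [row["event_id"] for row in reliable_events(sample_rows)]
print(kept)
```

Looser or stricter thresholds can be passed as keyword arguments; suitable values likely depend on the downstream task and should be checked against the actual score distributions.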
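
Because each event carries a `session_id`, an `event_timestamp`, and a `cognitive_label`, the annotations can be read as per-session label sequences. The following sketch shows one way to build those sequences and count label-to-label transitions. The helper names `label_sequences` and `transition_counts` and the sample rows are hypothetical, and the code assumes all timestamps share one ISO 8601 format so that string order matches time order.

```python
from collections import Counter, defaultdict
from typing import Any, Dict, Iterable, List, Mapping, Tuple


def label_sequences(rows: Iterable[Mapping[str, Any]]) -> Dict[str, List[str]]:
    """Group rows by session_id and return each session's labels in timestamp order."""
    by_session: Dict[str, List[Mapping[str, Any]]] = defaultdict(list)
    for row in rows:
        by_session[row["session_id"]].append(row)

    sequences: Dict[str, List[str]] = {}
    for session_id, events in by_session.items():
        # Assumption: uniform ISO 8601 strings, so lexicographic order is time order.
        events.sort(key=lambda e: e["event_timestamp"])
        sequences[session_id] = [e["cognitive_label"] for e in events]
    return sequences


def transition_counts(sequences: Mapping[str, List[str]]) -> Counter:
    """Count consecutive (previous_label, next_label) pairs within each session."""
    counts: Counter = Counter()
    for labels in sequences.values():
        counts.update(zip(labels, labels[1:]))
    return counts


# Hypothetical rows that mimic the documented columns; real rows come from ds["train"].
sample_rows = [
    {"session_id": "s1", "event_timestamp": "2020-01-01T10:05:00", "cognitive_label": "FollowingScent"},
    {"session_id": "s1", "event_timestamp": "2020-01-01T10:00:00", "cognitive_label": "PoorScent"},
    {"session_id": "s1", "event_timestamp": "2020-01-01T10:09:00", "cognitive_label": "ForagingSuccess"},
    {"session_id": "s2", "event_timestamp": "2020-01-02T08:00:00", "cognitive_label": "LeavingPatch"},
]

sequences = label_sequences(sample_rows)
transitions = transition_counts(sequences)
print(sequences)
print(transitions)
```

Transitions are counted only within a session, never across session boundaries, so a single-event session contributes no pairs.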

