MedSeek_userBehavior
Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://data.mendeley.com/datasets/2hnjmzpxyd
Description:
This dataset contains de-identified, high-resolution interaction records from MedSeek, a large language model (LLM) platform optimised for medical education. MedSeek is built on the DeepSeek architecture and fine-tuned on more than 200 million clinically curated instruction–response pairs. It achieves state-of-the-art accuracy on multiple medical NLP benchmarks: MedQA (78.6%), PubMedQA (83.9%), MedMCQA (67.4%), MedBullets (72.1%), MMLU (81.2%), MMLU-Pro (79.5%) and CARE-QA (74.8%). The records are anonymized observations of platform engagement during Q2 2025, covering usage patterns across diverse medical education contexts.
#### Dataset Components:
1. **User Profiles** (`medical_llm_users.csv`)
   * 1,454 anonymized participant records
   * Role distribution: Educators (2.2%), Medical students (97.8%)
   * Discipline representation: Clinical Medicine (39%), Pharmacy (18%), Public Health (11%), Basic Medicine (26%), Nursing (4%), Medical Humanities (2%)
   * Engagement tiers: High-engagement (15%), Regular (25%), Low-frequency (40%), Dormant (20%)
2. **Session Records** (`medical_llm_sessions.csv`)
   * Platform access sessions with temporal metadata
   * Device access patterns (mobile/desktop/tablet)
   * Duration metrics and temporal distribution
   * Special annotation for examination period (May 10-24, 2025)
3. **Interaction Logs** (`medical_llm_interactions.csv`)
   * Question-Answer exchanges across medical domains
   * Six knowledge domains with topic classifications
   * Interaction types: Initial queries (35%), Follow-ups (25%), Answer review (20%), Clarifications (10%), Content saving (5%), Feedback (5%)
   * Complexity engagement metrics
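
A minimal sketch of how the three files might be read and checked against the proportions listed above, using only the Python standard library. The description does not document the column names, so `user_id`, `role`, `session_id` and `device` are hypothetical, and the inline rows are made-up placeholders rather than values from the dataset.

```python
import csv
import io
from collections import Counter


def read_rows(handle):
    """Parse an open CSV file (or any text stream) into a list of dicts."""
    return list(csv.DictReader(handle))


def share_by(rows, column):
    """Return {value: percentage of rows} for one categorical column."""
    counts = Counter(row[column] for row in rows)
    total = sum(counts.values())
    return {value: round(100 * n / total, 1) for value, n in counts.items()}


def attach_user(rows, users, key="user_id"):
    """Left-join user attributes onto session or interaction rows."""
    by_id = {user[key]: user for user in users}
    return [{**by_id.get(row[key], {}), **row} for row in rows]


# Made-up stand-ins for medical_llm_users.csv and medical_llm_sessions.csv.
# Column names are assumptions; check the real headers before reusing this.
users = read_rows(io.StringIO(
    "user_id,role\nu1,student\nu2,student\nu3,educator\nu4,student\n"
))
sessions = read_rows(io.StringIO(
    "session_id,user_id,device\ns1,u1,mobile\ns2,u1,desktop\ns3,u3,tablet\n"
))

role_share = share_by(users, "role")   # compare with the listed role split
joined = attach_user(sessions, users)  # each session row now carries the role
```

With the downloaded files, the same helpers would be called on real handles, for example `with open("medical_llm_users.csv", newline="", encoding="utf-8") as f: users = read_rows(f)`.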
#### Data Harness:
The data were harnessed through parameterized behavioral modeling grounded in established medical education frameworks. The process incorporates:
* Professionally validated medical education taxonomies
* Temporal usage distributions reflecting academic calendars
* Device access patterns aligned with mobility studies
* Knowledge domain representations mirroring standard curricula
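
The description does not say how the parameterized model is implemented. As a purely illustrative sketch of what sampling from parameterized categorical distributions can look like, the following draws user attributes using the discipline and engagement-tier percentages quoted above. The function name, record fields and ID format are hypothetical, and this is not presented as the authors' procedure.

```python
import random

# Percentages quoted in the dataset description.
DISCIPLINES = {
    "Clinical Medicine": 39,
    "Basic Medicine": 26,
    "Pharmacy": 18,
    "Public Health": 11,
    "Nursing": 4,
    "Medical Humanities": 2,
}
ENGAGEMENT_TIERS = {
    "High-engagement": 15,
    "Regular": 25,
    "Low-frequency": 40,
    "Dormant": 20,
}


def sample_users(n, seed=0):
    """Draw n user records with attributes sampled independently per field."""
    rng = random.Random(seed)  # seeded so a run can be reproduced

    def draw(distribution):
        return rng.choices(
            list(distribution), weights=list(distribution.values()), k=n
        )

    disciplines = draw(DISCIPLINES)
    tiers = draw(ENGAGEMENT_TIERS)
    return [
        {"user_id": f"U{i:04d}", "discipline": d, "engagement_tier": t}
        for i, (d, t) in enumerate(zip(disciplines, tiers), start=1)
    ]


users = sample_users(1454)  # same count as the user profile file
```

Sampling each field independently is a simplifying assumption; a fuller model would likely condition engagement on role or discipline.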
#### Potential Research Applications:
* Medical education technology adoption studies
* Temporal analysis of learning behaviors
* Domain-specific knowledge retrieval patterns
* Adaptive learning system development
* Educational data mining methodology validation
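
For the temporal-analysis use case, one possible starting point is to count sessions by hour of day and split them by the examination period (May 10-24, 2025) annotated in the session records. This sketch assumes session start times are ISO 8601 strings and treats both boundary dates as inclusive; neither detail is confirmed by the description, and the sample timestamps are made up.

```python
from collections import Counter
from datetime import date, datetime

EXAM_START = date(2025, 5, 10)
EXAM_END = date(2025, 5, 24)  # assumed inclusive


def in_exam_period(timestamp):
    """True if the timestamp falls on a day within the examination period."""
    return EXAM_START <= timestamp.date() <= EXAM_END


def sessions_per_hour(start_times):
    """Count sessions by hour of day, split into exam and regular periods."""
    counts = {"exam": Counter(), "regular": Counter()}
    for raw in start_times:
        timestamp = datetime.fromisoformat(raw)
        period = "exam" if in_exam_period(timestamp) else "regular"
        counts[period][timestamp.hour] += 1
    return counts


# Made-up timestamps for illustration only.
sample = [
    "2025-05-09T23:10:00",
    "2025-05-10T08:05:00",
    "2025-05-24T08:40:00",
    "2025-06-01T14:00:00",
]
hourly = sessions_per_hour(sample)
```

Comparing `hourly["exam"]` with `hourly["regular"]` gives a first look at whether study sessions shift to different hours around examinations.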
#### Ethical Compliance:
All identifiers represent anonymized entities. Content follows medical education standards without including real patient information or personally identifiable data. Generated text reflects generalized medical education scenarios without specific case references.
Created:
2025-08-05



