
LLMs in Genetics MCQs

DataCite Commons | Updated 2026-04-16 | Indexed 2026-05-05
Download link:
https://www.scidb.cn/detail?dataSetId=ed25b7b073854280b644bf87ff3049c4
Resource description:
Project Overview

This dataset supports a study evaluating the performance and reliability of five large language models (LLMs), namely Gemini, Claude, ChatGPT, Copilot, and DeepSeek, in answering genetics multiple-choice questions (MCQs) in a medical education context. The study used 200 USMLE-style MCQs distributed across 20 genetics topics, with each LLM completing three independent testing sessions. Questions were additionally classified by Bloom's Taxonomy level (1–4) to assess performance across levels of cognitive complexity.

Questions.csv contains 200 rows, one per MCQ. Each question is tagged with a Bloom's Taxonomy level (1–4), and the five LLM columns (Gemini, Claude, ChatGPT, Copilot, DeepSeek) give the average score across the 3 independent attempts: 1.0 means always correct, 0.0 means always incorrect, and intermediate values indicate partial success (e.g., 0.67 means the model answered correctly in 2 out of 3 attempts).

Topics.csv contains 20 rows, one per genetics topic. For each of the five LLMs there are three separate columns (A1, A2, A3) giving the percentage of correct answers in each individual testing session. This allows both within-model consistency (reliability across attempts) and between-model comparisons to be assessed at the topic level.

Together, the two files capture performance at two levels of granularity: question level (with Bloom's classification) and topic level (with an attempt-by-attempt breakdown). This enables analysis of both accuracy and test-retest reliability across cognitive complexity levels and subject areas.
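The file layout described above lends itself to a few simple summaries. The following is a minimal Python sketch, not the study's own analysis code. It assumes the Bloom's level is stored in a column named `Bloom` (a hypothetical name; the description does not give the header), and it runs on a small synthetic sample rather than on the real Questions.csv. The helper names `attempts_correct`, `mean_score_by_bloom`, and `session_spread` are illustrative only.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Model column names as listed in the dataset description.
MODELS = ["Gemini", "Claude", "ChatGPT", "Copilot", "DeepSeek"]


def attempts_correct(avg_score, n_attempts=3):
    """Convert an average score (e.g. 0.67) back to a count of correct attempts."""
    return round(float(avg_score) * n_attempts)


def mean_score_by_bloom(rows, model, bloom_col="Bloom"):
    """Mean per-question score for one model, grouped by Bloom's level.

    `bloom_col` is an assumed header name; adjust it to match the real file.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[int(row[bloom_col])].append(float(row[model]))
    return {level: mean(scores) for level, scores in sorted(groups.items())}


def session_spread(a1, a2, a3):
    """Max minus min percentage across the three sessions (0 = identical sessions)."""
    values = [float(a1), float(a2), float(a3)]
    return max(values) - min(values)


# Synthetic illustration rows, NOT taken from the dataset.
SAMPLE = """Bloom,Gemini,Claude
1,1.0,1.0
1,0.67,1.0
3,0.33,0.67
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
by_bloom = mean_score_by_bloom(rows, "Gemini")
```

To use this on the real data, replace the `io.StringIO(SAMPLE)` source with an opened Questions.csv file and check the actual header names first. `session_spread` is one crude stand-in for within-model consistency on the Topics.csv A1, A2, A3 columns; a formal test-retest statistic would need a different approach.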
Provider:
Science Data Bank
Created:
2026-04-16