LLMs in Genetics MCQs
Source: DataCite Commons. Updated: 2026-04-16. Indexed: 2026-05-05.
Download link:
https://www.scidb.cn/detail?dataSetId=ed25b7b073854280b644bf87ff3049c4
Description:
Project Overview

This dataset supports a study evaluating the performance and reliability of five large language models (LLMs), namely Gemini, Claude, ChatGPT, Copilot, and DeepSeek, in answering genetics multiple-choice questions (MCQs) in a medical education context. The study used 200 USMLE-style MCQs distributed across 20 genetics topics, with each LLM completing three independent testing sessions. Questions were additionally classified by Bloom's Taxonomy levels (1–4) to assess performance across cognitive complexity levels.

Questions.csv contains 200 rows, one per MCQ. Each question is tagged with a Bloom's Taxonomy level (1–4), and the five LLM columns (Gemini, Claude, ChatGPT, Copilot, DeepSeek) show the average score across 3 independent attempts, where 1.0 = always correct, 0.0 = always incorrect, and values in between (e.g., 0.67) indicate that the model answered correctly in only some attempts (0.67 = 2 out of 3).

Topics.csv contains 20 rows, one per genetics topic. For each of the five LLMs, there are three separate columns (A1, A2, A3) giving the percentage of correct answers in each individual testing session. This allows both within-model consistency (reliability across attempts) and between-model comparisons to be assessed at the topic level.

Together, the two files capture performance at two levels of granularity: question-level (with Bloom's classification) and topic-level (with attempt-by-attempt breakdown), enabling analysis of both accuracy and test-retest reliability across cognitive complexity levels and subject areas.
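As a rough illustration of how the two files described above might be read, the following Python sketch converts a question-level average score back into a count of correct attempts, averages one model's scores per Bloom level, and computes the spread between sessions for one topic. The column names (`Bloom`, `Topic`, `Gemini_A1`, and so on) are assumptions, since the description does not give the exact headers, and the inline rows are invented placeholders, not values from the dataset.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Invented placeholder rows for illustration only; NOT values from the dataset.
# The header names "Bloom", "Topic" and "<Model>_A1..A3" are assumptions.
SAMPLE_QUESTIONS = """Bloom,Gemini,Claude
1,1.0,1.0
1,1.0,0.67
3,0.67,0.33
"""

SAMPLE_TOPICS = """Topic,Gemini_A1,Gemini_A2,Gemini_A3
Hypothetical topic X,90,100,90
"""


def correct_attempts(avg_score, attempts=3):
    """Convert an average score (e.g. 0.67) into a count of correct attempts."""
    return round(float(avg_score) * attempts)


def mean_score_by_bloom(rows, model):
    """Average one model's question-level scores within each Bloom level."""
    by_level = defaultdict(list)
    for row in rows:
        by_level[int(row["Bloom"])].append(float(row[model]))
    return {level: mean(scores) for level, scores in sorted(by_level.items())}


def session_range(row, model, sessions=("A1", "A2", "A3")):
    """Spread (max - min) of a model's per-session percentages for one topic."""
    values = [float(row[f"{model}_{s}"]) for s in sessions]
    return max(values) - min(values)


question_rows = list(csv.DictReader(io.StringIO(SAMPLE_QUESTIONS)))
topic_rows = list(csv.DictReader(io.StringIO(SAMPLE_TOPICS)))

print(correct_attempts("0.67"))
print(mean_score_by_bloom(question_rows, "Gemini"))
print(session_range(topic_rows[0], "Gemini"))
```

To run this against the real files, replace the `io.StringIO(...)` sources with `open("Questions.csv", newline="")` and `open("Topics.csv", newline="")`, and adjust the column names to match the actual headers. The max-minus-min spread is only one simple way to look at consistency across sessions; it is not necessarily the reliability measure used in the study.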
Provider:
Science Data Bank
Created:
2026-04-16



