Supporting Metadata Curation from Public Life Science Databases Using Open-Weight Large Language Models

Source: Figshare. Updated: 2026-02-17. Indexed: 2026-04-28.

Download link:

https://figshare.com/articles/dataset/Supporting_Metadata_Curation_from_Public_Life_Science_Databases_Using_Open-Weight_Large_Language_Models/30265717

Resource description:

Supplementary Figures

Supplementary Figure 1: Distributions of self-reported positive probabilities for each model (Prompt 2)
Distribution of the probability p output by each model when predicting that a project is “positive.” The figure highlights the frequency of ambiguous outputs near p = 0.5 and differences in confidence behavior across models.

Supplementary Figure 2: Runtime comparison across models and reasoning settings
Processing time per project for locally executed open-weight models and closed models accessed via APIs. The figure shows how model size, mixture-of-experts (MoE) vs. dense architectures, and thinking/reasoning settings affect throughput.

Supplementary Tables

Supplementary Table 1: Detailed runtime statistics for all models
Summary statistics of processing times for each model and setting. These data complement the runtime comparison shown in Supplementary Figure 2.

Supplementary Table 2: Example outputs of sample attribute extraction from metadata
Example tabular extraction generated by the LLM from project/sample metadata for projects classified as positive. The output includes user-specified items (genotype, tissue, treatment method, ABA concentration, treatment duration, etc.) and “Unspecified” outputs when metadata are missing.

Supplementary Table 3: List of projects retrieved by keyword searches from the database
A list of projects programmatically retrieved for the specified keywords and used in the metadata classification task by the language models.

Supplementary Table 4: Ground-truth labels of the benchmark dataset
Binary labels (positive/negative) curated by humans based on explicit statements in the integrated metadata text. Used as the reference for evaluating LLM classification outputs.

Supplementary Files

Supplementary File 1: Integrated metadata text inputs used for LLM inference
Integrated text combining project-level overviews and sample-level metadata for each project.

Supplementary File 2: Prompts used for the classification and extraction tasks
Full text of Prompt 1 and Prompt 2 used for the classification task, as well as the prompt used for the extraction task.

Supplementary File 3: Workflow outputs and LLM-generated results for all model conditions
Output directories and files generated by the workflow, including programmatically retrieved metadata, JSON outputs for each model and prompt, and related outputs.
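
The description mentions two things a reader may want to reproduce: comparing each model's self-reported probability p against the human-curated binary labels, and counting ambiguous outputs near p = 0.5. The Python sketch below illustrates one way this could be done. It is not the authors' workflow. The function name `evaluate`, the project IDs, the 0.5 decision threshold, and the ±0.1 "ambiguous" band are all assumptions made for illustration, and the actual file formats in the supplementary materials are not described here, so the inputs are plain in-memory dictionaries.

```python
from typing import Dict


def evaluate(
    truth: Dict[str, bool],
    probs: Dict[str, float],
    threshold: float = 0.5,  # assumed cut-off for calling a project "positive"
    band: float = 0.1,       # assumed half-width of the "ambiguous" zone around 0.5
) -> Dict[str, float]:
    """Score probability outputs against binary ground-truth labels.

    truth maps a project ID to its human-curated label (True = positive).
    probs maps the same project IDs to the model's self-reported probability p.
    """
    tp = fp = fn = tn = ambiguous = 0
    for project_id, is_positive in truth.items():
        p = probs[project_id]
        predicted_positive = p >= threshold
        # Count outputs that sit close to p = 0.5.
        if abs(p - 0.5) <= band:
            ambiguous += 1
        if predicted_positive and is_positive:
            tp += 1
        elif predicted_positive and not is_positive:
            fp += 1
        elif not predicted_positive and is_positive:
            fn += 1
        else:
            tn += 1

    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": (tp + tn) / total if total else 0.0,
        "ambiguous_fraction": ambiguous / total if total else 0.0,
    }


# Hypothetical project IDs and values, for illustration only.
example_truth = {"PRJ_A": True, "PRJ_B": False, "PRJ_C": True, "PRJ_D": False}
example_probs = {"PRJ_A": 0.9, "PRJ_B": 0.2, "PRJ_C": 0.45, "PRJ_D": 0.7}
example_scores = evaluate(example_truth, example_probs)
```

Keeping the threshold and the ambiguity band as parameters makes it easy to check how sensitive the scores are to those choices, which matters when many outputs cluster near p = 0.5.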

Created:

2026-02-17
