Not so weak-PICO: Leveraging weak supervision for Participants, Interventions, and Outcomes recognition for systematic review automation

NIAID Data Ecosystem, indexed 2026-03-14

Download link:

http://datadryad.org/dataset/doi%253A10.5061%252Fdryad.ncjsxkszr

Description:

Objective: PICO (Participants, Interventions, Comparators, Outcomes) analysis is vital but time-consuming when conducting systematic reviews (SRs). Supervised machine learning can help fully automate it, but a lack of large annotated corpora limits the quality of automated PICO recognition systems. The largest currently available PICO corpus is manually annotated, an approach that is often too expensive for the scientific community to apply. Depending on the specific SR question, PICO criteria are extended to PICOC (C-Context), PICOT (T-Timeframe), and PIBOSO (B-Background, S-Study design, O-Other), meaning that static hand-labelled corpora need to undergo costly re-annotation to match downstream requirements. We aim to test the feasibility of designing a weak supervision system to extract these entities without hand-labelled data.

Methodology: We decompose PICO spans into their constituent entities and re-purpose multiple medical and non-medical ontologies and expert-generated rules to obtain multiple noisy labels for these entities. The labels obtained from these sources are then aggregated using simple majority voting and generative modelling approaches. The resulting programmatic labels are used as weak signals to train a weakly-supervised discriminative model and observe performance changes. We also explore mistakes in the currently available PICO corpus that could have led to inaccurate evaluation of several automation methods.

Results: We present Weak-PICO, a weakly-supervised PICO entity recognition approach using medical and non-medical ontologies, dictionaries, and expert-generated rules. Our approach does not use hand-labelled data.

Conclusion: Weak supervision using Weak-PICO for PICO entity recognition has encouraging results, and the approach can potentially extend readily to more clinical entities.

Methods: This upload contains four main files.

ds_cto_dict.zip: This zip file contains the four distant supervision dictionaries (P: participant.txt; I: intervention.txt and intervetion_syn.txt; O: outcome.txt) generated from clinicaltrials.gov using the methodology described in Distant-CTO (https://aclanthology.org/2022.bionlp-1.34/). These dictionaries were used to create distant supervision labelling functions as described in the Labelling sources subsection of the Methodology. The data was derived from https://clinicaltrials.gov/

handcrafted_dictionaries.zip: This zip folder contains three files: 1) gender_sexuality.txt: a list of possible genders and sexual orientations found across the web. The list needs to be more comprehensive. 2) endpoints_dict.txt: outcome names and the names of questionnaires used to measure outcomes, assembled from PROM questionnaires and PROMs. 3) comparator_dict: a list of idiosyncratic comparator terms such as sham, saline, and placebo, compiled from a literature search. The list needs to be more comprehensive.

test_ebm_correctedlabels.tsv: EBM-PICO is a widely used dataset with PICO annotations at two levels: span-level (coarse-grained) and entity-level (fine-grained). Span-level annotations encompass the full information about each class. Entity-level annotations cover more fine-grained information, with PICO classes further divided into fine-grained subclasses. For example, the coarse-grained Participant span is further divided into participant age, gender, condition, and sample size in the randomised controlled trial. The dataset comes pre-divided into a training set (n=4,933) annotated through crowd-sourcing and an expert-annotated gold test set (n=191) for evaluation. The EBM-PICO annotation guidelines caution about variable annotation quality. Abaho et al. developed a framework to correct EBM-PICO outcomes annotation inconsistencies post hoc. Lee et al. studied annotation span disagreements, suggesting variability across the annotators. Low annotation quality in the training dataset is excusable, but errors in the test set can lead to faulty evaluation of downstream ML methods. We evaluate 1% of the EBM-PICO training set tokens to gauge the possible reasons for the fine-grained labelling errors and use this exercise to conduct an error-focused PICO re-annotation of the EBM-PICO gold test set. The file 'test_ebm_correctedlabels.tsv' contains the error-corrected EBM-PICO gold test set. This dataset could be used as a complementary evaluation set alongside the EBM-PICO test set.

error_analysis.zip: This .zip file contains three .tsv files, one for each PICO class, used to identify possible errors in about 1% (about 12,962 tokens) of the EBM-PICO training set.
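
To make the Methodology's labelling and aggregation steps concrete, here is a minimal Python sketch of token-level dictionary labelling functions combined by simple majority voting. It is an illustration under stated assumptions, not the authors' implementation: the function names, the label encoding (`ABSTAIN`, `NEGATIVE`, `POSITIVE`), the tie-breaking rule, and the tiny term lists are all hypothetical, and the generative modelling aggregation mentioned in the abstract is not shown. Only the comparator terms (sham, saline, placebo) come from the description above.

```python
from collections import Counter

# Hypothetical label encoding for a single entity class (e.g. "comparator").
ABSTAIN = -1
NEGATIVE = 0
POSITIVE = 1


def make_dictionary_lf(terms, label):
    """Build a labelling function that emits `label` for tokens found in
    `terms` (case-insensitive) and abstains on every other token."""
    vocab = {term.lower() for term in terms}

    def labelling_function(tokens):
        return [label if tok.lower() in vocab else ABSTAIN for tok in tokens]

    return labelling_function


def majority_vote(votes_per_lf):
    """Aggregate per-token votes from several labelling functions.

    Abstentions are ignored. A token with no votes, or with a tie between
    the two most frequent labels, is left as ABSTAIN (an assumed rule).
    """
    aggregated = []
    for token_votes in zip(*votes_per_lf):
        counts = Counter(v for v in token_votes if v != ABSTAIN)
        ranked = counts.most_common()
        if not ranked:
            aggregated.append(ABSTAIN)
        elif len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            aggregated.append(ABSTAIN)
        else:
            aggregated.append(ranked[0][0])
    return aggregated


# Toy example: three noisy sources voting on one tokenised sentence.
tokens = ["Patients", "received", "placebo", "or", "saline"]
labelling_functions = [
    make_dictionary_lf(["sham", "saline", "placebo"], POSITIVE),
    make_dictionary_lf(["placebo"], POSITIVE),
    make_dictionary_lf(["received", "or"], NEGATIVE),  # assumed non-entity words
]
votes = [lf(tokens) for lf in labelling_functions]
labels = majority_vote(votes)
```

In this sketch the aggregated `labels` could serve as the weak training signal for a downstream discriminative tagger. Multi-word dictionary entries would need span matching rather than the single-token lookup used here.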

Created:

2022-12-13