Automated data extraction of unstructured and grey literature data in health research: a mapping review of the current research literature

NIAID Data Ecosystem, indexed 2026-05-01

Download link:

https://doi.org/10.7910/DVN/7N2YWZ

Description:

Background: The amount of grey literature and ‘softer’ intelligence from social media or websites is increasing. Given the long lead times involved in producing high-quality peer-reviewed health information, this is creating demand for new and creative ways to provide prompt input for secondary research. Automated data extraction could make unstructured data more accessible to individuals conducting various types of literature reviews. This is the first review of automated data extraction from health-related grey literature and soft intelligence, with a focus on (semi)automating horizon scans, health technology assessments, evidence maps, or other secondary literature reviews.
Methods: We searched MEDLINE (PubMed), Scopus, ACL Anthology, dblp, arXiv (computer science), and MedRxiv to cover both health and computer science literature sources. After deduplication, 10% of the search results were screened by two reviewers; the remainder was screened by a single reviewer up to an estimated 95% sensitivity, and screening was stopped early once an additional 1000 results had been screened without finding another included paper. All full texts were retrieved, screened, and extracted by a single reviewer, and 10% were checked by a second reviewer.
Results: We included 84 papers describing 7 tools and 76 methods, covering automation for health-related social media, internet fora, news, patents, government agencies and charities, and trial registers. From each paper we extracted data for three research questions: first, functionalities important to users of the tool or method, including features, mined data sources, and types of mined data (such as ‘diseases’); second, information about the level of support, such as tool availability, evaluation metrics and results, and extent of automation (such as document classification or entity recognition); third, information about practical challenges and research gaps.
Conclusions: Poor availability of code, data, and end-user tools leads to low transparency regarding performance and to duplication of work in the field of automated data extraction from grey literature and soft intelligence. Financial implications, scalability of proposed methods, integration into downstream review workflows, and meaningful quantitative and qualitative evaluations should be carefully planned before starting to develop an automation tool, given the vast amounts of data and the opportunities those tools offer to inform secondary research.

Created:

2023-06-22