Supplementary dataset and reproducible codes for LLM-assisted mapping feedstocks of eight conversion technologies from over 121,000 studies

DataONE, updated 2026-01-16, indexed 2026-01-24

Download link:

https://search.dataone.org/view/sha256:b5e3078ed578bd3ac60880f4c06b44c41c94ca1817c9302cb85994f099499e11

Resource description:

This dataset was developed to systematically characterise feedstock–technology relationships across eight major biomass conversion technologies by mining a large Scopus-derived bibliographic corpus (1887–2025; partial coverage for 2025). The workflow is LLM-assisted and fully reproducible. It combines automated extraction of feedstock and technology phrases from bibliographic text fields (titles, abstracts, and keywords) with rule-based cleaning and a subsequent LLM-based validation step, followed by targeted manual curation for final release. The dataset is intended for technology landscape analyses, evidence synthesis, and comparative assessments of biomass conversion pathways, where consistent and traceable feedstock descriptors are required across a very large volume of studies. A data descriptor titled "A large-scale, LLM-assisted and validated dataset of biomass and waste conversion technologies and feedstocks" will be published based on this dataset, with the following abstract:

Biomass, organic wastes and biogenic by-products are increasingly targeted for low-carbon fuels and value-added chemicals. However, strategic decision-making from a circular economy perspective requires a big-picture view of the relative significance of different conversion technologies in handling diverse feedstock portfolios, and no large-scale, cross-technology mapping of these portfolios is currently available. Thus, a literature-derived dataset was assembled that links eight major waste-to-x valorisation technologies (gasification, pyrolysis, hydrothermal liquefaction, torrefaction, anaerobic digestion, aerobic digestion, fermentation and transesterification) to their reported feedstocks. Using the Scopus database, 121,365 records were retrieved with harmonised search strings, spanning publications from 1887 to 2025. This constrained yet scalable search strategy both facilitates automated extraction and validation and yields a rich dataset. Further, a large language model (LLM)-assisted workflow was implemented to extract candidate technology and feedstock phrases, followed by a two-level validation that combines rule-based cleaning with targeted LLM re-evaluation to minimise manual curation. The resulting dataset provides technology-specific, validated feedstock descriptors that support comparative analyses and decision-support applications in a circular bioeconomy context.
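As a rough illustration of the rule-based part of such a workflow, the sketch below tags a bibliographic text field with the eight technologies named above and normalises a candidate feedstock phrase. It is not the dataset's released code: the function names `find_technologies` and `clean_feedstock` and the specific cleaning rules are hypothetical, chosen only to show the kind of step the description refers to. The LLM extraction and re-evaluation stages are not modelled here.

```python
import re

# The eight conversion technologies listed in the dataset description.
TECHNOLOGIES = [
    "gasification", "pyrolysis", "hydrothermal liquefaction", "torrefaction",
    "anaerobic digestion", "aerobic digestion", "fermentation",
    "transesterification",
]


def find_technologies(text):
    """Return the technologies mentioned in a title, abstract, or keyword string.

    Hypothetical helper: matching is case-insensitive and requires that the
    phrase is not embedded in a longer word, so "aerobic digestion" is not
    reported when the text only says "anaerobic digestion".
    """
    lowered = text.lower()
    found = []
    for tech in TECHNOLOGIES:
        pattern = r"(?<![a-z])" + re.escape(tech) + r"(?![a-z])"
        if re.search(pattern, lowered):
            found.append(tech)
    return found


def clean_feedstock(phrase):
    """Normalise a candidate feedstock phrase (assumed rules, for illustration).

    Lowercases, replaces punctuation other than hyphens with spaces, and
    collapses whitespace. Returns None for candidates that end up empty or
    that are themselves a technology name.
    """
    phrase = phrase.lower()
    phrase = re.sub(r"[^a-z0-9\s\-]", " ", phrase)
    phrase = re.sub(r"\s+", " ", phrase).strip()
    if not phrase or phrase in TECHNOLOGIES:
        return None
    return phrase


if __name__ == "__main__":
    title = "Anaerobic digestion of food waste"
    print(find_technologies(title))
    print(clean_feedstock("  Food  Waste, "))
```

In a fuller pipeline, phrases that survive this kind of cleaning would presumably be passed to the LLM re-evaluation step and then to manual curation, as the description outlines.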

Created:

2026-01-17