Improving Scientific Information Extraction with Text Generation

Name: Improving Scientific Information Extraction with Text Generation
Creator: University of Notre Dame
Published: 2025-04-09 14:43:49
License: No description available

DataCite Commons. Updated: 2025-04-09. Indexed: 2025-04-17.

Download link:

https://curate.nd.edu/articles/dataset/Improving_Scientific_Information_Extraction_with_Text_Generation/28571045/1

Resource description:

As research communities expand, the number of scientific articles continues to grow rapidly, with no sign of slowing. This information overload drives the need for automated tools that identify relevant materials and extract key ideas. Information extraction (IE) converts unstructured scientific text into structured knowledge (e.g., ontologies, taxonomies, and knowledge graphs), enabling intelligent systems to excel at tasks such as document organization, scientific literature retrieval and recommendation, claim verification, and even the generation of novel ideas or hypotheses. To narrow the scope of this thesis, I focus on taxonomic structure as the representation of knowledge in the scientific domain. To construct a taxonomy from scientific corpora, traditional methods often rely on pipeline frameworks. These frameworks typically follow a sequence: first, extracting scientific concepts or entities from the corpus; second, identifying hierarchical relationships between the concepts; and finally, organizing these relationships into a cohesive taxonomy. However, such methods face several challenges: (1) dependence on the quality of the corpus or annotation data, (2) error propagation within the pipeline framework, and (3) limited generalization and transferability to other specific domains. The development of large language models (LLMs) offers promising advancements, as these models have demonstrated remarkable abilities to internalize knowledge and respond effectively to a wide range of inquiries. Unlike traditional pipeline-based approaches, generative methods harness LLMs to achieve (1) better utilization of their internalized knowledge, (2) direct text-to-knowledge conversion, and (3) flexible, schema-free adaptability. This thesis explores methods for integrating text generation technologies to improve IE in the scientific domain, with a focus on taxonomy construction.
The approach begins with generating entity names and evolves to create or enrich taxonomies directly via text generation. I will explore combining neighborhood structural context, descriptive textual information, and LLMs' internal knowledge to improve output quality. Finally, this thesis will outline future research directions.
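The three-stage pipeline that the abstract attributes to traditional methods (concept extraction, hierarchical relation identification, taxonomy assembly) can be illustrated with a minimal sketch. This is not the thesis's method or the dataset's code: the toy corpus, the concept vocabulary, the function names, and the use of a "such as" lexical pattern for relation detection are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical toy corpus and vocabulary; neither comes from the dataset.
CORPUS = [
    "Neural networks such as convolutional networks and recurrent networks are widely used.",
    "Machine learning methods such as neural networks and decision trees learn from data.",
]
VOCABULARY = [
    "machine learning methods",
    "neural networks",
    "decision trees",
    "convolutional networks",
    "recurrent networks",
]


def extract_concepts(sentence, vocabulary):
    """Stage 1: dictionary matching; map each known concept to its position."""
    text = sentence.lower()
    return {c: text.find(c) for c in vocabulary if c in text}


def extract_relations(sentence, vocabulary):
    """Stage 2: 'X such as Y and Z' yields (parent, child) pairs."""
    text = sentence.lower()
    marker = text.find(" such as ")
    if marker == -1:
        return []
    positions = extract_concepts(sentence, vocabulary)
    parents = [c for c, p in positions.items() if p < marker]
    children = [c for c, p in positions.items() if p > marker]
    if not parents:
        # A concept missed in stage 1 silently removes its relations here.
        return []
    parent = max(parents, key=positions.get)  # closest concept before the marker
    return [(parent, child) for child in sorted(children, key=positions.get)]


def build_taxonomy(edges):
    """Stage 3: organize (parent, child) pairs into a parent -> children map."""
    tree = defaultdict(list)
    for parent, child in edges:
        if child not in tree[parent]:
            tree[parent].append(child)
    return dict(tree)


def find_roots(tree):
    """Roots are parents that never appear as a child."""
    all_children = {c for cs in tree.values() for c in cs}
    return [p for p in tree if p not in all_children]


edges = [e for s in CORPUS for e in extract_relations(s, VOCABULARY)]
taxonomy = build_taxonomy(edges)
```

Even in this toy form, the second challenge listed in the abstract is visible: stage 2 only sees concepts that stage 1 found, so an incomplete vocabulary propagates as missing edges in the final taxonomy.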

Provider:

University of Notre Dame

Created:

2025-04-09
