TextTransfer: Datasets for Impact Detection

Name: TextTransfer: Datasets for Impact Detection
Creator: Illinois Data Bank
License: No description available

Indexed from doi.org on 2025-01-16

Download link:

https://doi.org/10.13012/B2IDB-9934303_V1


Description:

Impact assessment is an evolving area of research that aims to measure and predict the potential effects of projects or programs. Measuring the impact of scientific research is a vibrant subdomain, closely intertwined with impact assessment. A recurring obstacle is the lack of an efficient framework for analyzing lengthy reports and labeling text. To address this issue, we propose a framework for automatically assessing the impact of scientific research projects by identifying the sections of project reports that indicate potential impacts. We use a mixed-method approach, combining manual annotation with supervised machine learning, to extract these passages from project reports.

This repository stores the datasets and code related to this project. Please read and cite the following paper if you would like to use the data:

Becker M., Han K., Werthmann A., Rezapour R., Lee H., Diesner J., and Witt A. (2024). Detecting Impact Relevant Sections in Scientific Research. The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING).

This folder contains the following files:

evaluation_20220927.ods: Annotated German passages (Artificial Intelligence, Linguistics, and Music) - training data
annotated_data.big_set.corrected.txt: Annotated German passages (Mobility) - training data
incl_translation_all.csv: Annotated English passages (Artificial Intelligence, Linguistics, and Music) - training data
incl_translation_mobility.csv: Annotated German passages (Mobility) - training data
ttparagraph_addmob.txt: German corpus (unannotated passages)
model_result_extraction.csv: Impact-relevant passages extracted from the German corpus by the model we trained
rf_model.joblib: The random forest model we trained to extract impact-relevant passages

Data processing code can be found at: https://github.com/khan1792/texttransfer
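The file listing above can be checked against a local download with a short script. The following is a minimal sketch using only the Python standard library. It assumes all seven files sit directly in one folder; the folder name `texttransfer_data` and the helper `missing_files` are hypothetical and not part of the dataset or its repository.

```python
from pathlib import Path

# File names and short descriptions, copied from the dataset listing.
EXPECTED_FILES = {
    "evaluation_20220927.ods": "annotated German passages (AI, Linguistics, Music), training data",
    "annotated_data.big_set.corrected.txt": "annotated German passages (Mobility), training data",
    "incl_translation_all.csv": "annotated English passages (AI, Linguistics, Music), training data",
    "incl_translation_mobility.csv": "annotated passages (Mobility), training data",
    "ttparagraph_addmob.txt": "German corpus (unannotated passages)",
    "model_result_extraction.csv": "passages extracted from the German corpus by the model",
    "rf_model.joblib": "trained random forest model",
}


def missing_files(folder):
    """Return the expected file names that are not present in `folder`.

    Assumption: the download is a flat folder with no subdirectories.
    """
    root = Path(folder)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    # "texttransfer_data" is a hypothetical local folder name.
    for name in missing_files("texttransfer_data"):
        print(f"missing: {name} ({EXPECTED_FILES[name]})")
```

An empty result from `missing_files` means every file named in the listing was found. The script does not validate file contents or formats.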


Provider:

Illinois Data Bank
