Aligning Keywords from Long Form Prose to Controlled Vocabulary
Source: osf.io | Updated: 2024-08-19 | Indexed: 2025-03-22
Download link:
https://osf.io/2kub9
Description:
HIVE-4-MAT is a linked-data automatic indexing application for vocabularies related to materials science. Over the past few months, its keyword alignment algorithm has been reworked to be faster, more accurate, and more flexible at the expense of precision. This presentation reports on the lessons learned while refactoring that algorithm. Because HIVE-4-MAT has a fairly broad scope, it is a good use case for analyzing a keyword alignment pipeline end to end: from scraping raw article text, to keyword extraction, to keyword matching and alignment. The presentation will touch on topics such as common pitfalls of web scraping, different strategies for preparing raw text for keyword extraction, the differing goals of keyword extraction and keyword alignment, and the potential benefits and drawbacks of using string distance in keyword alignment algorithms.
Provider:
Center For Open Science



