Dictionary Based Annotation at Scale with Spark SolrTextTagger and OpenNLP
Source: doi.org (indexed 2025-01-16)
Download link:
http://doi.org/10.17632/4xdkh7xdtt.1
Description:
Dictionary matching is the inverse of full-text search: it is the problem of finding every occurrence of each string from a list in a single document. This is easy when the list is small, but far from trivial when there are millions of patterns to search for. We describe a system that annotates large volumes of text held in Spark DataFrames, using Solr to hold one or more dictionaries. The system supports tagging of exact matches in the incoming text using SolrTextTagger, a Solr plugin that wraps Lucene’s Finite State Transducer (FST) technology to provide a very low-memory matcher implementation. It also supports fuzzy tagging, using OpenNLP to chunk the incoming text into phrases and then matching various normalized forms of each phrase against the dictionary. The functionality is accessed from Spark via a map() call and returns a list of 4-tuples, each consisting of the start and end character offsets of the match in the text, the ID of the entity that matched, and a confidence score between 0 and 1 that indicates how closely the dictionary entity matches the text segment. A modest Solr setup with 8 GB RAM and 30 GB disk can support up to 120 million dictionary entries, from one or more dictionaries, on a single box. Near-infinite horizontal scaling can be achieved by routing specific sets of dictionaries to specific boxes.
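To make the output format concrete, here is a minimal in-memory sketch of dictionary matching that returns the same kind of 4-tuples (start offset, end offset, entity ID, confidence). It is not the system's implementation: it uses a plain Python dict in place of Solr and SolrTextTagger, lowercasing as the only normalized form in place of the OpenNLP-based fuzzy path, and a built-in map() over a list in place of Spark. The names build_index and annotate, the sample dictionary, and the 0.75 score for a case-insensitive match are all hypothetical choices for illustration.

```python
import re
from typing import Dict, List, Tuple

# (start offset, end offset, entity ID, confidence in [0, 1])
Annotation = Tuple[int, int, str, float]

TOKEN_RE = re.compile(r"\w+")


def build_index(dictionary: Dict[str, str]) -> Dict[Tuple[str, ...], Tuple[str, str]]:
    """Map each lowercased token sequence to (entity_id, original surface form)."""
    index = {}
    for surface, entity_id in dictionary.items():
        key = tuple(tok.lower() for tok in TOKEN_RE.findall(surface))
        if key:
            index[key] = (entity_id, surface)
    return index


def annotate(text: str, index, max_tokens: int = 5) -> List[Annotation]:
    """Greedy longest-match tagging of dictionary entries in `text`."""
    tokens = [(m.group(), m.start(), m.end()) for m in TOKEN_RE.finditer(text)]
    annotations: List[Annotation] = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_tokens, len(tokens) - i), 0, -1):
            key = tuple(tok[0].lower() for tok in tokens[i:i + n])
            hit = index.get(key)
            if hit is not None:
                entity_id, surface = hit
                start, end = tokens[i][1], tokens[i + n - 1][2]
                # Assumption: 1.0 for an exact surface match, 0.75 when the
                # match only holds after lowercasing (arbitrary toy score).
                confidence = 1.0 if text[start:end] == surface else 0.75
                annotations.append((start, end, entity_id, confidence))
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return annotations


# Hypothetical dictionary: surface form -> entity ID.
index = build_index({"heart attack": "E1", "Aspirin": "E2"})
docs = ["Aspirin after a Heart Attack", "no entities here"]

# Stand-in for the Spark map() call described above.
results = list(map(lambda doc: annotate(doc, index), docs))
print(results)
```

A dict lookup per candidate phrase is enough to show the tuple format, but it keeps every dictionary entry as a Python object in memory, which is the cost the FST-based matcher in the described system is meant to avoid.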
Provider:
doi.org



