Elsevier's data and code for the bioCADDIE 2016 Dataset Retrieval Challenge

Source: NIAID Data Ecosystem (indexed 2026-03-10)
Download link:
https://data.mendeley.com/datasets/zd9dxpyybg
Description:
The Elsevier DataSearch (https://datasearch.elsevier.com) team participated in the bioCADDIE 2016 Dataset Retrieval Challenge. The results of the Challenge, along with the example and test queries, can be found here: https://biocaddie.org/biocaddie-2016-dataset-retrieval-challenge

We have submitted a paper detailing our work in the Challenge to DATABASE: The Journal of Biological Databases and Curation (to be published in the latter half of 2017). The attached file, elsevier-submission.zip, contains elsevier[1-5].txt, which correspond to the five submitted runs described in the paper.

The following describes the code that we developed for the Challenge.

Aspire Content Processing by Search Technologies (https://www.searchtechnologies.com/en-gb/aspire):
- Dictionary.xml: Loads dictionaries (MeSH, Genes, Solr fields) into Aspire so that they can be used to identify concepts in text (document or query).
- QueryAnalyzer.xml: Receives a query, identifies concepts using the dictionaries, and returns a response containing information about the concepts in the query.
- ProcessJSON.xml: Processes the JSON documents (flattens the metadata; identifies MeSH and Gene concepts and embeds them in the text; prepares the document to be indexed by Solr).
- ProcessJSONSimple.xml: Sends JSON documents previously created by ProcessJSON.xml to Solr without any further processing, which is much quicker than running ProcessJSON.xml again; prepares the document to be indexed by Solr.

All other aspects of Aspire (the Aspire framework, the content source that processes a folder of JSON files, submission to Solr) are standard Aspire features with no customisation.

Solr:
- Biocaddie.qpl: QPL file for processing a search query by sending a request to QueryAnalyzer.xml in Aspire, parsing the response, and constructing a Lucene query.
- Elsevier-solr.zip: Java project for a custom Solr Token Filter that indexes concept IDs in the same position as the words to which they relate.

All other aspects of Solr are standard Solr or QPL.

Dictionary Creation:
- MeSH.groovy: Groovy script to convert a MeSH dictionary in ASCII format into a dictionary which can be used in Aspire.
- Genes.groovy: Groovy script to convert a Gene dictionary into a dictionary which can be used in Aspire.

The file biocaddie-infosys-master_files.zip contains the following:
- SolrQueryGen: Generates Solr queries from text. It supports unigram, gazetteer lookup, lemmatisation, and word embedding expansion.
- JudgementUI: UI for bioCADDIE manual judgments.

Additional utilities:
- NLP4J: Natural language parsing (tokenisation, lemmatisation, part-of-speech tagging, etc.).
- PseudoRelevanceFeedback: Another approach, but not integrated.
- BioCaddieSpark: Apache Spark jobs to load and process data and index it into Solr.
- BioCaddieServices: Backend services for JudgementUI.

Any questions about the code should be directed to datasearch-support@elsevier.com.
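To illustrate the idea behind the custom Token Filter in Elsevier-solr.zip (concept IDs indexed in the same position as the words to which they relate), here is a minimal sketch in Python. It is not the Java code from the archive: the function name, the tiny dictionary, and the concept IDs are all hypothetical, and it only handles single-word concepts. In Lucene terms, this corresponds to emitting the concept token with a position increment of zero, so that phrase and proximity queries treat the word and its concept ID as occupying the same slot.

```python
from typing import Dict, List, Tuple

# Hypothetical mini-dictionary: lower-cased word -> concept ID.
# The IDs are invented for illustration; they are not real MeSH or Gene IDs.
CONCEPTS: Dict[str, str] = {
    "cancer": "CONCEPT_0001",
    "brca1": "CONCEPT_0002",
}


def tokens_with_concepts(text: str, concepts: Dict[str, str]) -> List[Tuple[str, int]]:
    """Return (token, position) pairs for a whitespace-tokenised text.

    When a word is found in the dictionary, its concept ID is emitted
    immediately after it with the SAME position, mimicking a token
    stacked on top of the original word in the index.
    """
    output: List[Tuple[str, int]] = []
    for position, word in enumerate(text.lower().split()):
        output.append((word, position))
        concept_id = concepts.get(word)
        if concept_id is not None:
            # Same position as the word itself (position increment of zero).
            output.append((concept_id, position))
    return output


if __name__ == "__main__":
    for token, position in tokens_with_concepts("BRCA1 mutations in cancer", CONCEPTS):
        print(position, token)
```

Under these assumptions, a search on either the surface word or its concept ID would match at the same position in the document, which is what lets concept-based and word-based query clauses be mixed in a single Lucene query.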

Created:
2017-06-05