A controlled vocabulary for research and innovation in the field of Artificial Intelligence (AI)

Name: A controlled vocabulary for research and innovation in the field of Artificial Intelligence (AI)
Creator: Zenodo
Published: 2024-09-04 08:55:21
License: No description available

Zenodo, updated 2024-09-04, indexed 2026-06-04

Download link:

https://zenodo.org/record/5591987

Resource description:

This controlled vocabulary of keywords related to the field of Artificial Intelligence (AI) was built by SIRIS Academic in collaboration with ART-ER (the in-house agency for R&I and sustainable development of the Emilia-Romagna region, Italy) and the Generalitat de Catalunya (the regional government of Catalonia, Spain) in order to identify AI research, development and innovation activities. The work drew on the advice of domain experts and was ultimately applied to inform regional strategies on AI and research and innovation policy.

The aim of the vocabulary is to retrieve texts (e.g. R&D projects and scientific publications) whose titles and abstracts feature the concepts it contains, on the assumption that such records contribute in some measure to applications, techniques and issues in the AI domain.

The effort was undertaken because, despite the large number of contributions and technological developments in the field of AI, there is no closed or static vocabulary of concepts that unequivocally defines the boundaries of what should (or should not) be considered “an Artificial Intelligence intellectual product”. Indeed, the literature offers different definitions of the domain, some of which may contradict one another.

AI today encompasses a wide variety of subdomains, ranging from general-purpose areas such as learning and perception to more specific ones such as autonomous vehicle driving, theorem proving, or industrial process monitoring. AI synthesises and automates intellectual tasks, and is therefore potentially relevant to any area of human intellectual activity. In this sense, it is a genuinely universal and multidisciplinary field. AI draws upon disciplines as diverse as cybernetics, mathematics, philosophy, sociology and economics.
As a basis for constructing the AI controlled vocabulary, an initial set of concepts was taken from different subdomains of the ACM Computing Classification System 2012 to define the boundaries of the AI domain. Notably, some relevant AI subdomains have an independent category in the ACM taxonomy outside of AI; they have nevertheless been included in the list of subdomains. To align the ACM taxonomical definition with the Catalan Strategy of AI, CATALONIA.AI, version 1 of this resource added the emerging area of AI Ethics to the vocabulary and removed some categories that were not relevant to the objectives from the subdomains list.

In the current version 2, the classification and the labels of the subdomains have been revised to reflect the evolution of the field. Some fields have been grouped in order to reduce the overlap between subdomains and to provide a taxonomy better suited to the analysis of R&I ecosystems. The subdomains in each version are as follows:

Version 2: (1) Machine learning and deep learning; (2) Computer Vision; (3) Natural Language Processing and speech recognition; (4) Intelligent agents, planning, scheduling, problem-solving, control methods, and search; (5) Expert Systems, Knowledge representation and reasoning; (6) AI Ethics.

Version 1: (1) General; (2) Machine Learning; (3) Computer Vision; (4) Natural Language Processing; (5) Knowledge Representation and Reasoning; (6) Distributed Artificial Intelligence; (7) Expert Systems, Problem-Solving, Control Methods and Search; (8) AI Ethics.
A keyword rule-based approach has a major shortcoming: it captures neither all the lexical and linguistic variants of a concept nor the context of the words, so relevant texts are missed whenever the specific pattern is not matched during the search. Even so, the present vocabulary yielded fairly good results, owing to the specificity of the concepts that describe the AI domain. Furthermore, an understandable and transparent controlled vocabulary gives better control over the final results and over the final definition of the domain borders. A plain list of terms also makes it much easier for stakeholders with different degrees of knowledge (for instance, domain experts, policy-makers and potential users) to engage interactively, either by using the vocabulary to retrieve pertinent literature or by enriching the resource itself.

The vocabulary was built with the help of advanced language models and resources from knowledge datasets such as arXiv, DBpedia and Wikipedia. The resulting vocabulary comprises 833 keywords and has been validated by experts from several universities in Emilia-Romagna and Catalonia.

Version 0.5 of this resource was developed by SIRIS Academic in 2019 in collaboration with ART-ER, Emilia-Romagna (Quinquillá et al., 2020). Version 1 resulted from an update carried out in 2020 in collaboration with the Generalitat de Catalunya. The current version (version 2) resulted in 2021 from the collaboration with ART-ER and the integration of an additional set of keywords provided by the Artificial Intelligence and Intelligent Systems (AIIS) Laboratory of CINI (Consorzio interuniversitario nazionale per l’informatica, based in Rome, Italy).
The methodology for constructing the controlled vocabulary consists of the following steps:

1. An initial set of scientific publications was collected as a weakly supervised dataset in the domain of Artificial Intelligence (weakly supervised in the sense that records are linked to AI by their taxonomy and not by a manual label). It comprises the following records:
   - Publications from Scopus with the keyword “Artificial Intelligence”
   - Publications from arXiv in the category “Artificial Intelligence”
   - Publications in relevant journals in the scientific domain of “Artificial Intelligence”
2. An automated algorithm was used to retrieve, from the DBpedia APIs, a series of terms that have a categorical relationship (i.e. those indexed as “sub-categories of” or “equivalent to”, among other DBpedia relations) with the Artificial Intelligence concept and with the AI categories in the ACM taxonomy. The DBpedia tree was explored down to level 3. Relevant categories were selected manually (for instance: Classification algorithms, Machine learning or Evolutionary computation), while others were ignored (for instance: Artificial intelligence in fiction, Robots or History of artificial intelligence) because they were not relevant or not specific to the domain.
3. Keywords were extracted from the keyword sections and the abstracts of the publications in the dataset. The keywords with the highest TF-IDF scores, computed with an IDF matrix from the open domain, were selected. The co-occurrence of keywords with the categories of specific AI subdomains, together with a clustering of the main keywords, was used to categorize the keywords at the thematic level.
4. This list of keywords tagged by thematic category was manually revised, removing non-pertinent keywords and correcting wrong categorizations.
5. The weakly supervised dataset in the domain of Artificial Intelligence was used to train a Word2Vec (Mikolov et al., 2013) word embedding model (a machine learning model based on neural networks).
6. The list of terms was then enriched by two automatic methods run in parallel:
   - The trained Word2Vec model was used to select, among the indexed keywords of the reference corpus, all terms “semantically close” to the initial set of words. This step selects terms that might not appear in the texts themselves but that were deemed pertinent to label the textual records.
   - Terms that are mentioned in the texts of the reference corpus and that the trained Word2Vec model rates as “semantically close” to the initial set of words were also retained. This step is performed to include in the controlled vocabulary a series of terms that are related to the focus of the SDGs and which are used by practitioners.
7. The final list produced by steps 2-6 was manually revised.

Defining the vocabulary does not, by itself, identify science, technology and innovation (STI) contributions to AI: that task comes down to matching the terms in the controlled vocabulary against the content of the gathered STI textual records. To carry it out successfully, a series of pattern matching rules must be defined to capture possible variants of the same concept, such as permutations of the words within the concept and/or the presence of null words to be skipped. For this reason, we carefully crafted matching rules that take permutations of words into account and that allow the words of a concept to lie within a certain distance of one another.

Some relatively ambiguous keywords (which may match unwanted pieces of text) have a set of associated “extra” terms. These “extra” terms are further terms that must co-occur, in the same sentence, with their associated ambiguous keyword.
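The matching rules described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names `concept_pattern` and `matches`, the `max_gap` parameter, the naive sentence splitter, and the example keywords are all hypothetical, and the source does not say whether one or all of the “extra” terms must co-occur (the sketch assumes at least one).

```python
import re
from itertools import permutations


def concept_pattern(concept, max_gap=2):
    """Compile a regex that matches the words of `concept` in any order,
    allowing up to `max_gap` other words between consecutive concept words.

    Assumption: `max_gap` stands in for the "certain distance" mentioned in
    the text; the actual distance used by the authors is not stated.
    """
    words = [re.escape(w) for w in concept.lower().split()]
    if not words:
        raise ValueError("concept must contain at least one word")
    # Lazily skip 0..max_gap intervening words, then the separator itself.
    gap = r"(?:\W+\w+){0,%d}?\W+" % max_gap
    # One alternative per word order; factorial growth, fine for short concepts.
    alternatives = [gap.join(order) for order in permutations(words)]
    return re.compile(r"\b(?:" + "|".join(alternatives) + r")\b", re.IGNORECASE)


def matches(text, concept, extra_terms=(), max_gap=2):
    """Return True if `concept` occurs in some sentence of `text`.

    If `extra_terms` is given, at least one of them must appear in the same
    sentence as the concept (an assumption; "all of them" is equally plausible).
    """
    pattern = concept_pattern(concept, max_gap)
    # Naive sentence splitter: break after '.', '!' or '?' followed by spaces.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        if not pattern.search(sentence):
            continue
        if not extra_terms:
            return True
        for term in extra_terms:
            if re.search(r"\b" + re.escape(term) + r"\b", sentence, re.IGNORECASE):
                return True
    return False


# Illustrative keywords only; they are not taken from the published vocabulary.
print(matches("A network of deep neural units.", "neural network"))
print(matches("Urban planning reform was discussed. The robot moved.",
              "planning", extra_terms=("robot",)))
```

The sketch matches exact word forms only, so inflected variants such as plurals would need stemming or an optional suffix in the pattern.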
Finally, each keyword in the vocabulary was assigned to one or more AI subdomains, so that the vocabulary can also be used to tag collections of texts within narrower AI subdomains. To complement the alignment between keywords and subdomains, a set of subdomain-specific keywords was defined to better capture the scope of each subdomain. These keywords give a better characterization of subdomains that are difficult to define by means of unambiguous specific concepts alone, or that overlap with the wide “machine learning” subdomain (for example, machine learning applied to object recognition or text translation). The alignment between keywords and subdomains, together with the keyword list of each subdomain, was applied to capture AI subdomains in research outputs.

Through this classification process, we identified projects and publications related to AI, with a focus on mapping the research competencies in the AI domain in Emilia-Romagna. Because some false positives occurred, the resulting research records were reviewed by domain experts, and the false positives were used to improve the approach.

The final controlled vocabulary was evaluated on an external test set proposed by Dunham et al. (2020). The test set consists of the abstracts of 10,606 papers published in the arXiv repository, of which 1,076 fall within the Artificial Intelligence subcategories and 9,530 in arXiv categories other than Artificial Intelligence. On this dataset, the controlled vocabulary reaches an accuracy of 0.94. However, because the pertinence of these publications to the field of AI is based solely on their taxonomic classification (i.e. on whether arXiv classifies them within Artificial Intelligence, and not on manual labelling), this evaluation can only yield an indicative performance assessment.
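To put the reported accuracy in context, the short sketch below computes standard classification metrics from confusion-matrix counts and applies them to a trivial baseline that labels every paper as “not AI”, using only the class sizes stated above. The `scores` helper is hypothetical, and the source reports accuracy only: the vocabulary's own precision, recall and F-measure on this test set are not given and are not reproduced here.

```python
def scores(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Class sizes of the Dunham et al. (2020) test set, as reported in the text.
AI_PAPERS = 1076
OTHER_PAPERS = 9530

# Baseline that never predicts "AI": every AI paper is a false negative.
baseline = scores(tp=0, fp=0, fn=AI_PAPERS, tn=OTHER_PAPERS)
print(round(baseline["accuracy"], 3))  # 0.899
print(baseline["recall"])              # 0.0
```

Because about 90% of the test set lies outside the AI categories, accuracy alone says little about how many AI papers are actually retrieved, which is one more reason to read the 0.94 figure as indicative.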
Version 2 includes new keywords obtained from (1) re-running the enrichment pipeline (steps 5-6 in the methodology) on a reference corpus of new publications, taking version 1 of the vocabulary as the initial set of terms, and (2) the flat keyword list provided by the Artificial Intelligence and Intelligent Systems (AIIS) Lab of CINI (Consorzio interuniversitario nazionale per l’informatica). The keywords in (2) were cleaned by calculating precision and F-measure on the dataset of Dunham et al. (2020), selecting the keywords with the highest scores, and validating them manually a posteriori.

The AI controlled vocabulary has been applied in two practical cases whose purpose was to identify the skills, stakeholders and capabilities of a specific research ecosystem at the regional level. See Quinquillá et al. (2020) and Bigas et al. (2021).

Acknowledgements

Tatiana Fernández (Direcció General de Promoció Econòmica, Competència i Regulació, de la Generalitat de Catalunya)
Daniel Marco, Daniel Santanach and Eduard Balbuena (Departament de Polítiques Digitals i Administració Pública, de la Generalitat de Catalunya)
Albert Sabater (Observatori d’Ètica en Intel·ligència Artificial i Universitat de Girona)
Leda Bologni, Lucia Mazzoni and Giorgio Moretti (ART-ER)
Prof. Rita Cucchiara and Dr. Lorenzo Baraldi (Università degli Studi di Modena e Reggio Emilia)
Artificial Intelligence and Intelligent Systems (AIIS) Lab of CINI (Consorzio interuniversitario nazionale per l’informatica)

Bibliography

Bigas, E., Duran, N., Fuster, E., Parra, C., Fernández, T. (2021). “Anàlisi de l’especialització en intel·ligència artificial”. Col·lecció Monitoratge de la RIS3CAT, Generalitat de Catalunya. http://catalunya2020.gencat.cat/web/.content/00_catalunya2020/Documents/estrategies/fitxers/analisi-especialitzacio-intelligencia-artificial.pdf

Dunham, J.W., Melot, J., & Murdick, D. (2020). Identifying the Development and Application of Artificial Intelligence in Scientific Text. ArXiv, abs/2002.07143. Available at: https://arxiv.org/abs/2002.07143

Mikolov, Tomas, Corrado, G. S., Chen, Kai, & Dean, Jeffrey. (2013). Efficient Estimation of Word Representations in Vector Space. 1-12.

Quinquillá, Arnau, Duran-Silva, Nicolau, Massucci, Francesco Alessandro, Fuster, Enric, Rondelli, Bernardo, Bologni, Leda, … Moretti, Giorgio. (2020). Text mining to identify skills, stakeholders and capabilities: the case of Artificial Intelligence in Emilia-Romagna. Zenodo. http://doi.org/10.5281/zenodo.3606342. Poster presented at: World Open Innovation Conference 2019 (WOIC); 11 December 2019, Rome, Italy.

Provider:

Zenodo

Created:

2022-04-27