Data cleaning and enrichment through data integration: networking the Italian academia
Source: DataONE; updated 2025-02-24; indexed 2025-04-26
Download link:
https://search.dataone.org/view/sha256:b583b4db2874926c7b8d8bad19da36c9a4021fea18d77573f228fad5e332f0ff
Resource description:
We describe a bibliometric network characterizing co-authorship collaborations in the entire Italian academic community. The network, consisting of 38,220 nodes and 507,050 edges, is built upon two distinct data sources: faculty information provided by the Italian Ministry of University and Research and publications available in Semantic Scholar.
Both nodes and edges are associated with a large variety of semantic data, including gender, bibliometric indexes, authors' and publications' research fields, and temporal information. Although linking the two original sources posed many challenges, the network has been carefully validated to assess its reliability and to understand its graph-theoretic characteristics. Because it exhibits several features typical of social networks, the dataset can be used both in experimental studies across the wider field of social network analytics and in more specific bibliometric contexts.
The proposed network is built starting from two distinct data sources:
* the entire dataset dump from Semantic Scholar (with particular emphasis on the authors and papers datasets);
* the entire list of Italian faculty members as maintained by Cineca (under appointment by the Italian Ministry of University and Research).
Using a custom name-identity recognition algorithm (details are available in the accompanying paper published in Scientific Data), the author names in the Semantic Scholar dataset were matched against the names in the Cineca dataset, and authors with no match (e.g., because they are not affiliated with an Italian university) were discarded. The remaining authors make up the nodes of the network, which have been enriched with node-related (i.e., author-related) attributes.
In order to build the network edges, we leveraged the papers dataset from Semantic Scholar: specifically, any two authors are said to be connected if there is at least one pap...
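The edge rule in the methods description can be illustrated with a minimal Python sketch. It assumes, in line with the co-authorship framing of the abstract, that two retained authors are linked when they share at least one paper; the source sentence is truncated, so the exact rule and any edge attributes are not reproduced here. The function name `build_coauthorship_edges`, the paper records, and the author identifiers are hypothetical and are not taken from the Semantic Scholar schema.

```python
from collections import Counter
from itertools import combinations


def build_coauthorship_edges(papers, known_authors):
    """Count shared papers for every pair of retained authors.

    papers: iterable of author-ID lists, one list per paper (hypothetical layout).
    known_authors: set of author IDs kept as network nodes.
    Returns a Counter mapping (author_a, author_b) -> number of shared papers,
    with each pair stored in sorted order so the edge is undirected.
    """
    edges = Counter()
    for author_ids in papers:
        # Keep only authors that survived the matching step; drop duplicates.
        kept = sorted(set(author_ids) & known_authors)
        # Every unordered pair of co-authors on this paper shares one paper.
        for pair in combinations(kept, 2):
            edges[pair] += 1
    return edges


# Toy input: three papers, one author ("x9") outside the node set.
papers = [["a1", "a2", "a3"], ["a2", "a1"], ["a3", "x9"]]
nodes = {"a1", "a2", "a3"}
edges = build_coauthorship_edges(papers, nodes)
```

In this sketch an edge exists whenever its count is at least one; the counts themselves are only a convenience of the example and are not claimed to match the attributes stored in the published network.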
[https://doi.org/10.5061/dryad.wpzgmsbwj](https://doi.org/10.5061/dryad.wpzgmsbwj)
## Description of the data and file structure
This repository contains two main data files:
* `edge_data_AGG.csv`, the full network in comma-separated edge list format (this file contains mainly temporal co-authorship information);
* `Coauthorship_Network_AGG.graphml`, the full network in GraphML format.
Several supplementary files, listed below, are also included; they are useful only for rebuilding the network (i.e., for reproducibility):
* `University-City-match.xlsx`, an Excel file that maps each university name to the city where its headquarters is located;
* `Areas-SS-CINECA-match.xlsx`, an Excel file that maps the research areas in Cineca to the research areas in Semantic Scholar.
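As a quick sanity check on the GraphML file, the following minimal Python sketch counts nodes and edges and lists the declared attribute names using only the standard library. The helper name `summarize_graphml` is hypothetical, and no attribute names are assumed: they are read from the file's own `<key>` declarations. The sketch also assumes the file uses the standard GraphML XML namespace.

```python
import xml.etree.ElementTree as ET

# Standard GraphML XML namespace, in ElementTree's "{uri}tag" notation.
GRAPHML_NS = "{http://graphml.graphdrawing.org/xmlns}"


def summarize_graphml(path):
    """Return node count, edge count, and declared attribute names."""
    nodes = edges = 0
    attr_names = []
    # iterparse processes elements as they are read from the file.
    for _, elem in ET.iterparse(path, events=("end",)):
        if elem.tag == GRAPHML_NS + "node":
            nodes += 1
            elem.clear()  # drop the node's children to limit memory use
        elif elem.tag == GRAPHML_NS + "edge":
            edges += 1
            elem.clear()
        elif elem.tag == GRAPHML_NS + "key":
            # Each <key> declares one node or edge attribute.
            attr_names.append(elem.get("attr.name"))
    return {"nodes": nodes, "edges": edges, "attributes": attr_names}
```

Calling `summarize_graphml("Coauthorship_Network_AGG.graphml")` on the downloaded file should, if the file matches the resource description, report 38,220 nodes and 507,050 edges. Dedicated graph libraries provide fuller loaders; this sketch is meant only for a first inspection.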
### Description of the main data files
The `Coauthorship_Network_AGG.graphml` is intended to be the core file which c...
Created: 2025-02-26



