Core bibliometric Covid19 and comparable research dataset and code for the study "From intent to impact: Investigating the effects of open sharing commitments"
Source: Mendeley Data. Updated: 2024-05-17. Indexed: 2024-06-28.
Download link:
https://zenodo.org/records/6582759
Resource description:
This document provides the underlying dataset for the bibliometric component of the 2022 study "From intent to impact: Investigating the effects of open sharing commitments" by Research Consulting and Science-Metrix. Before reproducing the study findings or re-using the underlying datasets for other purposes, please carefully review their limitations in the study's technical annex and main report, available at: https://zenodo.org/communities/data-sharing-in-public-health-emergencies/

In particular, note that there is an error rate in the attribution of signatory status to journal publications and preprints; in their assignment to specific disease-based thematic areas; and in the computation of dimensions such as identification of data availability statement sections, identification of data deposition mentions within those sections, and matching of preprints and journal publications. These error rates are expected and have been estimated; please consult the technical report for full details.

The data fields are defined below (column name: definition):

1. document_type: Preprint or journal publication.
2. doi: Digital object identifier.
3. arxiv_id: The arXiv preprint server's unique identifier for its preprints.
4. ssrn_id: The SSRN preprint server's unique identifier for its preprints. Note that some of these IDs are also contained within the DOIs assigned to some (but not all) SSRN preprints, in the form "10.2139/ssrn." + 'ssrn_id'.
5. coalesce_id: Coalesce function applied to the doi, arxiv_id and ssrn_id. Redundant for journal publications.
6. preprint_server: Preprint platform on which a preprint has been published, restricted to arXiv, bioRxiv, medRxiv and SSRN for this study.
7. journal_title: Name of the publishing journal, in the case of a journal publication.
8. year: The set is restricted to 2020 and 2021 for Covid19 preprints and journal publications. HVRD journal publications are restricted to 2018-2019. HVRD preprints were restricted to 2020-2021 instead, to compensate for the lack of year normalization for preprints and, more generally, to better control findings against the launch of medRxiv in 2019.
9. publication_title: Title of the individual journal publication or preprint, not that of the publishing journal or preprint server.
10. authors: First 100 researchers that appear as authors of a preprint or journal publication. These are not parsed, and are provided for qualitative validation or assessment rather than for further quantitative treatment.
11. Covid19: Journal publications or preprints are coded 1 if they have been identified as falling into this thematic area through our queries (see the technical annex), 0 otherwise.
12. HVRD: Human viral respiratory disease, the thematic area considered to be the closest to Covid19. Journal publications or preprints are coded 1 if they have been identified as falling into this thematic area through our queries (see the technical annex), 0 otherwise.
13. Journal_sig: Journal publications where the publishing journal and/or its publishing house are Joint Statement signatories. Coded 1 if signatory, 0 if not signatory, null if status could not be determined due to insufficient metadata. Note that all preprint servers included in this study are Joint Statement signatories. This category was removed entirely from the models for preprints, rather than all preprints being automatically assigned signatory status.
14. RPO_sig: Journal publications and preprints where at least one author is affiliated with at least one research performing organization that is a Joint Statement signatory. Coded 1 if signatory, 0 if not signatory, null if status could not be determined due to insufficient metadata.
15. Funder_sig: Journal publications and preprints where at least one funder supporting the research is a Joint Statement signatory. Coded 1 if signatory, 0 if not signatory, null if status could not be determined due to insufficient metadata. Although funding is attributed to researchers rather than publications, funding metadata is more readily available at the publication level. This approach also captures the flexible use of financial resources that researchers may make across multiple concurrent research projects.
16. overton_norm: Year- and subfield-normalized binary score of whether the journal publication has been cited by one or more policy-related documents from the Overton database. Null for journal publications not covered by the database.
17. overton: Because normalization is not possible for preprints, a binary score of whether the preprint has been cited by one or more policy-related documents from the Overton database. Null for preprints not covered by the database.
18. daswriting_binary: Binary score capturing identification of a data availability statement in the journal publication or preprint using the queries presented in the technical annex. Null for publications and preprints where full-text records were unavailable for text mining, or where this analysis could not be performed due to licensing restrictions.
19. deposition_binary: Binary score capturing identification of a data availability statement and a data deposition mention therein in the journal publication or preprint using the queries presented in the technical annex. Null for publications and preprints where full-text records were unavailable for text mining, or where this analysis could not be performed due to licensing restrictions.
20. is_oa: Binary score capturing the OA or free-to-read (the so-called "bronze OA" and "green OA") status of journal publications. Unpaywall categories have been used in a mutually exclusive implementation, with the best applicable category (gold > hybrid > bronze > green) being retained. Null for journal publications not covered in our Unpaywall dataset. A score of 0 denotes a journal publication not available under any OA or free-to-read category.
21. is_gold: As above.
22. is_hybrid: As above.
23. is_bronze: As above.
24. is_green: As above.
25. matched_journal_binary: For preprints, whether one or more matching journal publications could be identified using the queries identified in the technical annex, or preprint servers' own lists of preprint-journal publication matches. Null for preprints with insufficient metadata to perform the matching operation.
26. matched_journal_doi: For those preprints with one or more matching journal publications, the DOI(s) of the matching journal publication(s). Note that some of the matching journal publications identified do not have DOIs.
27. matched_preprint_binary: For journal publications, whether one or more matching preceding preprints could be identified using the queries identified in the technical annex, or preprint servers' own lists of preprint-journal publication matches. Null for journal publications without sufficient metadata to run the analysis.
28. matched_preprint_id: For those journal publications preceded by one or more arXiv, bioRxiv, medRxiv or SSRN preprints, the DOI(s), arXiv ID and/or SSRN ID of the matching preprint(s).
29. hasdoi: Only journal publications with DOIs were retained in the core quantitative analyses.
30. hasacknowledgements: Only journal publications with funding acknowledgements (used to determine funding-based signatory status) were retained in the core quantitative analyses.
31. funder_array: Array (cast as string) of the names of the funders on the basis of whose identification signatory status has been attributed, where relevant. Null if non-signatory or unknown signatory status.
32. RPO_array: Array (cast as string) of the names of the research performing organizations on the basis of whose identification signatory status has been attributed, where relevant. Null if non-signatory or unknown signatory status.
33. DAS_excerpt: Journal publication or preprint text excerpt on which successful identification of data availability statements and/or data deposition mentions has been made. Null both where the query could not be run at all and where the query was negative.
34. big5: Journal publication published in a journal owned by one of the following five publishing houses: Elsevier, Sage, Springer Nature, Taylor-Francis, Wiley.
35. LMIC: Journal publication whose authors include at least one researcher affiliated with at least one institution located in a lower-middle income country as defined by the World Bank.
36. LIC: Journal publication whose authors include at least one researcher affiliated with at least one institution located in a low income country as defined by the World Bank.
37. SouthNorth: Journal publication whose authors include at least one researcher affiliated with at least one institution located in an upper-middle income country, a lower-middle income country, or a low income country as defined by the World Bank, as well as at least one researcher affiliated with at least one institution located in a high income country. For the purpose of this indicator, Science-Metrix exceptionally includes China and Bulgaria in the list of high income countries.
38. DID_allauthors_OR: Journal publication is included in the difference-in-difference model defining a signatory publication as holding EITHER journal-based signatory status OR funding-based signatory status, and where no filter has been applied to control for author-level biases.
39. DID_authorcontrol_OR: Journal publication is included in the difference-in-difference model defining a signatory publication as holding EITHER journal-based signatory status OR funding-based signatory status, and where a filter has been applied to control for author-level biases.
40. DID_authorcontrol_AND: Journal publication is included in the difference-in-difference model defining a signatory publication as holding journal-based signatory status AND funding-based signatory status, and where a filter has been applied to control for author-level biases.
41. DID_allauthors_AND: Journal publication is included in the difference-in-difference model defining a signatory publication as holding journal-based signatory status AND funding-based signatory status, and where no filter has been applied to control for author-level biases.
42. Preprint_authorcontrol: Preprint is included in the analytical breakdowns where a filter has been applied to control for author-level biases. Note that authors have been kept constant in preprints on the basis of their belonging to all analytical breakdowns in journal publications rather than in preprint-based groups.
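The ssrn_id and coalesce_id definitions can be illustrated with a minimal sketch. The helper names below are hypothetical, and the coalesce order (doi, then arxiv_id, then ssrn_id) is an assumption taken from the order in which the fields are listed; the study's own code may differ.

```python
from typing import Optional

# DOI prefix documented for some (but not all) SSRN preprints.
SSRN_DOI_PREFIX = "10.2139/ssrn."


def coalesce_id(doi: Optional[str],
                arxiv_id: Optional[str],
                ssrn_id: Optional[str]) -> Optional[str]:
    """Return the first non-empty identifier (assumed order: doi, arxiv_id, ssrn_id)."""
    for value in (doi, arxiv_id, ssrn_id):
        if value is not None and str(value).strip() != "":
            return str(value).strip()
    return None


def ssrn_doi(ssrn_id: str) -> str:
    """Build the DOI form "10.2139/ssrn." + ssrn_id described for ssrn_id."""
    return SSRN_DOI_PREFIX + str(ssrn_id)


def ssrn_id_from_doi(doi: Optional[str]) -> Optional[str]:
    """Recover the SSRN ID from a DOI of that form, or None if it does not match."""
    if doi and doi.lower().startswith(SSRN_DOI_PREFIX):
        return doi[len(SSRN_DOI_PREFIX):]
    return None
```

Because not every SSRN preprint has such a DOI, a lookup by ssrn_id should not assume that the constructed DOI exists in the doi column.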
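The mutually exclusive coding described for is_oa and the related is_gold, is_hybrid, is_bronze and is_green fields can be sketched as follows. This is only an illustration of the stated priority (gold > hybrid > bronze > green): the function name and the input representation (a set of applicable category names, or None when the publication is not covered) are assumptions, not the study's implementation.

```python
from typing import Dict, Iterable, Optional

# Priority order stated in the is_oa definition: best category first.
OA_PRIORITY = ("gold", "hybrid", "bronze", "green")


def oa_flags(applicable: Optional[Iterable[str]]) -> Dict[str, Optional[int]]:
    """Keep only the best applicable category and derive the five binary fields.

    `applicable` is a hypothetical input: the category names that apply to one
    journal publication, or None if the publication is not covered.
    """
    columns = ["is_oa"] + [f"is_{name}" for name in OA_PRIORITY]
    if applicable is None:
        # Not covered by the OA data source: every field is null.
        return {column: None for column in columns}

    found = set(applicable)
    best = next((name for name in OA_PRIORITY if name in found), None)

    flags: Dict[str, Optional[int]] = {
        f"is_{name}": int(name == best) for name in OA_PRIORITY
    }
    # is_oa is 1 when any category was retained, 0 when none applies.
    flags["is_oa"] = int(best is not None)
    return flags
```

Under this coding at most one of the four category fields equals 1 for a given publication, so their sum equals is_oa whenever the fields are not null.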
Created:
2023-06-28



