Data and code for: The Contributor Role Taxonomy (CRediT) at ten: a retrospective analysis of the diversity of contributions to published research output
Source: Figshare | Updated: 2025-12-02 | Indexed: 2026-04-08
Download link:
https://figshare.com/articles/dataset/Data_and_code_for_The_Contributor_Role_Taxonomy_CRediT_at_ten_a_retrospective_analysis_of_the_diversity_ofcontributions_to_published_research_output/28816703/1
Resource description:
## About this notebook

This notebook was created using a helper script, Base.ipynb. The script provides helper functions that push output directly to Datawrapper to generate the graphs included in the opinion piece. To run the notebook with BigQuery alone, without the helper functions, first run

    !pip install google-cloud-bigquery

then add:

    from google.cloud import bigquery  # needed for bigquery.Client below
    from google.cloud.bigquery import magics

    project_id = "your_project"  # update as needed
    magics.context.project = project_id
    bq_params = {}

    client = bigquery.Client(project=project_id)
    %load_ext google.cloud.bigquery

Finally, comment out the make_chart lines.

### About dimensions-ai-integrity.ds_dp_pipeline_ripeta_staging.trust_markers_raw

dimensions-ai-integrity.ds_dp_pipeline_ripeta_staging.trust_markers_raw is an internal table produced by running a process over the text of publications to identify trust marker segments, including author contributions. The dataset was created on March 25th 2025.

The process works as follows:

The process aims to automatically segment research papers into their constituent sections. It operates by identifying headings within the text based on a pre-defined set of patterns and a rule-based system. The system first cleans and normalizes the input text. It then employs regular expressions to detect potential section headings. These potential headings are validated against a set of rules that consider factors such as capitalization, the context of surrounding words, and the typical order of sections within a research paper (e.g., certain sections not appearing after "References" or before "Abstract"). Specific rules also handle exceptions for particular heading types like "Keywords" or "Appendices." Once valid headings are identified, the system extracts the corresponding textual content for each section. The output is a structured representation of the paper, categorizing text segments under their respective heading types. Any text that doesn't fall under a recognized heading is also identified as unlabeled content. The overall process aims to provide a structured understanding of the document's organization for subsequent analysis.

Author Contributions segments are identified using the following regex:

    "author_contributions": ["((credit|descript(ion(?:s)?|ive)| )*author(s|'s|ship|s')?( |contribution(?:s)?|statement(?:s)?|role(?:s)?){2,})","contribution(?:s)"]

Access to dimensions-ai-integrity.ds_dp_pipeline_ripeta_staging.trust_markers_raw is available to peer reviewers of the opinion piece.

Datasets that allow external validation of the CRediT ontology identification process have also been produced.
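
To illustrate how the two author-contributions patterns quoted in the description behave, here is a minimal Python sketch that applies them to a few candidate headings. Only the two pattern strings come from the description. The function name `is_author_contributions_heading`, the lowercasing step, and the use of `re.search` are assumptions made for illustration; the description says only that text is cleaned and normalized before matching, and does not state how the patterns are applied in the actual pipeline.

```python
import re

# The two patterns below are copied verbatim from the dataset description.
AUTHOR_CONTRIBUTION_PATTERNS = [
    r"((credit|descript(ion(?:s)?|ive)| )*author(s|'s|ship|s')?( |contribution(?:s)?|statement(?:s)?|role(?:s)?){2,})",
    r"contribution(?:s)",
]

# Assumption: patterns are compiled once and matched anywhere in the heading.
_COMPILED = [re.compile(p) for p in AUTHOR_CONTRIBUTION_PATTERNS]


def is_author_contributions_heading(heading: str) -> bool:
    """Return True if a candidate heading matches either pattern.

    Hypothetical helper: lowercasing stands in for the unspecified
    "clean and normalize" step mentioned in the description.
    """
    normalized = heading.strip().lower()
    return any(p.search(normalized) for p in _COMPILED)


if __name__ == "__main__":
    candidates = [
        "Author Contributions",
        "CRediT authorship contribution statement",
        "Authors' contributions",
        "Contributions",
        "Author information",
        "Acknowledgements",
    ]
    for heading in candidates:
        print(f"{heading!r}: {is_author_contributions_heading(heading)}")
```

Under these assumptions, the first pattern needs the word "author" followed by at least two tokens drawn from a space, "contribution(s)", "statement(s)", or "role(s)", so "Author information" is not matched. The second pattern as written requires the plural "contributions".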
Provided by:
Whittam, Ruth; Allen, Liz; Kiermer, Veronique; Porter, Simon
Created:
2025-11-19



