Data and code for: The Contributor Role Taxonomy (CRediT) at ten: a retrospective analysis of the diversity of contributions to published research output

Name: Data and code for: The Contributor Role Taxonomy (CRediT) at ten: a retrospective analysis of the diversity of contributions to published research output
Creator: Whittam, Ruth; Allen, Liz; Kiermer, Veronique; Porter, Simon
Published: 2025-12-02 00:00:00
License: No description available

Figshare; updated 2025-12-02; indexed 2026-04-08

Download link:

https://figshare.com/articles/dataset/Data_and_code_for_The_Contributor_Role_Taxonomy_CRediT_at_ten_a_retrospective_analysis_of_the_diversity_ofcontributions_to_published_research_output/28816703/1

Resource description:

## About this notebook

This notebook was created using a helper script, Base.ipynb. The script provides helper functions that push output directly to Datawrapper to generate the graphs included in the opinion piece.

To run the notebook with BigQuery alone, without the helper functions, first install the client library:

    !pip install google-cloud-bigquery

then add:

    from google.cloud import bigquery  # needed for bigquery.Client below
    from google.cloud.bigquery import magics

    project_id = "your_project"  # update as needed
    magics.context.project = project_id
    bq_params = {}
    client = bigquery.Client(project=project_id)
    %load_ext google.cloud.bigquery

Finally, comment out the make_chart lines.

### About dimensions-ai-integrity.ds_dp_pipeline_ripeta_staging.trust_markers_raw

dimensions-ai-integrity.ds_dp_pipeline_ripeta_staging.trust_markers_raw is an internal table. It is the result of running a process over the text of publications to identify trust marker segments, including author contributions. The dataset was created on March 25th 2025.

The process automatically segments research papers into their constituent sections. It identifies headings within the text using a pre-defined set of patterns and a rule-based system, as follows:

1. The system cleans and normalizes the input text.
2. It employs regular expressions to detect potential section headings.
3. The potential headings are validated against a set of rules that consider factors such as capitalization, the context of surrounding words, and the typical order of sections within a research paper (e.g., certain sections not appearing after "References" or before "Abstract"). Specific rules also handle exceptions for particular heading types such as "Keywords" or "Appendices."
4. Once valid headings are identified, the system extracts the corresponding textual content for each section.

The output is a structured representation of the paper that categorizes text segments under their respective heading types. Any text that does not fall under a recognized heading is identified as unlabeled content. The overall aim is to provide a structured understanding of the document's organization for subsequent analysis.

Author Contributions segments are identified using the following regex:

    "author_contributions": ["((credit|descript(ion(?:s)?|ive)| )*author(s|'s|ship|s')?( |contribution(?:s)?|statement(?:s)?|role(?:s)?){2,})","contribution(?:s)"]

Access to dimensions-ai-integrity.ds_dp_pipeline_ripeta_staging.trust_markers_raw is available to peer reviewers of the opinion piece. Datasets that allow external validation of the CRediT ontology identification process have also been produced.
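To make the behaviour of the author-contributions patterns easier to inspect, the following minimal Python sketch applies the two quoted expressions to candidate heading strings. The patterns are copied verbatim from the description. Everything else is an assumption made for illustration: the function name `is_author_contributions_heading`, the case-insensitive flag, and the choice to require a full match of the whitespace-stripped heading. The description does not state how the pipeline compiles or anchors these patterns, and the validation rules (capitalization, section order) are not reproduced here.

```python
import re

# Patterns copied verbatim from the dataset description.
AUTHOR_CONTRIBUTION_PATTERNS = [
    r"((credit|descript(ion(?:s)?|ive)| )*author(s|'s|ship|s')?"
    r"( |contribution(?:s)?|statement(?:s)?|role(?:s)?){2,})",
    r"contribution(?:s)",
]

# Assumption: matching is case-insensitive. The source does not say which
# flags the original pipeline uses.
_COMPILED = [re.compile(p, re.IGNORECASE) for p in AUTHOR_CONTRIBUTION_PATTERNS]


def is_author_contributions_heading(heading: str) -> bool:
    """Return True if the stripped heading fully matches either pattern.

    Hypothetical helper: full-string matching is an assumption, chosen so
    that ordinary sentences containing "contributions" are not flagged.
    """
    text = heading.strip()
    return any(p.fullmatch(text) is not None for p in _COMPILED)


if __name__ == "__main__":
    candidates = [
        "Author Contributions",
        "CRediT authorship contribution statement",
        "Authors' contributions",
        "Contributions",
        "Acknowledgements",
    ]
    for heading in candidates:
        print(f"{heading!r}: {is_author_contributions_heading(heading)}")
```

Under these assumptions the first four candidates are accepted and "Acknowledgements" is rejected. Note that the second pattern, as written, requires the plural form, so a bare "Contribution" heading would not match it.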

Provided by:

Whittam, Ruth; Allen, Liz; Kiermer, Veronique; Porter, Simon

Created:

2025-11-19