Citation contexts of [How Many Scientists Fabricate and Falsify Research? A Systematic Review and Meta-Analysis of Survey Data, DOI: 10.1371/journal.pone.0005738]

NIAID Data Ecosystem, indexed 2026-05-02

Download link:

https://zenodo.org/record/14417421


Resource description:

Methods for the construction of the corpus of citation contexts

We used Semantic Scholar (https://www.semanticscholar.org), an academic database encompassing over 200 million scholarly documents from diverse sources including publishers, data providers, and web crawlers. Using the specific paper identifier for Fanelli's 2009 publication (d9db67acc223c9bd9b8c1d4969dc105409c6dfef), we queried the Semantic Scholar API (https://www.semanticscholar.org/product/api) to retrieve available citation contexts. Citation contexts were extracted from the "contexts" field within the JSON response pages (see the technical specifications at https://api.semanticscholar.org/api-docs/#tag/Paper-Data/operation/get_graph_get_paper_citations). The query looks like this:

https://api.semanticscholar.org/graph/v1/paper/d9db67acc223c9bd9b8c1d4969dc105409c6dfef/citations?fields=title,year,publicationVenue,externalIds,contexts,intents,isInfluential,abstract&offset=1&limit=100

The broad coverage of Semantic Scholar does not imply that citation contexts are always available. The Semantic Scholar API provided citation contexts for only 48% of the 1452 documents citing the paper. To obtain more, we identified open access papers among the remaining 52% of citing documents, retrieved the locations of their PDFs, and downloaded the files. We used the Unpaywall API (https://unpaywall.org/products/api), a database that can be queried with a DOI to obtain open access information about a document. The query looks like this:

https://api.unpaywall.org/v2/10.1162/qss_a_00220?email=mail@example.com

We downloaded 266 PDF files and converted them to text format using an online bulk PDF-to-text converter (https://overbits.herokuapp.com/pdftotext/). These files were then processed using TXM (https://txm.gitpages.huma-num.fr/textometrie/en/Presentation/), a specialized textual analysis tool.
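The two queries quoted above can be sketched in Python as follows. This is a minimal illustration, not the authors' script: the function names are hypothetical, and the response layout is an assumption (a "data" list whose items carry "contexts" and "citingPaper" for Semantic Scholar, and a "best_oa_location" object with "url_for_pdf" for Unpaywall). The sketch only builds URLs and parses already-decoded JSON, so it makes no network calls.

```python
from urllib.parse import urlencode

S2_BASE = "https://api.semanticscholar.org/graph/v1/paper"
UNPAYWALL_BASE = "https://api.unpaywall.org/v2"
# Field list copied from the query quoted in the text.
FIELDS = "title,year,publicationVenue,externalIds,contexts,intents,isInfluential,abstract"


def citations_url(paper_id, offset, limit=100):
    """Build the URL for one page of citations of the given paper."""
    query = urlencode({"fields": FIELDS, "offset": offset, "limit": limit}, safe=",")
    return f"{S2_BASE}/{paper_id}/citations?{query}"


def extract_contexts(page):
    """Collect (citing title, context) pairs from one decoded JSON page.

    Assumed layout: page["data"] is a list of items, each with a
    "contexts" list and a "citingPaper" object. Items without any
    context contribute nothing.
    """
    pairs = []
    for item in page.get("data", []):
        title = (item.get("citingPaper") or {}).get("title")
        for context in item.get("contexts") or []:
            pairs.append((title, context))
    return pairs


def unpaywall_url(doi, email):
    """Build the Unpaywall lookup URL for a DOI."""
    query = urlencode({"email": email}, safe="@")
    return f"{UNPAYWALL_BASE}/{doi}?{query}"


def pdf_location(record):
    """Return the PDF URL of the best open access location, or None.

    Assumed layout: record["best_oa_location"]["url_for_pdf"].
    """
    location = record.get("best_oa_location") or {}
    return location.get("url_for_pdf")
```

Calling citations_url with the paper identifier above and increasing offset values would page through the citing documents; a document with an empty "contexts" list is one of those for which a PDF lookup through Unpaywall would be the next step.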
We used TXM's concordancer function with "Fanelli" as the pivot term and checked that each reference was the correct one (the 2009 paper in PLOS ONE). After manual cleaning, we appended these citation contexts to the previous corpus. Through this methodology, we ultimately identified 824 citation contexts, representing 54% (784) of all documents citing Fanelli's 2009 paper: 48% had contexts retrieved from Semantic Scholar, and an additional 6% had contexts obtained through semi-manual extraction from open access documents.

87 of those contexts were excluded from the analysis for a range of reasons, including contexts too short to support a conclusion, a language other than English or French (the shared languages of the authors of this review), and duplicate documents (e.g. preprints), leaving us with 737 contexts. They were first classified manually into two categories: those mentioning the 2% figure and those that did not. Contexts in the first category were then further classified manually into two categories, depending on whether the figure was appropriately attributed to researchers' self-reports or misleadingly presented as applying to research outputs.

File structure

The file is an .xlsx file composed of three sheets. The first sheet, entitled "citcontext (RAW DATA)", includes all information retrieved through the process described above. The second sheet, entitled "Excluded from analysis", lists the 87 records excluded from analysis, with brief descriptions of the reasons for exclusion. The 737 contexts analysed are shown in the third sheet ("Analysis of citcontext"), together with the classifications described above.
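The counts reported in the description can be cross-checked with a few lines of arithmetic. The sketch below uses only the figures stated in the text; the variable and function names are illustrative, and rounding to the nearest whole percent is an assumption about how the percentages were reported.

```python
# Figures as stated in the description.
CITING_DOCUMENTS = 1452        # documents citing the 2009 paper
DOCUMENTS_WITH_CONTEXT = 784   # citing documents with at least one context
CONTEXTS_IDENTIFIED = 824      # citation contexts before exclusions
CONTEXTS_EXCLUDED = 87         # contexts removed from the analysis


def share(part, whole):
    """Percentage of whole, rounded to the nearest integer (assumed convention)."""
    return round(100 * part / whole)


contexts_analysed = CONTEXTS_IDENTIFIED - CONTEXTS_EXCLUDED
coverage = share(DOCUMENTS_WITH_CONTEXT, CITING_DOCUMENTS)

print(contexts_analysed)  # 737
print(coverage)           # 54
```

Both results agree with the description: 737 contexts remain after exclusions, and the 784 documents correspond to 54% of the 1452 citing documents.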

Created:

2024-12-13