Annotated Dataset for Uncertainty Mining : Gold Standard
Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://zenodo.org/record/14134214
Resource overview:
Description of the dataset
To study how uncertainty is expressed in scientific articles, we assembled an interdisciplinary corpus of journals in Science, Technology and Medicine (STM) and in the Humanities and Social Sciences (SHS). Journal selection follows the Scimago Journal and Country Rank (SJR) classification, which is based on Scopus, the largest academic database available online. The selected journals cover various disciplines, such as medicine; biochemistry, genetics and molecular biology; computer science; social sciences; environmental sciences; psychology; and arts and humanities. For each discipline, we selected the five highest-ranked journals. In addition, we included the journals PLoS ONE and Nature, both of which are interdisciplinary and highly ranked.
Based on the corpus of articles from different disciplines described above, we created a set of annotated sentences as follows:
593 sentences were pre-selected automatically by searching for occurrences of the uncertainty cues listed by Bongelli et al. (2019), Chen et al. (2018) and Hyland (1996).
The remaining sentences were extracted from a subset of articles, consisting of two randomly selected articles per journal. Two human annotators examined these articles to identify and annotate sentences containing uncertainty.
600 sentences not expressing scientific uncertainty were manually identified and reviewed by two annotators.
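The automatic pre-selection step can be pictured as a simple cue-matching filter. The sketch below is only an illustration under stated assumptions: `EXAMPLE_CUES` is a hypothetical handful of hedging words, not the actual lists of Bongelli et al. (2019), Chen et al. (2018) or Hyland (1996), and the function names `find_cues` and `preselect` are invented here. The matching procedure actually used for the corpus may differ.

```python
import re

# Hypothetical cue list, for illustration only. The real cue lists come from
# Bongelli et al. (2019), Chen et al. (2018) and Hyland (1996).
EXAMPLE_CUES = ["may", "might", "possibly", "suggest", "appear to"]

# One case-insensitive pattern with word boundaries, so that "may"
# does not match inside "mayor".
CUE_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(cue) for cue in EXAMPLE_CUES) + r")\b",
    re.IGNORECASE,
)


def find_cues(sentence: str) -> list[str]:
    """Return the cues found in a sentence, lowercased, in order of appearance."""
    return [match.group(0).lower() for match in CUE_PATTERN.finditer(sentence)]


def preselect(sentences: list[str]) -> list[str]:
    """Keep only the sentences that contain at least one cue."""
    return [sentence for sentence in sentences if find_cues(sentence)]


candidates = preselect([
    "These results may reflect a sampling artefact.",
    "The mayor opened the conference.",
    "We measured the temperature twice.",
])
```

A filter of this kind only proposes candidates. A cue word can appear in a sentence that expresses no scientific uncertainty, which is why manual annotation remains necessary.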
The sentences were annotated by two independent annotators following the annotation guide proposed by Ningrum and Atanassova (2024). The annotators were trained using the annotation guide and previously annotated sentences to ensure consistent annotations. Each sentence was annotated as expressing or not expressing uncertainty (Uncertainty or No Uncertainty). Sentences expressing uncertainty were then annotated along five dimensions: Reference, Nature, Context, Timeline and Expression. The annotators reached an average Cohen's kappa of 0.414, which reflects the difficulty of annotating scientific uncertainty. Finally, conflicting annotations were resolved by a third independent annotator.
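For readers unfamiliar with the agreement measure, Cohen's kappa compares the observed agreement p_o between two annotators with the agreement p_e expected by chance: kappa = (p_o - p_e) / (1 - p_e). The following is a minimal sketch with made-up toy labels. The function name `cohen_kappa` is chosen here, and the source does not specify how the reported average of 0.414 was aggregated.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty item set.")
    n = len(labels_a)

    # Observed agreement: share of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of the two annotators' label frequencies,
    # summed over all labels.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if p_e == 1.0:
        # Both annotators used a single identical label everywhere.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Toy labels (1 = Uncertainty, 0 = No Uncertainty), invented for illustration.
annotator_1 = [1, 1, 0, 0, 1, 0, 1, 0]
annotator_2 = [1, 0, 0, 0, 1, 1, 1, 0]
kappa = cohen_kappa(annotator_1, annotator_2)  # 0.5 for this toy data
```

In the toy data the annotators agree on 6 of 8 items (p_o = 0.75), while chance agreement is 0.5, which gives a kappa of 0.5.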
Our final corpus thus consists of a total of 1,840 sentences from 496 articles in 21 English-language journals from 8 different disciplines.
The columns of the table are as follows:
journal: name of the journal from where the article originates
article_title: title of the article from where the sentence is extracted
publication_year: year of publication of the article
sentence_text: text of the sentence expressing or not expressing uncertainty
uncertainty: 1 if the sentence expresses uncertainty and 0 otherwise
ref, nature, context, timeline, expression: annotations of the type of uncertainty according to the annotation framework proposed by Ningrum and Atanassova (2023). The annotations of each dimension are stored in numeric rather than textual format. The mapping between textual and numeric labels is presented in the table below.
Dimension    1            2              3          4            5
Reference    Author       Former         Both
Nature       Epistemic    Aleatory       Both
Context      Background   Methods        Res&Disc   Conclusion   Others
Timeline     Past         Present        Future
Expression   Quantified   Unquantified
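The mapping above can be applied programmatically to turn numeric codes back into textual labels. The sketch below assumes that each record is available as a dictionary keyed by the column names listed earlier (for example as produced by Python's `csv.DictReader`) and that dimension values may be empty for sentences without uncertainty. Both points are assumptions, as are the names `LABELS` and `decode_row` and the values in the example row.

```python
# Numeric-to-text mapping, transcribed from the table above.
LABELS = {
    "ref": {1: "Author", 2: "Former", 3: "Both"},
    "nature": {1: "Epistemic", 2: "Aleatory", 3: "Both"},
    "context": {1: "Background", 2: "Methods", 3: "Res&Disc",
                4: "Conclusion", 5: "Others"},
    "timeline": {1: "Past", 2: "Present", 3: "Future"},
    "expression": {1: "Quantified", 2: "Unquantified"},
}


def decode_row(row):
    """Return a copy of row with the five dimension codes replaced by labels.

    Missing, empty or unknown codes are decoded as None.
    """
    decoded = dict(row)
    for column, mapping in LABELS.items():
        raw = row.get(column)
        try:
            # Accept 3, "3" or "3.0"; reject "", None and NaN.
            code = int(float(raw))
        except (TypeError, ValueError, OverflowError):
            decoded[column] = None
            continue
        decoded[column] = mapping.get(code)
    return decoded


# Made-up example row, not taken from the dataset.
example = {"uncertainty": "1", "ref": "1", "nature": "1",
           "context": "3", "timeline": "2", "expression": "2"}
decoded = decode_row(example)
```

Keeping the mapping in a single dictionary makes it straightforward to check codes against the table and to flag values outside the documented range.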
This gold standard has been produced as part of the ANR InSciM (Modelling Uncertainty in Science) project.
References
Bongelli, R., Riccioni, I., Burro, R., & Zuczkowski, A. (2019). Writers’ uncertainty in scientific and popular biomedical articles. A comparative analysis of the British Medical Journal and Discover Magazine. PLoS ONE, 14(9). https://doi.org/10.1371/journal.pone.0221933
Chen, C., Song, M., & Heo, G. E. (2018). A scalable and adaptive method for finding semantically equivalent cue words of uncertainty. Journal of Informetrics, 12(1), 158–180. https://doi.org/10.1016/j.joi.2017.12.004
Hyland, K. E. (1996). Talking to the academy: Forms of hedging in science research articles. Written Communication, 13(2), 251–281. https://doi.org/10.1177/0741088396013002004
Ningrum, P. K., & Atanassova, I. (2023). Scientific Uncertainty: An Annotation Framework and Corpus Study in Different Disciplines. 19th International Conference of the International Society for Scientometrics and Informetrics (ISSI 2023). https://doi.org/10.5281/zenodo.8306035
Ningrum, P. K., & Atanassova, I. (2024). Annotation of scientific uncertainty using linguistic patterns. Scientometrics. https://doi.org/10.1007/s11192-024-05009-z
Created: 2024-11-13



