
Annotated Dataset for Uncertainty Mining : Gold Standard

NIAID Data Ecosystem, indexed 2026-05-02
Download link:
https://zenodo.org/record/14134214
Resource description:
Description of the dataset

In order to study the expression of uncertainty in scientific articles, we assembled an interdisciplinary corpus of journals in the fields of Science, Technology and Medicine (STM) and the Humanities and Social Sciences (SHS). The selection of journals is based on the Scimago Journal and Country Rank (SJR) classification, which is built on Scopus, the largest academic database available online. We selected journals covering various disciplines, such as medicine, biochemistry, genetics and molecular biology, computer science, social sciences, environmental sciences, psychology, arts and humanities. For each discipline, we selected the five highest-ranked journals. In addition, we included the journals PLoS ONE and Nature, both of which are interdisciplinary and highly ranked.

Based on this corpus, we created a set of annotated sentences as follows:

593 sentences were pre-selected automatically by searching for occurrences of the uncertainty indices listed by Bongelli et al. (2019), Chen et al. (2018) and Hyland (1996).
The remaining sentences were extracted from a subset of articles consisting of two randomly selected articles per journal. Two human annotators examined these articles to identify and annotate sentences containing uncertainty.
600 sentences not expressing scientific uncertainty were manually identified and reviewed by two annotators.

The sentences were annotated by two independent annotators following the annotation guide proposed by Ningrum and Atanassova (2024). The annotators were trained using the annotation guide and previously annotated sentences in order to ensure consistent annotations.
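Agreement between two independent annotators on a categorical label is commonly summarised with Cohen's kappa, which corrects the raw agreement rate for agreement expected by chance. The sketch below shows the standard formula on a handful of made-up binary labels. The label lists are invented for illustration and are not taken from the corpus, and the function name `cohen_kappa` is an assumption of this example rather than anything shipped with the dataset.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two annotators' label frequencies, summed over labels.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Degenerate case (a single label used throughout); returning 1.0 is a convention.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Made-up labels: 1 = Uncertainty, 0 = No Uncertainty (illustrative only).
annotator_1 = [1, 1, 0, 0, 1, 0, 1, 0, 1, 1]
annotator_2 = [1, 0, 0, 0, 1, 1, 1, 0, 0, 1]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

Here the annotators agree on 7 of 10 items while chance agreement is 0.5, so kappa comes out at about 0.4. How per-dimension scores are averaged into a single figure is not specified in the description, so this sketch covers a single label column only.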
Each sentence was annotated as expressing or not expressing uncertainty (Uncertainty and No Uncertainty). Sentences expressing uncertainty were then annotated along five dimensions: Reference, Nature, Context, Timeline and Expression. The annotators reached an average agreement score of 0.414 according to Cohen's Kappa, which reflects the difficulty of annotating scientific uncertainty. Finally, conflicting annotations were resolved by a third independent annotator. Our final corpus thus consists of a total of 1,840 sentences from 496 articles in 21 English-language journals from 8 different disciplines.

The columns of the table are as follows:

journal: name of the journal from which the article originates
article_title: title of the article from which the sentence is extracted
publication_year: year of publication of the article
sentence_text: text of the sentence expressing or not expressing uncertainty
uncertainty: 1 if the sentence expresses uncertainty and 0 otherwise
ref, nature, context, timeline, expression: annotations of the type of uncertainty according to the annotation framework proposed by Ningrum and Atanassova (2023)

The annotations of each dimension in this dataset are in numeric rather than textual format. The mapping between textual and numeric labels is presented in the table below.

| Dimension  | 1          | 2            | 3        | 4          | 5      |
|------------|------------|--------------|----------|------------|--------|
| Reference  | Author     | Former       | Both     |            |        |
| Nature     | Epistemic  | Aleatory     | Both     |            |        |
| Context    | Background | Methods      | Res&Disc | Conclusion | Others |
| Timeline   | Past       | Present      | Future   |            |        |
| Expression | Quantified | Unquantified |          |            |        |

This gold standard has been produced as part of the ANR InSciM (Modelling Uncertainty in Science) project.

References

Bongelli, R., Riccioni, I., Burro, R., & Zuczkowski, A. (2019). Writers’ uncertainty in scientific and popular biomedical articles. A comparative analysis of the British Medical Journal and Discover Magazine. PLoS ONE, 14(9). https://doi.org/10.1371/journal.pone.0221933

Chen, C., Song, M., & Heo, G. E. (2018). A scalable and adaptive method for finding semantically equivalent cue words of uncertainty. Journal of Informetrics, 12(1), 158–180. https://doi.org/10.1016/j.joi.2017.12.004

Hyland, K. E. (1996). Talking to the academy: Forms of hedging in science research articles. Written Communication, 13(2), 251–281. https://doi.org/10.1177/0741088396013002004

Ningrum, P. K., & Atanassova, I. (2023). Scientific Uncertainty: An Annotation Framework and Corpus Study in Different Disciplines. 19th International Conference of the International Society for Scientometrics and Informetrics (ISSI 2023). https://doi.org/10.5281/zenodo.8306035

Ningrum, P. K., & Atanassova, I. (2024). Annotation of scientific uncertainty using linguistic patterns. Scientometrics. https://doi.org/10.1007/s11192-024-05009-z
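Because the five dimension columns are stored as numeric codes, a reader will usually want to map them back to the textual labels from the mapping table in the description. The following is a minimal sketch of that decoding step, assuming each row is available as a dictionary keyed by the column names listed in the description. The helper `decode_row` and the example row are hypothetical, and the way missing dimension values are stored for sentences without uncertainty is not stated in the description, so any code outside the table is simply mapped to `None`.

```python
# Numeric-to-text label mapping, copied from the mapping table in the description.
LABELS = {
    "ref": {1: "Author", 2: "Former", 3: "Both"},
    "nature": {1: "Epistemic", 2: "Aleatory", 3: "Both"},
    "context": {1: "Background", 2: "Methods", 3: "Res&Disc", 4: "Conclusion", 5: "Others"},
    "timeline": {1: "Past", 2: "Present", 3: "Future"},
    "expression": {1: "Quantified", 2: "Unquantified"},
}


def decode_row(row):
    """Return a copy of row with the five dimension codes replaced by text labels.

    Codes that are missing, non-numeric, or absent from the table become None.
    """
    decoded = dict(row)
    for column, mapping in LABELS.items():
        raw = row.get(column)
        try:
            code = int(float(raw))
        except (TypeError, ValueError, OverflowError):
            code = None
        decoded[column] = mapping.get(code)
    return decoded


# Hypothetical row, invented for illustration; not taken from the dataset.
example = {
    "sentence_text": "These results may indicate a weak effect.",
    "uncertainty": "1",
    "ref": "1",
    "nature": "1",
    "context": "3",
    "timeline": "2",
    "expression": "2",
}
decoded = decode_row(example)
print(decoded["ref"], decoded["context"], decoded["expression"])
```

Rows read with Python's `csv.DictReader` arrive as dictionaries of strings, which is why the helper converts each value before the lookup. The file name and format of the published table are not given here, so loading is left out of the sketch.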
Created:
2024-11-13