Cross-domain Dataset of Scientific Texts in Russian
Source: arXiv. Indexed: 2025-09-30
Download link:
https://github.com/anna-marshalova/automatic-aspect-extraction-from-scientific-texts
Description:
This dataset contains 200 Russian-language abstracts of scientific papers from 10 domains: medicine, history, journalism, law, linguistics, mathematics, pedagogy, physics, psychology, and computer science. Each abstract is annotated with aspects such as task, contribution, method, and conclusion. The annotation was done by two authors, with high inter-annotator agreement (F1 score of 0.92). Abstracts average 115 tokens in length and range from 50 to 177 tokens. The dataset is intended for the task of extracting aspects from scientific texts.
Provided by:
Authors of the paper



