SCICM

Source: arXiv · Updated 2023-12-18 · Indexed 2024-08-06
Download link:
http://arxiv.org/abs/2311.08189v3
Overview:
SCICM is a high-quality dataset designed for cross-modal information extraction from scientific literature, created jointly by Nankai University, Tokyo Institute of Technology, Microsoft Research Asia, and the Beijing Academy of Artificial Intelligence. It contains more than 12,817 papers spanning computer science, statistics, electrical engineering and systems science, and other fields. The dataset was built with a semi-supervised workflow: domain experts first manually annotated a small number of papers, models trained on those annotations then labeled additional papers automatically, and experts finally reviewed the results to ensure annotation quality. SCICM targets the challenges of information extraction from scientific literature, in particular entity and relation extraction from text and tables, and is intended to support downstream tasks such as scientific knowledge graph construction, academic question answering, and method recommendation.
Provided by:
Nankai University
Created:
2023-11-14