vadis/sv-ident
Dataset Overview
Dataset Name
- Name: SV-Ident
- Alias: none
Basic Dataset Information
- Languages: English (en), German (de)
- License: MIT
- Multilinguality: multilingual
- Dataset size: 1K<n<10K
- Source data: original
Dataset Tasks
- Task category: text classification
- Specific tasks:
  - Multi-label classification
  - Semantic similarity classification
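For the multi-label classification task, each sentence is typically mapped to a binary target vector over the variable vocabulary. The following is a minimal sketch of that encoding, not the shared task's official preprocessing; the helper names `build_vocab` and `multi_hot` and the variable IDs in the usage note are hypothetical.

```python
from typing import Dict, List, Sequence


def build_vocab(variable_ids: Sequence[str]) -> Dict[str, int]:
    """Map each distinct variable ID to a stable column index (sorted order)."""
    return {vid: idx for idx, vid in enumerate(sorted(set(variable_ids)))}


def multi_hot(mentioned: Sequence[str], vocab: Dict[str, int]) -> List[int]:
    """Encode the variables mentioned in one sentence as a 0/1 vector.

    IDs that are not in the vocabulary are ignored (an assumption of this sketch).
    """
    vec = [0] * len(vocab)
    for vid in mentioned:
        if vid in vocab:
            vec[vocab[vid]] = 1
    return vec
```

With a hypothetical vocabulary built from `["v2", "v1", "v3"]`, a sentence mentioning `v3` and `v1` is encoded as `[1, 0, 1]`, and a sentence with no variable mention is encoded as all zeros.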
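Dataset Structure
- Data instances: each instance contains a sentence, a flag for whether it mentions a variable, the mentioned variable(s), a research data ID, a document ID, a unique ID, and a language.
- Data fields:
  - sentence: the text instance, which may contain a variable mention.
  - is_variable: label indicating whether the text instance contains a variable mention.
  - variable: the variable(s) mentioned in the text instance.
  - research_data: ID of the research data associated with the instance.
  - doc_id: ID of the source document.
  - uuid: unique ID of the instance.
  - lang: language of the sentence.
- Data splits:
  - Training set: 3,823 sentences
  - Validation set: 425 sentences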
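The record layout described above can be sketched as a small Python type. The field names come from the card; the field types (for example, `is_variable` as 0/1 and `variable` as a list of IDs) are assumptions, and the two example records are made up for illustration and are not taken from the dataset.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SVIdentInstance:
    sentence: str             # text instance, may contain a variable mention
    is_variable: int          # assumed encoding: 1 = mentions a variable, 0 = does not
    variable: List[str]       # assumed: list of mentioned variable IDs
    research_data: List[str]  # assumed: list of associated research data IDs
    doc_id: str               # ID of the source document
    uuid: str                 # unique ID of the instance
    lang: str                 # "en" or "de"


def with_variable_mentions(instances: List[SVIdentInstance]) -> List[SVIdentInstance]:
    """Keep only the instances labeled as containing a variable mention."""
    return [inst for inst in instances if inst.is_variable == 1]


# Hypothetical records, for illustration only (not real dataset rows).
EXAMPLES = [
    SVIdentInstance(
        sentence="Respondents were asked how often they attend religious services.",
        is_variable=1,
        variable=["hypothetical-var-1"],
        research_data=["hypothetical-study-1"],
        doc_id="hypothetical-doc-1",
        uuid="hypothetical-uuid-1",
        lang="en",
    ),
    SVIdentInstance(
        sentence="Die Ergebnisse werden im Folgenden diskutiert.",
        is_variable=0,
        variable=[],
        research_data=[],
        doc_id="hypothetical-doc-2",
        uuid="hypothetical-uuid-2",
        lang="de",
    ),
]

# Split sizes as stated in the card.
TRAIN_SIZE = 3823
VALIDATION_SIZE = 425
validation_share = VALIDATION_SIZE / (TRAIN_SIZE + VALIDATION_SIZE)
```

Given the split sizes listed in the card, the validation set accounts for roughly 10% of the 4,248 sentences in the two splits.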
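Dataset Creation
- Rationale: created by the VADIS project for the shared task "Survey Variable Identification in Social Science Publications".
- Source data: unprocessed data from GESIS.
- Annotators: two expert annotators.
- Personal and sensitive information: the dataset contains no personal or sensitive information.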
Licensing Information
- The data originate from the Social Science Open Access Repository (SSOAR) and follow the corresponding licenses.
Citation Information
@inproceedings{tsereteli-etal-2022-overview,
    title = "Overview of the {SV}-Ident 2022 Shared Task on Survey Variable Identification in Social Science Publications",
    author = "Tsereteli, Tornike and Kartal, Yavuz Selim and Ponzetto, Simone Paolo and Zielinski, Andrea and Eckert, Kai and Mayr, Philipp",
    booktitle = "Proceedings of the Third Workshop on Scholarly Document Processing",
    month = oct,
    year = "2022",
    address = "Gyeongju, Republic of Korea",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.sdp-1.29",
    pages = "229--246",
    abstract = "In this paper, we provide an overview of the SV-Ident shared task as part of the 3rd Workshop on Scholarly Document Processing (SDP) at COLING 2022. In the shared task, participants were provided with a sentence and a vocabulary of variables, and asked to identify which variables, if any, are mentioned in individual sentences from scholarly documents in full text. Two teams made a total of 9 submissions to the shared task leaderboard. While none of the teams improve on the baseline systems, we still draw insights from their submissions. Furthermore, we provide a detailed evaluation. Data and baselines for our shared task are freely available at \url{https://github.com/vadis-project/sv-ident}.",
}



