scico

Name: scico
Creator: maas
Published: 2025-07-03 16:29:08
License: No description provided

ModelScope community · Updated 2025-07-03 · Indexed 2025-05-31

Download link:

https://modelscope.cn/datasets/allenai/scico

Resource description:

# Dataset Card for SciCo

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [SciCo homepage](https://scico.apps.allenai.org/)
- **Repository:** [SciCo repository](https://github.com/ariecattan/scico)
- **Paper:** [SciCo: Hierarchical Cross-document Coreference for Scientific Concepts](https://openreview.net/forum?id=OFLbgUP04nC)
- **Point of Contact:** [Arie Cattan](mailto:arie.cattan@gmail.com)

### Dataset Summary

SciCo consists of clusters of mentions in context and a hierarchy over them. The corpus is drawn from computer science papers, and the concept mentions are methods and tasks from across CS. Scientific concepts pose significant challenges: they often take diverse forms (e.g., class-conditional image synthesis and categorical image generation) or are ambiguous (e.g., network architecture in AI vs. systems research).
To build SciCo, we develop a new candidate generation approach built on three resources: a low-coverage KB ([https://paperswithcode.com/](https://paperswithcode.com/)), a noisy hypernym extractor, and curated candidates.

### Supported Tasks and Leaderboards

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Languages

The text in the dataset is in English.

## Dataset Structure

### Data Instances

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Data Fields

* `flatten_tokens`: a single list of all tokens in the topic
* `flatten_mentions`: array of mentions, each represented as `[start, end, cluster_id]`
* `tokens`: array of paragraphs
* `doc_ids`: the doc_id of each paragraph in `tokens`
* `metadata`: metadata of each doc_id
* `sentences`: sentence boundaries `[start, end]` for each paragraph in `tokens`
* `mentions`: array of mentions, each represented as `[paragraph_id, start, end, cluster_id]`
* `relations`: array of binary relations between cluster_ids, each represented as `[parent, child]`
* `id`: id of the topic
* `hard_10` and `hard_20` (only in the test set): flags marking the 10% or 20% hardest topics based on Levenshtein similarity
* `source`: source of the topic: PapersWithCode (pwc), hypernym, or curated
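The field layout above can be illustrated with a minimal sketch. The topic below is a hypothetical toy record, not real SciCo data, and the helper names are made up for illustration. Two assumptions are not stated in the card: that `end` indices are inclusive, and that `flatten_tokens` is the plain concatenation of the paragraphs in `tokens`.

```python
from collections import defaultdict

# Hypothetical toy topic mimicking the documented fields (not real SciCo data).
topic = {
    "tokens": [
        ["We", "study", "image", "synthesis", "."],
        ["Class-conditional", "image", "synthesis", "is", "hard", "."],
    ],
    # Each mention: [paragraph_id, start, end, cluster_id]
    "mentions": [
        [0, 2, 3, 0],
        [1, 0, 2, 1],
    ],
    # Each relation: [parent, child] over cluster_ids
    "relations": [[0, 1]],
}


def mention_texts(topic):
    """Group mention surface strings by cluster_id (assumes inclusive `end`)."""
    clusters = defaultdict(list)
    for paragraph_id, start, end, cluster_id in topic["mentions"]:
        span = topic["tokens"][paragraph_id][start:end + 1]
        clusters[cluster_id].append(" ".join(span))
    return dict(clusters)


def children_by_parent(topic):
    """Turn the [parent, child] pairs into a parent -> children mapping."""
    children = defaultdict(list)
    for parent, child in topic["relations"]:
        children[parent].append(child)
    return dict(children)


def to_flat_mentions(topic):
    """Convert paragraph-level mentions to [start, end, cluster_id] offsets,
    assuming the flat token list is the concatenation of all paragraphs."""
    offsets, total = [], 0
    for paragraph in topic["tokens"]:
        offsets.append(total)
        total += len(paragraph)
    return [
        [offsets[p] + start, offsets[p] + end, cluster_id]
        for p, start, end, cluster_id in topic["mentions"]
    ]


clusters = mention_texts(topic)
hierarchy = children_by_parent(topic)
flat_mentions = to_flat_mentions(topic)
```

With a real topic, the same helpers would be applied to one record of the dataset; checking the computed offsets against the provided `flatten_mentions` field is a quick way to confirm or reject the two assumptions.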
### Data Splits

|           | Train | Validation |  Test |
|-----------|------:|-----------:|------:|
| Topics    |   221 |        100 |   200 |
| Documents |  9013 |       4120 |  8237 |
| Mentions  | 10925 |       4874 | 10424 |
| Clusters  |  4080 |       1867 |  3711 |
| Relations |  2514 |       1747 |  2379 |

## Dataset Creation

### Curation Rationale

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the source language producers?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Annotations

#### Annotation process

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the annotators?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Personal and Sensitive Information

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Discussion of Biases

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Other Known Limitations

## Additional Information

### Dataset Curators

This dataset was initially created by Arie Cattan, Sophie Johnson, Daniel Weld, Ido Dagan, Iz Beltagy, Doug Downey and Tom Hope, while Arie was an intern at the Allen Institute for Artificial Intelligence.
### Licensing Information

This dataset is distributed under [Apache License 2.0](http://www.apache.org/licenses/LICENSE-2.0).

### Citation Information

```
@inproceedings{
    cattan2021scico,
    title={SciCo: Hierarchical Cross-Document Coreference for Scientific Concepts},
    author={Arie Cattan and Sophie Johnson and Daniel S. Weld and Ido Dagan and Iz Beltagy and Doug Downey and Tom Hope},
    booktitle={3rd Conference on Automated Knowledge Base Construction},
    year={2021},
    url={https://openreview.net/forum?id=OFLbgUP04nC}
}
```

### Contributions

Thanks to [@ariecattan](https://github.com/ariecattan) for adding this dataset.


Provider:

maas

Created:

2025-05-27
