pie/comagc

Source: Hugging Face. Updated 2024-08-07; indexed 2025-04-26.
Download link:
https://hf-mirror.com/datasets/pie/comagc
Resource description:
# PIE Dataset Card for "CoMAGC"

This is a [PyTorch-IE](https://github.com/ChristophAlt/pytorch-ie) wrapper for the [CoMAGC Hugging Face dataset loading script](https://huggingface.co/datasets/DFKI-SLT/CoMAGC).

## Data Schema

The document type for this dataset is `ComagcDocument`, which defines the following data fields:

- `pmid` (str): unique sentence identifier
- `sentence` (str)
- `cancer_type` (str)
- `cge` (str): change in gene expression
- `ccs` (str): change in cell state
- `pt` (str, optional): proposition type
- `ige` (str, optional): initial gene expression level

and the following annotation layers:

- `gene` (annotation type: `NamedSpan`, target: `sentence`)
- `cancer` (annotation type: `NamedSpan`, target: `sentence`)
- `expression_change_keyword1` (annotation type: `SpanWithNameAndType`, target: `sentence`)
- `expression_change_keyword2` (annotation type: `SpanWithNameAndType`, target: `sentence`)

`NamedSpan` is a custom annotation type that extends the typical `Span` with the following data field:

- `name` (str): entity string between span start and end

`SpanWithNameAndType` is a custom annotation type that extends the typical `Span` with the following data fields:

- `name` (str): entity string between span start and end
- `type` (str): entity type classifying the expression

See [here](https://github.com/ArneBinder/pie-modules/blob/main/src/pie_modules/annotations.py) and [here](https://github.com/ChristophAlt/pytorch-ie/blob/main/src/pytorch_ie/annotations.py) for the annotation type definitions.

## Document Converters

The dataset provides predefined document converters for the following target document types:

- `pie_modules.documents.TextDocumentWithLabeledSpansAndBinaryRelations`:
  - **labeled_spans**: There are always two labeled spans in each sentence. The first one refers to the gene, the second one to the cancer, so the `label` is either `"GENE"` or `"CANCER"`.
  - **binary_relations**: Each sentence has at most one binary relation, always with the gene as `head` and the cancer as `tail`. The `label` is the related **gene-class**, obtained from inference rules (cf. [here](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-14-323/tables/3)) that are based on the values of the columns CGE, CCS, IGE and PT. If no gene-class can be inferred, no binary relation is added to the document. For 303 of the 821 examples, no rule is applicable (cf. [here](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-14-323/tables/7)).

See [here](https://github.com/ArneBinder/pie-modules/blob/main/src/pie_modules/documents.py) and [here](https://github.com/ChristophAlt/pytorch-ie/blob/main/src/pytorch_ie/documents.py) for the document type definitions.
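The converter behavior described above can be illustrated with a minimal sketch in plain Python. It does not use the real `pytorch_ie` or `pie_modules` classes: every dataclass here (`ComagcLikeDocument`, `ConvertedDocument`, and so on) is a hypothetical stand-in, the example sentence and field values are made up, and `toy_rule` is a placeholder rather than the published inference rules. Only the shape of the output follows the card: two labeled spans, plus one gene-to-cancer relation when a gene-class can be inferred.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass(frozen=True)
class NamedSpan:
    """Stand-in for the card's NamedSpan: character offsets plus the covered string."""
    start: int
    end: int
    name: str


@dataclass(frozen=True)
class LabeledSpan:
    start: int
    end: int
    label: str  # "GENE" or "CANCER"


@dataclass(frozen=True)
class BinaryRelation:
    head: LabeledSpan
    tail: LabeledSpan
    label: str  # the inferred gene-class


@dataclass
class ComagcLikeDocument:
    """Hypothetical mirror of the fields the card lists for ComagcDocument."""
    pmid: str
    sentence: str
    cancer_type: str
    cge: str
    ccs: str
    gene: NamedSpan
    cancer: NamedSpan
    pt: Optional[str] = None
    ige: Optional[str] = None


@dataclass
class ConvertedDocument:
    text: str
    labeled_spans: List[LabeledSpan] = field(default_factory=list)
    binary_relations: List[BinaryRelation] = field(default_factory=list)


# A rule maps (cge, ccs, ige, pt) to a gene-class, or None if no rule applies.
RuleFn = Callable[[str, str, Optional[str], Optional[str]], Optional[str]]


def convert(doc: ComagcLikeDocument, infer_gene_class: RuleFn) -> ConvertedDocument:
    gene = LabeledSpan(doc.gene.start, doc.gene.end, "GENE")
    cancer = LabeledSpan(doc.cancer.start, doc.cancer.end, "CANCER")
    out = ConvertedDocument(text=doc.sentence, labeled_spans=[gene, cancer])
    label = infer_gene_class(doc.cge, doc.ccs, doc.ige, doc.pt)
    if label is not None:  # no applicable rule means no relation is added
        out.binary_relations.append(BinaryRelation(head=gene, tail=cancer, label=label))
    return out


def span_of(sentence: str, text: str) -> NamedSpan:
    start = sentence.index(text)
    return NamedSpan(start, start + len(text), text)


def toy_rule(cge, ccs, ige, pt):
    # Placeholder only; the real rules are in the linked table, not here.
    return "toy-class" if cge == "cge-a" else None


sentence = "GENE1 is overexpressed in toy cancer."  # made-up example
demo = ComagcLikeDocument(
    pmid="0", sentence=sentence, cancer_type="toy", cge="cge-a", ccs="ccs-a",
    gene=span_of(sentence, "GENE1"), cancer=span_of(sentence, "toy cancer"),
)
with_relation = convert(demo, toy_rule)
without_relation = convert(demo, lambda cge, ccs, ige, pt: None)
```

Passing the rule in as a function keeps the sketch independent of the actual rule table: `with_relation` carries one relation from the gene span to the cancer span, while `without_relation` keeps both spans but has an empty `binary_relations` list.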

Provider:
pie