pie/comagc
Source: Hugging Face. Updated 2024-08-07; indexed 2025-04-26.
Download link:
https://hf-mirror.com/datasets/pie/comagc
# PIE Dataset Card for "CoMAGC"
This is a [PyTorch-IE](https://github.com/ChristophAlt/pytorch-ie) wrapper for the
[CoMAGC Hugging Face dataset loading script](https://huggingface.co/datasets/DFKI-SLT/CoMAGC).
## Data Schema
The document type for this dataset is `ComagcDocument`, which defines the following data fields:
- `pmid` (str): unique sentence identifier
- `sentence` (str)
- `cancer_type` (str)
- `cge` (str): change in gene expression
- `ccs` (str): change in cell state
- `pt` (str, optional): proposition type
- `ige` (str, optional): initial gene expression level
and the following annotation layers:
- `gene` (annotation type: `NamedSpan`, target: `sentence`)
- `cancer` (annotation type: `NamedSpan`, target: `sentence`)
- `expression_change_keyword1` (annotation type: `SpanWithNameAndType`, target: `sentence`)
- `expression_change_keyword2` (annotation type: `SpanWithNameAndType`, target: `sentence`)
`NamedSpan` is a custom annotation type that extends the standard `Span` with the following data field:
- `name` (str): entity string between span start and end
`SpanWithNameAndType` is a custom annotation type that extends the standard `Span` with the following data fields:
- `name` (str): entity string between span start and end
- `type` (str): entity type classifying the expression
See [here](https://github.com/ArneBinder/pie-modules/blob/main/src/pie_modules/annotations.py) and
[here](https://github.com/ChristophAlt/pytorch-ie/blob/main/src/pytorch_ie/annotations.py) for the annotation
type definitions.
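The relationship between a span's offsets and its `name` field can be illustrated with a minimal sketch in plain Python. The dataclasses below are stand-ins written for this example, not the actual `pytorch_ie` or `pie_modules` classes; the helper `make_named_span`, the example sentence, and the `type` value `"EXAMPLE_TYPE"` are invented for illustration. The sketch also assumes that span ends are exclusive character offsets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Stand-in for a character-offset span over the `sentence` field."""
    start: int
    end: int  # assumed to be exclusive


@dataclass(frozen=True)
class NamedSpan(Span):
    """Span that additionally stores the covered entity string."""
    name: str


@dataclass(frozen=True)
class SpanWithNameAndType(Span):
    """Span that stores the covered string and a type for the expression."""
    name: str
    type: str


def make_named_span(sentence: str, name: str) -> NamedSpan:
    """Hypothetical helper: locate `name` in `sentence` and build a span."""
    start = sentence.index(name)
    return NamedSpan(start=start, end=start + len(name), name=name)


# Invented example sentence, not taken from the dataset.
sentence = "GENE1 is overexpressed in prostate cancer."
gene = make_named_span(sentence, "GENE1")
cancer = make_named_span(sentence, "prostate cancer")

kw_start = sentence.index("overexpressed")
keyword = SpanWithNameAndType(
    start=kw_start,
    end=kw_start + len("overexpressed"),
    name="overexpressed",
    type="EXAMPLE_TYPE",  # placeholder, not a real CoMAGC type value
)

# In every case, `name` equals the text between `start` and `end`.
for span in (gene, cancer, keyword):
    assert sentence[span.start:span.end] == span.name
```

Storing `name` alongside the offsets is redundant with the sentence text, but it makes each annotation readable on its own.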
## Document Converters
The dataset provides predefined document converters for the following target document types:
- `pie_modules.documents.TextDocumentWithLabeledSpansAndBinaryRelations`:
  - **labeled_spans**: There are always two labeled spans in each sentence.
    The first one refers to the gene, while the second one refers to the cancer.
    Therefore, the `label` is either `"GENE"` or `"CANCER"`.
  - **binary_relations**: There is at most one binary relation in each sentence.
    This relation is always established between the gene as `head` and the cancer as `tail`.
    The specific `label` is the related **gene-class**. It is obtained from inference rules (cf. [here](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-14-323/tables/3))
    that are based on the values of the columns CGE, CCS, IGE and PT. If no gene-class can be inferred,
    no binary relation is added to the document. In total, no rule is applicable to 303 of the 821 examples
    (cf. [here](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-14-323/tables/7)).
See [here](https://github.com/ArneBinder/pie-modules/blob/main/src/pie_modules/documents.py) and
[here](https://github.com/ChristophAlt/pytorch-ie/blob/main/src/pytorch_ie/documents.py) for the document type
definitions.
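The conversion logic described above can be sketched roughly as follows. This is not the dataset's actual converter: `LabeledSpan` and `BinaryRelation` are simplified stand-ins, the record layout is assumed, and `infer_gene_class` is a hypothetical callable standing in for the published inference rules, which are not reproduced here. The label `"EXAMPLE_CLASS"` and all field values are invented.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class LabeledSpan:
    start: int
    end: int
    label: str


@dataclass(frozen=True)
class BinaryRelation:
    head: LabeledSpan
    tail: LabeledSpan
    label: str


# Hypothetical signature: (cge, ccs, ige, pt) -> gene-class or None.
RuleFn = Callable[[str, str, Optional[str], Optional[str]], Optional[str]]


def convert(record: dict, infer_gene_class: RuleFn) -> Tuple[List[LabeledSpan], List[BinaryRelation]]:
    """Build two labeled spans and at most one gene -> cancer relation."""
    gene = LabeledSpan(*record["gene"], label="GENE")
    cancer = LabeledSpan(*record["cancer"], label="CANCER")
    spans = [gene, cancer]  # gene first, cancer second

    gene_class = infer_gene_class(
        record["cge"], record["ccs"], record.get("ige"), record.get("pt")
    )
    relations: List[BinaryRelation] = []
    if gene_class is not None:
        # Only add a relation when a rule yields a gene-class.
        relations.append(BinaryRelation(head=gene, tail=cancer, label=gene_class))
    return spans, relations


# Invented record; offsets and field values are placeholders.
record = {"gene": (0, 5), "cancer": (26, 41), "cge": "x", "ccs": "y", "ige": None, "pt": None}

# Two toy rule functions: one always matches, one never does.
spans, rels = convert(record, lambda cge, ccs, ige, pt: "EXAMPLE_CLASS")
spans_none, rels_none = convert(record, lambda cge, ccs, ige, pt: None)
```

Passing the rule function as a parameter keeps the sketch independent of the concrete rule table, so the "no applicable rule" case is simply a `None` return that leaves the relation list empty.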
Provider: pie



