pie/comagc

Hugging Face · updated 2024-08-07 · indexed 2025-04-26

Download link:

https://hf-mirror.com/datasets/pie/comagc

Description:

# PIE Dataset Card for "CoMAGC"

This is a [PyTorch-IE](https://github.com/ChristophAlt/pytorch-ie) wrapper for the [CoMAGC Hugging Face dataset loading script](https://huggingface.co/datasets/DFKI-SLT/CoMAGC).

## Data Schema

The document type for this dataset is `ComagcDocument`, which defines the following data fields:

- `pmid` (str): unique sentence identifier
- `sentence` (str)
- `cancer_type` (str)
- `cge` (str): change in gene expression
- `ccs` (str): change in cell state
- `pt` (str, optional): proposition type
- `ige` (str, optional): initial gene expression level

and the following annotation layers:

- `gene` (annotation type: `NamedSpan`, target: `sentence`)
- `cancer` (annotation type: `NamedSpan`, target: `sentence`)
- `expression_change_keyword1` (annotation type: `SpanWithNameAndType`, target: `sentence`)
- `expression_change_keyword2` (annotation type: `SpanWithNameAndType`, target: `sentence`)

`NamedSpan` is a custom annotation type that extends the typical `Span` with the following data field:

- `name` (str): entity string between span start and end

`SpanWithNameAndType` is a custom annotation type that extends the typical `Span` with the following data fields:

- `name` (str): entity string between span start and end
- `type` (str): entity type classifying the expression

See [here](https://github.com/ArneBinder/pie-modules/blob/main/src/pie_modules/annotations.py) and [here](https://github.com/ChristophAlt/pytorch-ie/blob/main/src/pytorch_ie/annotations.py) for the annotation type definitions.

## Document Converters

The dataset provides predefined document converters for the following target document types:

- `pie_modules.documents.TextDocumentWithLabeledSpansAndBinaryRelations`:
  - **labeled_spans**: There are always two labeled spans in each sentence. The first one refers to the gene, the second one to the cancer. Therefore, the `label` is either `"GENE"` or `"CANCER"`.
  - **binary_relations**: There is at most one binary relation in each sentence, always with the gene as `head` and the cancer as `tail`. The `label` is the related **gene-class**, obtained from inference rules (cf. [here](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-14-323/tables/3)) that are based on the values of the columns CGE, CCS, IGE and PT. If no gene-class can be inferred, no binary relation is added to the document. For 303 of the 821 examples, no rule is applicable (cf. [here](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-14-323/tables/7)).

See [here](https://github.com/ArneBinder/pie-modules/blob/main/src/pie_modules/documents.py) and [here](https://github.com/ChristophAlt/pytorch-ie/blob/main/src/pytorch_ie/documents.py) for the document type definitions.
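The converter behaviour described above can be illustrated with a minimal, self-contained sketch. It does not use the actual PyTorch-IE or pie-modules classes: `NamedSpan`, `LabeledSpan`, `BinaryRelation`, `ComagcRecord` and `convert` are simplified stand-ins written for this example, the sentence and column values are invented placeholders, and `toy_rule` is not one of the published inference rules (those are in the linked Table 3). The sketch only shows the shape of the output: two labeled spans, plus a gene-to-cancer relation when a gene-class can be inferred.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class NamedSpan:
    """Stand-in for the card's NamedSpan: offsets plus the covered string."""
    start: int
    end: int
    name: str


@dataclass(frozen=True)
class LabeledSpan:
    start: int
    end: int
    label: str


@dataclass(frozen=True)
class BinaryRelation:
    head: LabeledSpan
    tail: LabeledSpan
    label: str


@dataclass(frozen=True)
class ComagcRecord:
    """Hypothetical, reduced view of a ComagcDocument."""
    sentence: str
    cge: str
    ccs: str
    pt: Optional[str]
    ige: Optional[str]
    gene: NamedSpan
    cancer: NamedSpan


# A rule maps (cge, ccs, ige, pt) to a gene-class, or None if no rule applies.
Rule = Callable[[str, str, Optional[str], Optional[str]], Optional[str]]


def convert(rec: ComagcRecord, infer_gene_class: Rule) -> Tuple[List[LabeledSpan], List[BinaryRelation]]:
    """Build the two labeled spans and, if a gene-class is inferred, one relation."""
    gene = LabeledSpan(rec.gene.start, rec.gene.end, "GENE")
    cancer = LabeledSpan(rec.cancer.start, rec.cancer.end, "CANCER")
    gene_class = infer_gene_class(rec.cge, rec.ccs, rec.ige, rec.pt)
    relations: List[BinaryRelation] = []
    if gene_class is not None:
        # The gene is always the head and the cancer the tail.
        relations.append(BinaryRelation(head=gene, tail=cancer, label=gene_class))
    return [gene, cancer], relations


def make_span(sentence: str, text: str) -> NamedSpan:
    """Locate `text` in `sentence` and wrap it as a NamedSpan."""
    start = sentence.index(text)
    return NamedSpan(start, start + len(text), text)


def toy_rule(cge: str, ccs: str, ige: Optional[str], pt: Optional[str]) -> Optional[str]:
    """Placeholder rule for illustration only, NOT a rule from the paper."""
    return "PLACEHOLDER_CLASS" if pt is not None else None


# Invented example sentence and placeholder column values.
sentence = "GENE_X is overexpressed in prostate cancer."
record = ComagcRecord(
    sentence=sentence,
    cge="placeholder-cge",
    ccs="placeholder-ccs",
    pt="placeholder-pt",
    ige=None,
    gene=make_span(sentence, "GENE_X"),
    cancer=make_span(sentence, "prostate cancer"),
)

spans, relations = convert(record, toy_rule)
_, no_relations = convert(record, lambda cge, ccs, ige, pt: None)
print([s.label for s in spans], len(relations), len(no_relations))
```

Passing the rule set in as a function keeps the span and relation construction separate from the inference rules themselves, which this sketch deliberately does not reproduce.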

Provider:

pie
