mevol/protein_structure_NER_model_v1.4

Name: mevol/protein_structure_NER_model_v1.4
Creator: mevol
Published: 2023-11-01 10:00:18
License: 暂无描述

Hugging Face2023-11-01 更新2024-03-04 收录

下载链接：

https://hf-mirror.com/datasets/mevol/protein_structure_NER_model_v1.4

下载链接

链接失效反馈

官方服务：

资源简介：

--- license: mit language: - en tags: - biology - protein structure - token classification configs: - config_name: protein_structure_NER_model_v1.4 data_files: - split: train path: "annotation_IOB/train.tsv" - split: dev path: "annotation_IOB/dev.tsv" - split: test path: "annotation_IOB/test.tsv" --- ## Overview This data was used to train model: https://huggingface.co/mevol/BiomedNLP-PubMedBERT-ProteinStructure-NER-v1.4 There are 19 different entity types in this dataset: "chemical", "complex_assembly", "evidence", "experimental_method", "gene", "mutant", "oligomeric_state", "protein", "protein_state", "protein_type", "ptm", "residue_name", "residue_name_number","residue_number", "residue_range", "site", "species", "structure_element", "taxonomy_domain" The data prepared as IOB formated input has been used during training, develiopment and testing. Additional data formats such as JSON and XML as well as CSV files are also available and are described below. Annotation was carried out with the free annotation tool TeamTat (https://www.teamtat.org/) and documents were downloaded as BioC XML before converting them to IOB, annotation only JSON and CSV format. The number of annotations and sentences in each file is given below: | document ID | number of annotations in BioC XML | number of annotations in IOB/JSON/CSV | number of sentences | | --- | --- | --- | --- | | PMC4850273 | 1121 | 1121 | 204 | | PMC4784909 | 865 | 865 | 204 | | PMC4850288 | 716 | 708 | 146 | | PMC4887326 | 933 | 933 | 152 | | PMC4833862 | 1044 | 1044 | 192 | | PMC4832331 | 739 | 718 | 134 | | PMC4852598 | 1229 | 1218 | 250 | | PMC4786784 | 1549 | 1549 | 232 | | PMC4848090 | 987 | 985 | 191 | | PMC4792962 | 1268 | 1268 | 256 | | PMC4841544 | 1434 | 1433 | 273 | | PMC4772114 | 825 | 825 | 166 | | PMC4872110 | 1276 | 1276 | 253 | | PMC4848761 | 887 | 883 | 252 | | PMC4919469 | 1628 | 1616 | 336 | | PMC4880283 | 771 | 771 | 166 | | PMC4937829 | 625 | 625 | 181 | | PMC4968113 | 1238 | 1238 | 292 | | PMC4854314 | 481 | 471 | 139 | | PMC4871749 | 383 | 383 | 76 | | total | 19999 | 19930 | 4095 | Documents and annotations are easiest viewed by using the BioC XML files and opening them in free annotation tool TeamTat. More about the BioC format can be found here: https://bioc.sourceforge.net/ ## Raw BioC XML files These are the raw, un-annotated XML files for the publications in the dataset in BioC format. The files are found in the directory: "raw_BioC_XML". There is one file for each document and they follow standard naming "unique PubMedCentral ID"_raw.xml. ## Annotations in IOB format The IOB formated files can be found in the directory: "annotation_IOB" The four files are as follows: * all.tsv --> all sentences and annotations used to create model "mevol/BiomedNLP-PubMedBERT-ProteinStructure-NER-v1.4"; 4095 sentences * train.tsv --> training subset of the data; 2866 sentences * dev.tsv --> development subset of the data; 614 sentences * test.tsv --> testing subset of the data; 615 sentences The total number of annotations is: 19930 ## Annotations in BioC JSON The BioC formated JSON files of the publications have been downloaded from the annotation tool TeamTat. The files are found in the directory: "annotated_BioC_JSON" There is one file for each document and they follow standard naming "unique PubMedCentral ID"_ann.json Each document JSON contains the following relevant keys: * "sourceid" --> giving the numerical part of the unique PubMedCentral ID * "text" --> containing the complete raw text of the publication as a string * "denotations" --> containing a list of all the annotations for the text Each annotation is a dictionary with the following keys: * "span" --> gives the start and end of the annotatiom span defined by sub keys: * "begin" --> character start position of annotation * "end" --> character end position of annotation * "obj" --> a string containing a number of terms that can be separated by ","; the order of the terms gives the following: entity type, reference to ontology, annotator, time stamp * "id" --> unique annotation ID Here an example: ```json [{"sourceid":"4784909", "sourcedb":"", "project":"", "target":"", "text":"", "denotations":[{"span":{"begin":24, "end":34}, "obj":"chemical,CHEBI:,melaniev@ebi.ac.uk,2023-03-21T15:19:42Z", "id":"4500"}, {"span":{"begin":50, "end":59}, "obj":"taxonomy_domain,DUMMY:,melaniev@ebi.ac.uk,2023-03-21T15:15:03Z", "id":"1281"}] } ] ``` ## Annotations in BioC XML The BioC formated XML files of the publications have been downloaded from the annotation tool TeamTat. The files are found in the directory: "annotated_BioC_XML" There is one file for each document and they follow standard naming "unique PubMedCentral ID_ann.xml The key XML tags to be able to visualise the annotations in TeamTat as well as extracting them to create the training data are "passage" and "offset". The "passage" tag encloses a text passage or paragraph to which the annotations are linked. "Offset" gives the passage/ paragraph offset and allows to determine the character starting and ending postions of the annotations. The tag "text" encloses the raw text of the passage. Each annotation in the XML file is tagged as below: * "annotation id=" --> giving the unique ID of the annotation * "infon key="type"" --> giving the entity type of the annotation * "infon key="identifier"" --> giving a reference to an ontology for the annotation * "infon key="annotator"" --> giving the annotator * "infon key="updated_at"" --> providing a time stamp for annotation creation/update * "location" --> start and end character positions for the annotated text span * "offset" --> start character position as defined by offset value * "length" --> length of the annotation span; sum of "offset" and "length" creates the end character position Here is a basic example of what the BioC XML looks like. Additional tags for document management are not given. Please refer to the documenttation to find out more. ```xml <?xml version="1.0" encoding="UTF-8"?> <!DOCTYPE collection SYSTEM "BioC.dtd"> <collection> <source>PMC</source> <date>20140719</date> <key>pmc.key</key> <document> <id>4784909</id> <passage> <offset>0</offset> <text>The Structural Basis of Coenzyme A Recycling in a Bacterial Organelle</text> <annotation id="4500"> <infon key="type">chemical</infon> <infon key="identifier">CHEBI:</infon> <infon key="annotator">melaniev@ebi.ac.uk</infon> <infon key="updated_at">2023-03-21T15:19:42Z</infon> <location offset="24" length="10"/> <text>Coenzyme A</text> </annotation> </passage> </document> </collection> ``` ## Annotations in CSV The annotations and the relevant sentences they have been found in have also been made available as tab-separated CSV files, one for each publication in the dataset. The files can be found in directory "annotation_CSV". Each file is named as "unique PubMedCentral ID".csv. The column labels in the CSV files are as follows: * "anno_start" --> character start position of the annotation * "anno_end" --> character end position of the annotation * "anno_text" --> text covered by the annotation * "entity_type" --> entity type of the annotation * "sentence" --> sentence text in which the annotation was found * "section" --> publication section in which the annotation was found ## Annotations in JSON A combined JSON file was created only containing the relevant sentences and associated annotations for each publication in the dataset. The file can be found in directory "annotation_JSON" under the name "annotations.json". The following keys are used: * "PMC4850273" --> unique PubMedCentral of the publication * "annotations" --> list of dictionaries for the relevant, annotated sentences of the document; each dictionary has the following sub keys * "sid" --> unique sentence ID * "sent" --> sentence text as string * "section" --> publication section the sentence is in * "ner" --> nested list of annotations; each sublist contains the following items: start character position, end character position, annotation text, entity type Here is an example of a sentence and its annotations: ```json {"PMC4850273": {"annotations": [{"sid": 0, "sent": "Molecular Dissection of Xyloglucan Recognition in a Prominent Human Gut Symbiont", "section": "TITLE", "ner": [ [24,34,"Xyloglucan","chemical"], [62,67,"Human","species"],] },] }} ```

提供机构：

mevol

原始信息汇总

数据集概述

数据集用途

该数据集用于训练蛋白质结构命名实体识别（NER）模型。

实体类型

数据集中包含19种不同的实体类型：

chemical
complex_assembly
evidence
experimental_method
gene
mutant
oligomeric_state
protein
protein_state
protein_type
ptm
residue_name
residue_name_number
residue_number
residue_range
site
species
structure_element
taxonomy_domain

数据格式

数据以IOB格式准备，用于训练、开发和测试。此外，还提供JSON、XML和CSV格式的数据。

数据文件配置

配置名称: protein_structure_NER_model_v1.4
数据文件:
- 训练集: annotation_IOB/train.tsv
- 开发集: annotation_IOB/dev.tsv
- 测试集: annotation_IOB/test.tsv

数据统计

document ID	注释数量（BioC XML）	注释数量（IOB/JSON/CSV）	句子数量
PMC4850273	1121	1121	204
PMC4784909	865	865	204
PMC4850288	716	708	146
PMC4887326	933	933	152
PMC4833862	1044	1044	192
PMC4832331	739	718	134
PMC4852598	1229	1218	250
PMC4786784	1549	1549	232
PMC4848090	987	985	191
PMC4792962	1268	1268	256
PMC4841544	1434	1433	273
PMC4772114	825	825	166
PMC4872110	1276	1276	253
PMC4848761	887	883	252
PMC4919469	1628	1616	336
PMC4880283	771	771	166
PMC4937829	625	625	181
PMC4968113	1238	1238	292
PMC4854314	481	471	139
PMC4871749	383	383	76
总计	19999	19930	4095

数据文件目录

原始BioC XML文件: raw_BioC_XML
IOB格式文件: annotation_IOB
BioC JSON文件: annotated_BioC_JSON
BioC XML文件: annotated_BioC_XML
CSV文件: annotation_CSV
JSON文件: annotation_JSON

数据文件详情

IOB格式文件:
- all.tsv: 包含所有用于创建模型的句子和注释，共4095个句子。
- train.tsv: 训练数据子集，共2866个句子。
- dev.tsv: 开发数据子集，共614个句子。
- test.tsv: 测试数据子集，共615个句子。
- 注释总数: 19930
BioC JSON文件:
- 每个文档一个文件，命名格式为unique PubMedCentral ID_ann.json。
- 包含以下键:
  - sourceid: 唯一PubMedCentral ID的数值部分。
  - text: 出版物的完整原始文本。
  - denotations: 文本的所有注释列表。
BioC XML文件:
- 每个文档一个文件，命名格式为unique PubMedCentral ID_ann.xml。
- 包含以下标签:
  - annotation id: 唯一注释ID。
  - infon key="type": 注释的实体类型。
  - infon key="identifier": 注释的参考本体。
  - infon key="annotator": 注释者。
  - infon key="updated_at": 注释创建/更新时间戳。
  - location: 注释文本的起始和结束字符位置。
CSV文件:
- 每个文档一个文件，命名格式为unique PubMedCentral ID.csv。
- 包含以下列:
  - anno_start: 注释的起始字符位置。
  - anno_end: 注释的结束字符位置。
  - anno_text: 注释覆盖的文本。
  - entity_type: 注释的实体类型。
  - sentence: 包含注释的句子文本。
  - section: 注释所在的出版物部分。
JSON文件:
- 包含所有出版物的相关句子和关联注释的组合JSON文件，位于annotation_JSON目录下，命名为annotations.json。
- 包含以下键:
  - PMC4850273: 出版物的唯一PubMedCentral ID。
  - annotations: 文档的相关注释句子列表，每个句子包含以下子键:
    - sid: 唯一句子ID。
    - sent: 句子文本。
    - section: 句子所在的出版物部分。
    - ner: 嵌套的注释列表，每个子列表包含起始字符位置、结束字符位置、注释文本和实体类型。

搜集汇总

数据集介绍

背景与挑战

背景概述

该数据集是一个专注于蛋白质结构命名实体识别的生物医学数据集，包含19种不同的实体类型和多种注释格式，总注释数达19930个，覆盖4095个句子。尽管存在数据生成时的列不匹配问题，但数据集仍提供了丰富的生物医学文本注释资源。

以上内容由遇见数据集搜集并总结生成

5,000+

优质数据集

54 个

任务类型

进入经典数据集