biomap-research/ssp_q8

Name: biomap-research/ssp_q8
Creator: biomap-research
Published: 2024-09-20 08:11:53
License: no description provided

Hugging Face, updated 2024-09-20, indexed 2025-04-12

Download link:

https://hf-mirror.com/datasets/biomap-research/ssp_q8

Resource overview:

---
dataset_info:
  features:
  - name: seq
    dtype: string
  - name: label
    sequence: int64
  splits:
  - name: train
    num_bytes: 24941535
    num_examples: 10848
  - name: test
    num_bytes: 1665908
    num_examples: 667
  download_size: 3931715
  dataset_size: 26607443
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
license: apache-2.0
task_categories:
- token-classification
tags:
- chemistry
- biology
size_categories:
- 10K<n<100K
---

# Dataset Card for Secondary Structure Prediction (Q8) Dataset

### Dataset Summary

The study of a protein’s secondary structure (Sec. Struc. P.) forms a fundamental cornerstone in understanding its biological function. This secondary structure, comprising helices, strands, and various turns, bestows the protein with a specific three-dimensional configuration, which is critical for the formation of its tertiary structure. In this dataset, each residue of a protein sequence is assigned one of eight categories, each representing a different structural element: H - Alpha-helix, G - 3-10 helix, I - Pi helix, E - Beta-strand, B - Beta-bridge, T - Turn, S - Bend, C - Coil (or random coil).

## Dataset Structure

### Data Instances

Each instance consists of a string holding the protein sequence and a sequence of structural labels. See the [Secondary structure prediction dataset viewer](https://huggingface.co/datasets/Bo1015/ssp_q8/viewer/default/test) to explore more examples.

```
{'seq': 'GKITFYEDRGFQGRHYECSSDHSNLQPYFSRCNSIRVDSGCWMLYEQPNFQGPQYFLRRGDYPDYQQWMGLNDSIRSCRLIPHTGSHRLRIYEREDYRGQMVEITEDCSSLHDRFHFSEIHSFNVLEGWWVLYEMTNYRGRQYLLRPGDYRRYHDWGATNARVGSLRRAVDFY',
 'label': [ 7, 4, 4, 4, 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 4, 4, 4, 7, 5, 7, 3, 5, 7, 7, 6, 6, 6, 7, 5, 7, 7, 5, 4, 4, 4, 4, 4, 4, 5, 4, 4, 4, 4, 4, 5, 5, 0, 0, 0, 7, 5, 7, 4, 4, 4, 4, 7, 5, 4, 4, 4, 5, 5, 6, 6, 6, 6, 6, 7, 5, 5, 5, 7, 7, 7, 4, 4, 4, 4, 4, 7, 7, 7, 5, 7, 7, 4, 4, 4, 4, 4, 5, 5, 0, 0, 0, 7, 5, 7, 4, 4, 4, 4, 7, 5, 7, 3, 5, 7, 5, 6, 6, 6, 5, 5, 7, 7, 7, 7, 7, 4, 4, 4, 4, 4, 4, 5, 7, 4, 4, 4, 4, 5, 5, 5, 5, 5, 7, 5, 7, 4, 4, 4, 4, 7, 5, 4, 4, 4, 7, 5, 0, 0, 0, 0, 6, 7, 5, 5, 7, 7, 7, 7, 4, 4, 4, 4, 7, 7, 7, 7, 7 ]}
```

The mean counts for `seq` and for each `label` value are provided below:

| Feature    | Mean Count |
| ---------- | ---------- |
| seq        | 256        |
| label (0)  | 9          |
| label (1)  | 82         |
| label (2)  | 1          |
| label (3)  | 3          |
| label (4)  | 52         |
| label (5)  | 26         |
| label (6)  | 19         |
| label (7)  | 63         |

### Data Fields

- `seq`: a string containing the protein sequence
- `label`: a sequence containing the structural label of each residue.

### Data Splits

The secondary structure prediction dataset has 2 splits: _train_ and _test_. Below are the statistics of the dataset.

| Dataset Split | Number of Instances in Split |
| ------------- | ---------------------------- |
| Train         | 10,848                       |
| Test          | 667                          |

### Source Data

#### Initial Data Collection and Normalization

The datasets applied in this study were originally published by [NetSurfP-2.0](https://pubmed.ncbi.nlm.nih.gov/30785653/).

### Licensing Information

The dataset is released under the [Apache-2.0 License](http://www.apache.org/licenses/LICENSE-2.0).

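The record layout described under Data Fields (one integer label in the range 0 to 7 per residue) can be checked with a short sketch like the one below. The helper names `load_ssp_q8` and `label_counts` and the toy record are hypothetical and introduced only for illustration. Loading assumes the Hugging Face `datasets` package is installed and that the repository ID shown on this page is reachable. The length check assumes, as the field description suggests, that `seq` and `label` align position by position.

```python
from collections import Counter

NUM_CLASSES = 8  # Q8: labels appear as integers 0..7 in the card's statistics table


def load_ssp_q8(repo_id="biomap-research/ssp_q8"):
    """Fetch the train/test splits from the Hub (requires the `datasets` package)."""
    from datasets import load_dataset  # lazy import so the helper below works offline
    return load_dataset(repo_id)


def label_counts(record):
    """Count residues per integer label in a single {'seq', 'label'} record."""
    seq, labels = record["seq"], record["label"]
    # Assumption: one label per residue, so both fields must have equal length.
    if len(seq) != len(labels):
        raise ValueError(f"length mismatch: {len(seq)} residues, {len(labels)} labels")
    counts = Counter(labels)
    return [counts.get(k, 0) for k in range(NUM_CLASSES)]


# Toy record made up for illustration; it is not taken from the dataset.
toy = {"seq": "GKITFY", "label": [7, 4, 4, 4, 6, 7]}
print(label_counts(toy))  # [0, 0, 0, 0, 3, 0, 1, 2]
```

Averaging the output of `label_counts` over every record in a split would give per-label means comparable to the table above. The card does not state which integer corresponds to which structure letter, so the sketch leaves the labels as integers.
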
### Citation

If you find our work useful, please consider citing the following paper:

```
@misc{chen2024xtrimopglm,
  title={xTrimoPGLM: unified 100B-scale pre-trained transformer for deciphering the language of protein},
  author={Chen, Bo and Cheng, Xingyi and Li, Pan and Geng, Yangli-ao and Gong, Jing and Li, Shen and Bei, Zhilei and Tan, Xu and Wang, Boyan and Zeng, Xin and others},
  year={2024},
  eprint={2401.06199},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  note={arXiv preprint arXiv:2401.06199}
}
```

Provider:

biomap-research