vlccek/MT_dataset
Source: Hugging Face. Updated 2026-04-15; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/vlccek/MT_dataset
Description:
# Protein Mutation Fitness Dataset
This dataset contains protein sequences and their corresponding fitness scores, along with structural classifications (CATH).
## Files
- `train.parquet`: Training set containing 1,560,811 samples.
- `validation.parquet`: Validation set containing 178,050 samples.
## Column Descriptions
- `wt_sequence`: Wild-type protein sequence.
- `mut_sequence`: Mutated protein sequence.
- `mutation`: Specific mutation details.
- `fitness`: Mutation impact score (+1 = maximally stabilizing, -1 = maximally destabilizing).
- `cath_*`: Structural classification based on the CATH database (Class, Architecture, Topology, Homology).
- `data_source`: Origin of the data.
- `reverse`: Indicator for reverse mutation data.
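
As a quick sanity check on the columns listed above, the following is a minimal sketch rather than an official validation script. The helper name `check_columns` is hypothetical, and the assumption that `fitness` lies within [-1, 1] is inferred from the column description, not confirmed by the dataset authors. The exact names of the `cath_*` columns are not listed here, so the sketch discovers them by prefix.

```python
import pandas as pd


def check_columns(df: pd.DataFrame) -> dict:
    """Report missing documented columns, the cath_* columns present,
    and whether all non-null fitness values fall within [-1, 1]."""
    # Column names taken from the descriptions above.
    expected = ["wt_sequence", "mut_sequence", "mutation",
                "fitness", "data_source", "reverse"]
    missing = [c for c in expected if c not in df.columns]

    # The card only gives the prefix, so collect whatever matches it.
    cath_columns = [c for c in df.columns if c.startswith("cath_")]

    # Assumption: fitness is bounded by the -1 / +1 extremes in the card.
    fitness_in_range = None
    if "fitness" in df.columns:
        values = df["fitness"].dropna()
        fitness_in_range = bool(values.between(-1.0, 1.0).all())

    return {
        "missing": missing,
        "cath_columns": cath_columns,
        "fitness_in_range": fitness_in_range,
    }
```

Calling `check_columns(df)` on a loaded split returns a small dictionary that can be inspected before training. A `False` under `fitness_in_range` would indicate that the bounded-range assumption does not hold for that split.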
## Usage
The files are in Apache Parquet format and can be loaded with pandas:
```python
import pandas as pd
df = pd.read_parquet('train.parquet')
```
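
Once a split is loaded, a per-source summary of the fitness scores is a common first look at the data. The sketch below is illustrative only: `fitness_by_source` is a hypothetical helper, and it assumes `data_source` is a categorical or string column and `fitness` is numeric, which the card implies but does not state.

```python
import pandas as pd


def fitness_by_source(df: pd.DataFrame) -> pd.DataFrame:
    """Return the sample count and mean fitness for each data_source value."""
    return df.groupby("data_source")["fitness"].agg(["count", "mean"])


# Usage, assuming train.parquet sits in the working directory.
# Reading only the needed columns keeps memory use down on the
# 1,560,811-row training split:
# train = pd.read_parquet("train.parquet", columns=["data_source", "fitness"])
# print(fitness_by_source(train))
```

The `columns` argument of `pd.read_parquet` restricts which columns are read from the file, which helps when the full sequence columns are not needed.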
Provider:
vlccek



