AI used to diagnose and treat genetic diseases.
Source: DataCite Commons. Updated 2025-05-01; indexed 2025-04-16.
Download link:
https://data.mendeley.com/datasets/f63nhynzhx
Resource description:
Step 2: Data Collection & Preprocessing
We will need genetic datasets such as:
• 1000 Genomes Project (for genetic variants)
• ClinVar (for pathogenic mutations)
• GTEx (for gene expression)
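As a rough illustration of what reading one of these sources could involve, the sketch below parses a single tab-separated VCF-style record and pulls a clinical-significance tag out of its INFO column. The record is a made-up placeholder, not real ClinVar data, and the helper name `parse_vcf_line` is hypothetical. It assumes the eight fixed VCF columns and a `CLNSIG` key in INFO, as used in ClinVar's VCF releases.

```python
# Standard fixed columns of a VCF data line.
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_vcf_line(line):
    """Split one VCF data line into a dict; INFO becomes a nested dict."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(VCF_COLUMNS, fields[:8]))
    record["POS"] = int(record["POS"])
    info = {}
    for item in record["INFO"].split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        info[key] = value if sep else True  # flag entries carry no value
    record["INFO"] = info
    return record


# Placeholder record for illustration only (not a real variant).
example_line = "1\t12345\t.\tG\tA\t.\t.\tCLNSIG=Pathogenic;EXAMPLE_FLAG"
record = parse_vcf_line(example_line)
is_pathogenic = record["INFO"].get("CLNSIG") == "Pathogenic"
print(record["CHROM"], record["POS"], record["REF"], record["ALT"], is_pathogenic)
```

Real files also contain header lines starting with `#`, which a loader would need to skip before calling a parser like this.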
Python Code for Data Loading and Preprocessing
Generate a Synthetic Genetic Dataset
This dataset will include:
• Gene Mutations (Encoded as numerical values)
• Expression Levels (Simulating gene expression data)
• Mutation Type (Categorical: Missense, Nonsense, Frameshift)
• Disease Labels (Binary classification: 0 = No Disease, 1 = Genetic Disease)
import pandas as pd
import numpy as np
# Set random seed for reproducibility
np.random.seed(42)
# Generate synthetic data
num_samples = 1000
gene_mutations = np.random.randint(0, 10, num_samples) # Integer mutation codes 0-9 (10 distinct values)
expression_levels = np.random.uniform(0.1, 10.0, num_samples) # Simulated expression levels
mutation_types = np.random.choice(["Missense", "Nonsense", "Frameshift"], num_samples)
disease_labels = np.random.choice([0, 1], num_samples) # 0 = No Disease, 1 = Disease
# Create DataFrame
df = pd.DataFrame({
"Gene_Mutation": gene_mutations,
"Expression_Level": expression_levels,
"Mutation_Type": mutation_types,
"Disease_Label": disease_labels
})
# Save to CSV
df.to_csv("genetic_data.csv", index=False)
print("Synthetic genetic dataset saved as 'genetic_data.csv'.")
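Because `disease_labels` is drawn independently of the other columns, the features should carry no real signal about the label, so a classifier trained on this file would be expected to perform near chance. The following is a minimal check of that, assuming the same generation code as above. The variable names `rate_by_type` and `expr_corr` are illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild the synthetic dataset with the same seed and draw order.
np.random.seed(42)
num_samples = 1000
df = pd.DataFrame({
    "Gene_Mutation": np.random.randint(0, 10, num_samples),
    "Expression_Level": np.random.uniform(0.1, 10.0, num_samples),
    "Mutation_Type": np.random.choice(["Missense", "Nonsense", "Frameshift"], num_samples),
    "Disease_Label": np.random.choice([0, 1], num_samples),
})

# Fraction of disease-positive samples within each mutation type.
rate_by_type = df.groupby("Mutation_Type")["Disease_Label"].mean()

# Linear correlation between expression level and the label.
expr_corr = df["Expression_Level"].corr(df["Disease_Label"])

print(rate_by_type)
print(f"Expression/label correlation: {expr_corr:.3f}")
```

Rates close to 0.5 in every group and a correlation close to zero would be consistent with labels that are independent of the features.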
First five rows of the dataset:
   Gene_Mutation  Expression_Level Mutation_Type  Disease_Label
0              6          2.634554      Missense              0
1              3          7.288346      Missense              1
2              7          5.970333    Frameshift              0
3              4          1.111905    Frameshift              1
4              6          9.195630      Missense              0
Dataset summary (df.info() output):
RangeIndex: 1000 entries, 0 to 999
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   Gene_Mutation     1000 non-null   int64
 1   Expression_Level  1000 non-null   float64
 2   Mutation_Type     1000 non-null   object
 3   Disease_Label     1000 non-null   int64
dtypes: float64(1), int64(2), object(1)
memory usage: 31.4+ KB
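The source does not include the code for the encoding step. The next preview shows Mutation_Type as integers, with Missense as 1 and Frameshift as 0, which matches numbering the categories in alphabetical order. The sketch below reproduces that scheme with plain pandas on a small stand-in frame. The first five values are copied from the preview, the sixth ("Nonsense") is added only so all three categories appear, and `code_map` is an illustrative name. scikit-learn's `LabelEncoder` would give the same codes, though the tool the author used is not stated.

```python
import pandas as pd

# Stand-in frame: first five values from the preview, plus one "Nonsense" row
# added for illustration so every category is present.
df = pd.DataFrame({
    "Mutation_Type": ["Missense", "Missense", "Frameshift", "Frameshift", "Missense", "Nonsense"]
})

# Assumed scheme: sort category names alphabetically, then number them from 0.
categories = sorted(df["Mutation_Type"].unique())
code_map = {name: code for code, name in enumerate(categories)}

# Replace each category name with its integer code.
df["Mutation_Type"] = df["Mutation_Type"].map(code_map)

print(code_map)
print(df["Mutation_Type"].tolist())
```

Integer codes like these imply an ordering that the mutation categories do not have, so one-hot encoding may be a better fit for models that are sensitive to it.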
First five rows with Mutation_Type encoded as integers:
   Gene_Mutation  Expression_Level  Mutation_Type  Disease_Label
0              6          2.634554              1              0
1              3          7.288346              1              1
2              7          5.970333              0              0
3              4          1.111905              0              1
4              6          9.195630              1              0
Dataset summary after encoding (df.info() output):
RangeIndex: 1000 entries, 0 to 999
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   Gene_Mutation     1000 non-null   int64
 1   Expression_Level  1000 non-null   float64
 2   Mutation_Type     1000 non-null   int32
 3   Disease_Label     1000 non-null   int64
Provider:
Mendeley Data
Created:
2025-02-24



