scbirlab/evo-pharm-atlas-1
Source: Hugging Face. Updated 2026-05-09; indexed 2026-03-29.
Download link: https://hf-mirror.com/datasets/scbirlab/evo-pharm-atlas-1
---
license: cc-by-4.0
tags:
- biology
- chemistry
- antibiotics
pretty_name: Evolutionary Pharmacology Atlas 1
size_categories:
- 100k<n<1M
configs:
- config_name: conservation
  default: true
  data_files: "orthology/_all.rbh.tsv.gz"
  sep: "\t"
- config_name: canonical-target-info
  data_files: "umap/_all.umap-targets-taxon.csv.gz"
  sep: ","
- config_name: species-info
  data_files: "orthology/_all.rbh_m.coldata.tsv.gz"
  sep: "\t"
- config_name: inhibitors
  data_files: "inhibitors/all.tsv.gz"
  sep: "\t"
---
# **evo-pharm-atlas-1**
Calculated conservation of 10,584 protein drug targets from ChEMBL v36 across 8,207 bacterial proteomes from UniProt KB.
## Dataset Details
### Dataset Description
- **Curated by:** [@eachanjohnson](https://huggingface.co/eachanjohnson)
- **Funded by:** The Francis Crick Institute; UKRI
- **License:** CC-BY-4.0
### Dataset Sources
<!-- Provide the basic links for the dataset. -->
<!-- - **Repository:** https://doi.org/10.5281/zenodo.8136904 -->
- **Paper:** https://doi.org/10.1038/s41589-023-01349-8
<!-- - **Demo [optional]:** [More Information Needed] -->
## Dataset Structure
The data are separated into four configs.
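As a minimal sketch of how the four configs map onto files, the snippet below copies the file paths and separators from the YAML header of this card and reads a config with the Python standard library only. The `iter_rows` helper and its `root` argument (a local directory holding a downloaded copy of the repository) are hypothetical names introduced for illustration.

```python
import csv
import gzip
from pathlib import Path

# Config name -> (relative data file, field separator), copied from the YAML header.
CONFIGS = {
    "conservation": ("orthology/_all.rbh.tsv.gz", "\t"),
    "canonical-target-info": ("umap/_all.umap-targets-taxon.csv.gz", ","),
    "species-info": ("orthology/_all.rbh_m.coldata.tsv.gz", "\t"),
    "inhibitors": ("inhibitors/all.tsv.gz", "\t"),
}


def iter_rows(config_name, root="."):
    """Yield each row of a config as a dict of strings.

    `root` is assumed to be a local copy of the dataset repository.
    """
    rel_path, sep = CONFIGS[config_name]
    with gzip.open(Path(root) / rel_path, "rt", newline="") as handle:
        yield from csv.DictReader(handle, delimiter=sep)
```

Because the rows are streamed one at a time, this approach avoids holding the larger tables in memory; all values arrive as strings and need converting before numeric comparisons.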
### conservation
Conservation of canonical targets across bacterial proteomes.
- **target_taxon_id**: NCBI taxonomy ID of source species of canonical drug target
- **target_taxon_l1**: Approximate taxonomic kingdom of canonical drug target (from ChEMBL)
- **target_taxon_l2**: Approximate taxonomic class of canonical drug target (from ChEMBL)
- **target_taxon_l3**: Approximate taxonomic family of canonical drug target (from ChEMBL)
- **target_uniprot_id**: UniProt KB ID of canonical drug target protein
- **ortholog_uniprot_id**: UniProt KB ID of ortholog protein
- **target_accession**: Full UniProt accession ID of canonical drug target protein
- **ortholog_accession**: Full UniProt accession ID of ortholog protein
- **target_length**: Number of amino acids in canonical drug target
- **ortholog_length**: Number of amino acids in ortholog protein
- **alignment_length**: Length of BLASTp alignment (from Diamond)
- **target_ortholog_identity**: Amino acid identity between drug target and ortholog (from Diamond)
- **gap_openings**: Number of gap openings in alignment (from Diamond)
- **mismatches**: Number of mismatches in alignment (from Diamond)
- **e_value**: Expectation-value of alignment (from Diamond)
- **bit_score**: Score of alignment (from Diamond)
- **target_ortholog_coverage**: Alignment length divided by query length
- **target_organism_name**: Source species name of canonical drug target (from ChEMBL)
- **target_is_species_group**: Flag whether target is annotated as being from a group of species, e.g. "Bacteria" (from ChEMBL)
- **target_gene_symbol**: Gene symbol of canonical drug target (from ChEMBL)
- **target_ec_number**: Enzyme Commission number of canonical drug target (from ChEMBL)
- **target_go_process_id**: Semicolon-separated list of Gene Ontology process IDs of canonical drug target (from ChEMBL)
- **target_go_process_name**: Semicolon-separated list of Gene Ontology process names of canonical drug target (from ChEMBL)
- **target_chembl_id**: Canonical drug target ChEMBL ID
- **target_name**: Canonical drug target name
- **pLI**: For human targets, Probability of Loss-of-function Intolerance (from gnomAD)
- **LOEUF**: For human targets, Loss-of-function Observed/Expected Upper bound Fraction (from gnomAD)
- **ortholog_target_name**: Ortholog protein name
- **ortholog_target_locus**: Ortholog gene locus tag
- **ortholog_taxon_id**: NCBI taxonomy ID of source species of ortholog protein
- **target_is_human**: Flag whether canonical drug target is human protein
- **target_is_bacteria**: Flag whether canonical drug target is bacterial protein
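The columns above are enough to select confident orthologs. A minimal sketch, assuming rows are dicts of strings (as a CSV reader would return them): the function name and both thresholds are hypothetical, and the thresholds must be on the same scale as the stored values, so check whether identity is recorded as a percentage or a fraction before choosing them.

```python
def filter_orthologs(rows, min_identity, min_coverage):
    """Keep rows whose identity and coverage both reach the given thresholds.

    Thresholds are illustrative and must match the scale used in the data.
    """
    kept = []
    for row in rows:
        identity = float(row["target_ortholog_identity"])
        coverage = float(row["target_ortholog_coverage"])
        if identity >= min_identity and coverage >= min_coverage:
            kept.append(row)
    return kept


# Synthetic rows for illustration only; these values are not from the dataset.
example_rows = [
    {"target_uniprot_id": "T1", "ortholog_uniprot_id": "O1",
     "target_ortholog_identity": "62.5", "target_ortholog_coverage": "0.95"},
    {"target_uniprot_id": "T1", "ortholog_uniprot_id": "O2",
     "target_ortholog_identity": "31.0", "target_ortholog_coverage": "0.40"},
]
strong_hits = filter_orthologs(example_rows, min_identity=50.0, min_coverage=0.8)
```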
### canonical-target-info
Information on canonical drug targets and their aggregate conservation across bacterial proteomes.
- **target_uniprot_id**: UniProt KB ID of canonical drug target protein
- **target_taxon_id**: NCBI taxonomy ID of source species of canonical drug target
- **target_taxon_l1**: Approximate taxonomic kingdom of canonical drug target (from ChEMBL)
- **target_taxon_l2**: Approximate taxonomic class of canonical drug target (from ChEMBL)
- **target_taxon_l3**: Approximate taxonomic family of canonical drug target (from ChEMBL)
- **target_accession**: Full UniProt accession ID of canonical drug target protein
- **target_length**: Number of amino acids in canonical drug target
- **target_organism_name**: Source species name of canonical drug target (from ChEMBL)
- **target_is_species_group**: Flag whether target is annotated as being from a group of species, e.g. "Bacteria" (from ChEMBL)
- **target_gene_symbol**: Gene symbol of canonical drug target (from ChEMBL)
- **target_ec_number**: Enzyme Commission number of canonical drug target (from ChEMBL)
- **target_go_process_id**: Semicolon-separated list of Gene Ontology process IDs of canonical drug target (from ChEMBL)
- **target_go_process_name**: Semicolon-separated list of Gene Ontology process names of canonical drug target (from ChEMBL)
- **target_chembl_id**: Canonical drug target ChEMBL ID
- **target_name**: Canonical drug target name
- **target_is_human**: Flag whether canonical drug target is human protein
- **target_is_bacteria**: Flag whether canonical drug target is bacterial protein
- **entropy**: Shannon entropy of target amino acid identity across all bacterial strains
- **sparsity**: Sparsity index of target amino acid identity across all bacterial strains
- **mean_conservation**: Mean target amino acid identity across all bacterial strains
- **median_conservation**: Median target amino acid identity across all bacterial strains
- **UMAP ...**: Target x bacteria amino acid identity matrix projected into two dimensions using UMAP.
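To give a feel for the aggregate columns, here is a rough sketch of summarising one target's identity values across strains. The card does not state how `entropy` or `sparsity` were computed, so the entropy below rests on a loud assumption: identities are normalised to sum to 1 and treated as a probability distribution. The function name is hypothetical, and the result may not match the values stored in the dataset.

```python
import math
import statistics


def summarize_identity(identities):
    """Summarise one target's identity values across strains.

    ASSUMPTION: entropy is computed on identities normalised to sum to 1.
    This is an illustration, not the documented method behind the dataset.
    """
    values = [float(v) for v in identities]
    total = sum(values)
    # Zero identities contribute nothing to the entropy sum.
    probs = [v / total for v in values if v > 0] if total > 0 else []
    entropy = -sum(p * math.log2(p) for p in probs)
    return {
        "mean_conservation": statistics.mean(values),
        "median_conservation": statistics.median(values),
        "entropy": entropy,
    }
```

Under this assumption, a target conserved equally across many strains scores high entropy, while one found in only a few strains scores low.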
### species-info
Information on all bacterial strains in the dataset.
- **ortholog_taxon_id**: NCBI taxonomy ID of bacterial strain
- **ortholog_domain**: Domain of bacterial strain
- **ortholog_kingdom**: Kingdom of bacterial strain
- **ortholog_phylum**: Phylum of bacterial strain
- **ortholog_class**: Class of bacterial strain
- **ortholog_order**: Order of bacterial strain
- **ortholog_family**: Family of bacterial strain
- **ortholog_genus**: Genus of bacterial strain
- **ortholog_species**: Species of bacterial strain
- **ortholog_subspecies**: Subspecies of bacterial strain
- **ortholog_strain**: Bacterial strain
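The `ortholog_taxon_id` column is shared between the `conservation` and `species-info` configs, so the two can be joined on it. A minimal sketch, assuming both tables are available as lists of dicts; the function name and the `"unknown"` fallback label are illustrative choices.

```python
from collections import Counter


def orthologs_per_phylum(conservation_rows, species_rows):
    """Count conservation rows per phylum by joining on ortholog_taxon_id."""
    phylum_by_taxon = {
        row["ortholog_taxon_id"]: row["ortholog_phylum"] for row in species_rows
    }
    counts = Counter()
    for row in conservation_rows:
        # Rows with no matching species entry are grouped under "unknown".
        phylum = phylum_by_taxon.get(row["ortholog_taxon_id"], "unknown")
        counts[phylum] += 1
    return counts
```

The same lookup-table pattern works for any other taxonomic rank in `species-info`, such as `ortholog_genus`.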
### inhibitors
Information on ChEMBL inhibitors that engage conserved drug targets.
- **molecule_chembl_id**: Molecule ChEMBL ID of chemical
- **molecule_name**: Name of chemical (from ChEMBL)
- **molecule_smiles**: SMILES of chemical (from ChEMBL)
- **molecule_inchikey**: InChI Key of chemical (from ChEMBL)
- **is_oral**: Is an orally administered drug (from ChEMBL)
- **is_topical**: Is a topical drug (from ChEMBL)
- **is_parenteral**: Is a parenterally administered drug (from ChEMBL)
- **is_orphan**: Is orphan drug (from ChEMBL)
- **is_natural_product**: Is a natural product (from ChEMBL)
- **is_chemical_probe**: Is a chemical probe (from ChEMBL)
- **has_black_box**: For drugs, has an FDA black box warning (from ChEMBL)
- **max_phase**: Maximum clinical phase (from ChEMBL)
- **molecule_chembl_id_2**: ChEMBL ID of chemical (from UniChem, redundant)
- **chembl_url**: URL to ChEMBL web page for chemical (from UniChem)
- **pubchem_id**: PubChem ID of chemical (from UniChem)
- **pubchem_url**: PubChem URL of chemical (from UniChem)
- **drugbank_id**: For drugs, DrugBank ID (from UniChem)
- **drugbank_url**: For drugs, DrugBank URL (from UniChem)
- **vendor_...**: Vendor catalog numbers (from UniChem)
- **..._url**: Vendor URLs (from UniChem)
- **target_taxon_id**: NCBI taxonomy ID of source species of canonical drug target
- **target_organism_name**: Source species name of canonical drug target (from ChEMBL)
- **target_chembl_id**: Target ChEMBL ID of canonical drug target
- **target_name**: Name of canonical drug target
- **assay_chembl_id**: Assay ChEMBL ID for assay indicating interaction of chemical with target
- **assay_type**: F = functional; B = binding
- **assay_target_confidence_score**: ChEMBL confidence in accurate target annotation
- **molecule_target_pchembl**: Quantitative assay outcome (usually -log10(IC50 in molar)).
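Since `molecule_target_pchembl` is a negative log10 of a molar quantity, it can be converted back to a concentration and used as a potency filter. A minimal sketch, assuming rows are dicts of strings and that missing values appear as empty strings (an assumption about the file, not a documented fact); the function names and the default threshold are hypothetical.

```python
def pchembl_to_nanomolar(pchembl):
    """Convert a pChEMBL value (-log10 of molar) to nanomolar."""
    # 10**(-p) molar equals 10**(9 - p) nanomolar.
    return 10 ** (9 - float(pchembl))


def potent_inhibitors(rows, min_pchembl=6.0, assay_type=None):
    """Keep rows at or above `min_pchembl`, optionally restricted to one assay type.

    `assay_type` is "F" (functional) or "B" (binding); None keeps both.
    """
    kept = []
    for row in rows:
        value = row.get("molecule_target_pchembl", "")
        if value in ("", None):
            continue  # Skip rows without a quantitative outcome.
        if assay_type is not None and row.get("assay_type") != assay_type:
            continue
        if float(value) >= min_pchembl:
            kept.append(row)
    return kept
```

For example, a pChEMBL of 6 corresponds to 1000 nM (1 µM), so raising `min_pchembl` by one unit tightens the potency cut-off tenfold.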
## Dataset Creation
### Curation Rationale
To identify bacterial orthologous targets of existing drugs for rapid and rational repurposing.
#### Data Collection and Processing
Data were generated using the [Repurposing by Orthology (RepOrt) pipeline](https://github.com/scbirlab/nf-report), which uses [Diamond](https://github.com/bbuchfink/diamond) for high-throughput BLASTP, and the ChEMBL and UniChem APIs.
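The data file names contain `rbh`, which conventionally abbreviates "reciprocal best hit". Whether RepOrt assigns orthologs exactly this way is an assumption, not something this card states. As a generic illustration of the technique, the sketch below pairs a target with an ortholog only when each is the other's top-scoring hit; the function names and the `(query, subject, bit_score)` tuple layout are hypothetical.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    `hits` is an iterable of (query, subject, bit_score) tuples.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(forward_hits, reverse_hits):
    """Return sorted (target, ortholog) pairs that are each other's best hit.

    `forward_hits` come from searching targets against a proteome,
    `reverse_hits` from searching that proteome back against the targets.
    """
    forward = best_hits(forward_hits)
    reverse = best_hits(reverse_hits)
    return sorted(
        (target, ortholog)
        for target, ortholog in forward.items()
        if reverse.get(ortholog) == target
    )
```

Requiring agreement in both directions discards one-way matches, for instance a second target whose best hit is a protein that itself matches a different target better.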
#### Personal and Sensitive Information
None.
<!-- ## Bias, Risks, and Limitations -->
<!-- This section is meant to convey both technical and sociotechnical limitations. -->
<!-- [More Information Needed] -->
<!-- ### Recommendations -->
<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
<!-- Users should be made aware of the risks, biases and limitations of the dataset. More information needed for further recommendations. -->
<!-- ## Citation
**BibTeX:**
```
@article{}
```
**APA:**
> -->
## Dataset Card Contact
[@eachanjohnson](https://huggingface.co/eachanjohnson)
Provider: scbirlab



