
ayates/amr_portal

Source: Hugging Face · Updated 2025-12-08 · Indexed 2025-12-20
Download link:
https://hf-mirror.com/datasets/ayates/amr_portal
Description:
---
license: cc-by-4.0
language:
- en
tags:
- biology
pretty_name: AMR portal
size_categories:
- 1M<n<10M
configs:
- config_name: genotype
  data_files:
  - split: train
    path: genotype.parquet
- config_name: phenotype
  data_files:
  - split: train
    path: phenotype.parquet
---

# AMR Portal: Multi-Dataset Release (Phenotype + Genotype)

This repository contains multiple datasets from the EMBL-EBI AMR Portal, distributed in Apache Parquet format:

- `phenotype.parquet`: phenotypic antimicrobial susceptibility data
- `genotype.parquet`: AMR genes and mutations from _in silico_ methods

All datasets are released under CC-BY-4.0.

Source documentation: <https://www.ebi.ac.uk/amr/developers/>

## Dataset Summary

This dataset contains **phenotypic antimicrobial resistance (AMR) data** derived from antibiograms submitted to BioSamples from the CABBAGE data set. It provides curated experimental measurements such as MIC values, AMR gene annotation, and metadata describing geography, organism taxonomy, and antibiotics.

The dataset is stored in **Apache Parquet**. This release corresponds to:

**FTP location:** `https://ftp.ebi.ac.uk/pub/databases/amr_portal/releases/2025-11/`

**Portal documentation:** <https://www.ebi.ac.uk/amr/developers/>

---

## Dataset Contents

This dataset represents AMR **phenotypic experimental outcomes** collected from antibiograms. Each row corresponds to an individual antibiotic test performed on an isolate.

Typical use cases include:

- AMR prediction
- Taxonomic/epidemiological AMR analysis
- Linking phenotypes to genomic or genotypic data

---

## Data Schema

### `phenotype.parquet`: AMR phenotypes

| Field | Type | Nullable | Description |
|:------|:-----|:---------|:------------|
| BioSample_ID | `string` | No | The unique identifier for the biological sample (e.g. SAMEA1028830) |
| SRA_accession | `string` | Yes | The SRA accession number |
| assembly_ID | `string` | Yes | The unique accession number of the genome assembly (e.g. GCA_001096525.1) |
| collection_year | `int32` | Yes | The year the sample was collected |
| ISO_country_code | `string` | Yes | The 3-letter ISO country code where the sample was collected (e.g. THA for Thailand) |
| host | `string` | Yes | The organism the sample was isolated from (e.g. Homo sapiens) |
| host_age | `string` | Yes | The age of the host (empty/NULL in the sample data, but should be a string to allow for various formats or NULLs) |
| host_sex | `string` | Yes | The sex of the host (empty/NULL in the sample data) |
| isolate | `string` | Yes | A unique identifier for the specific isolate (e.g. SMRU2695) |
| isolation_source | `string` | Yes | The specific anatomical source or environment the isolate came from (e.g. nasopharynx) |
| isolation_source_category | `string` | Yes | The general category of the isolation source (e.g. respiratory tract) |
| isolation_latitude | `string` | Yes | Geographic latitude for the sample |
| isolation_longitude | `string` | Yes | Geographic longitude for the sample |
| genus | `string` | No | The genus of the organism (e.g. Streptococcus) |
| organism | `string` | No | The full name of the organism (e.g. Streptococcus pneumoniae) |
| AMR_associated_publications | `string` | Yes | The PubMed ID of the publication associated with the data. Can be a set of values joined with a `;` |
| Updated_phenotype_CLSI | `string` | Yes | The updated antimicrobial susceptibility testing (AST) phenotype based on CLSI standards (empty/NULL in the sample data) |
| Updated_phenotype_EUCAST | `string` | Yes | The updated AST phenotype based on EUCAST standards (empty/NULL in the sample data) |
| used_ECOFF | `string` | Yes | Indicates if the Epidemiological Cut-Off (ECOFF) was used (empty/NULL in the sample data) |
| database | `string` | Yes | Database of annotation |
| antibiotic_name | `string` | Yes | The name of the antibiotic tested (e.g. beta-lactams, trimethoprim-sulfamethoxazole) |
| ast_standard | `string` | Yes | The standard or guideline used for Antimicrobial Susceptibility Testing (e.g. CLSI, EUCAST) |
| laboratory_typing_method | `string` | Yes | The method used to test the antibiotic sensitivity (e.g. disk diffusion, E-test) |
| measurement | `string` | Yes | The raw measurement value, typically MIC or zone size (e.g. 2, 1, 0.5, 12/0.125) |
| measurement_sign | `string` | Yes | The sign indicating the nature of the measurement (e.g. '==' for exact value, or '>', '<') |
| measurement_units | `string` | Yes | The units for the measurement (e.g. mg/l) |
| platform | `string` | Yes | The platform used for analysis (empty/NULL in the sample data) |
| resistance_phenotype | `string` | Yes | The final result of the interpretation (e.g. susceptible, non-susceptible, resistant) |
| species | `string` | No | The species of the organism (e.g. Streptococcus pneumoniae) |
| antibiotic_ontology | `string` | Yes | An ontology ID for the antibiotic (e.g. ARO_3004024) |
| antibiotic_ontology_link | `string` | Yes | Link to the ontology resource for the ID |
| country | `string` | Yes | Full country name where the sample was collected from. Converted from `ISO_country_code`. |
| geographical_region | `string` | Yes | Geographical region as defined by UN M49 (e.g. Asia, Europe, Oceania, Africa or Americas) |
| geographical_subregion | `string` | Yes | Geographical subregion as defined by UN M49 (e.g. Eastern Asia, Northern Europe) |

### `genotype.parquet`: AMR genotypes

| Field | Type | Nullable | Description |
|:------|:-----|:---------|:------------|
| BioSample_ID | `string` | No | The unique identifier for the biological sample (e.g., SAMEA1028830) |
| assembly_ID | `string` | No | The unique accession number of genome assembly (e.g., GCA_001096525.1) |
| genus | `string` | No | The genus of the organism (e.g., Streptococcus) |
| species | `string` | No | The species name of the organism |
| organism | `string` | No | The full name of the organism (e.g., Streptococcus pneumoniae) |
| isolate | `string` | Yes | Isolate information |
| taxon_id | `int64` | No | NCBI Taxonomy identifier of the organism |
| region | `string` | No | Name of a genomic region |
| region_start | `int64` | No | Start of the annotated gene |
| region_end | `int64` | No | End of the annotated gene |
| strand | `string` | No | Strand of the annotated gene. '+' indicates the forward strand, '-' indicates the reverse strand |
| _bin | `int64` | No | UCSC bin number for the genomic region. See [UCSC's wiki](https://genomewiki.ucsc.edu/index.php/Bin_indexing_system) for further details. Internal field |
| id | `string` | No | Identifier of the gene |
| gene_symbol | `string` | No | Symbol of the gene |
| amr_element_symbol | `string` | No | AMRFinderPlus assigned symbol for the AMR element |
| element_type | `string` | No | Broad type of AMR element. Normally set to `AMR` |
| element_subtype | `string` | No | Subtype of AMR element. Normally set to `AMR` |
| class | `string` | No | Overall class of AMR compound as given by AMRFinderPlus. Normally a broad representation of antibiotics |
| subclass | `string` | No | Subclass of AMR compound as given by AMRFinderPlus. Can also be set to the same as class |
| split_subclass | `string` | No | Subclass can represent multiple individual compounds separated by a '/'. This field contains the individual element of subclass. |
| antibiotic_name | `string` | Yes | Normalised name of the antibiotic tested (e.g., beta-lactams, trimethoprim-sulfamethoxazole) |
| antibiotic_ontology | `string` | Yes | Ontology ID for the antibiotic (e.g., ARO_3004024) |
| antibiotic_ontology_link | `string` | Yes | Link to ontology entry for the antibiotic |
| evidence_accession | `string` | Yes | Accession number for evidence supporting the predicted AMR resistance |
| evidence_type | `string` | Yes | Type of evidence supporting the predicted AMR resistance |
| evidence_link | `string` | Yes | Link to the evidence supporting the predicted AMR resistance |
| evidence_description | `string` | Yes | Evidence description supporting the predicted AMR resistance |

---

## How to Load the Dataset

Load phenotype data

```python
from datasets import load_dataset

phenotype = load_dataset(
    "ayates/amr_portal",
    data_files="phenotype.parquet",
    split="train"
)
```

Load genotype data

```python
genotype = load_dataset(
    "ayates/amr_portal",
    data_files="genotype.parquet",
    split="train"
)
```

Load everything at once

```python
ds = load_dataset(
    "ayates/amr_portal",
    data_files={
        "phenotype": "phenotype.parquet",
        "genotype": "genotype.parquet"
    }
)
```

You can then access:

```python
ds["phenotype"]
ds["genotype"]
```

## License

Creative Commons Attribution 4.0 (CC-BY-4.0)

<https://creativecommons.org/licenses/by/4.0/>

## Citation

Please cite Dickens E _et al._ 2025 [10.1101/2025.11.12.688105](https://doi.org/10.1101/2025.11.12.688105).

```txt
@article {Dickens2025.11.12.688105,
  author = {Dickens, Emily and Derelle, Romain and Beardmore, Robert and Suresh, Anita and Uplekar, Swapna and Yates, Andrew D and Keatley, Jon and Winterbottom, Andrea and Azov, Andrey G and El Houdaigui, Bilal and Ochkalova, Sofiia and Gurbich, Tatiana A and Shivalikanjli, Anu and Yordanova, Galabina and Lees, John A and Chindelevitch, Leonid},
  title = {A comprehensive AMR genotype-phenotype database (CABBAGE)},
  elocation-id = {2025.11.12.688105},
  year = {2025},
  doi = {10.1101/2025.11.12.688105},
  publisher = {Cold Spring Harbor Laboratory},
  URL = {https://www.biorxiv.org/content/early/2025/11/13/2025.11.12.688105},
  eprint = {https://www.biorxiv.org/content/early/2025/11/13/2025.11.12.688105.full.pdf},
  journal = {bioRxiv}
}
```
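
## Example: Linking Phenotypes to Genotypes

One of the listed use cases is linking phenotypes to genotypic data. Both tables carry a non-nullable `BioSample_ID`, so a join on that field is a natural starting point. The block below is a minimal sketch under stated assumptions, not the portal's own method: the helper names `parse_measurement` and `link_by_biosample` are hypothetical, the sample rows are invented for illustration, and a `/` in `measurement` (as in `12/0.125`) is assumed to separate the components of a combined result. It uses only the Python standard library and treats each row as a plain dict, which is how rows appear when iterating over a loaded `datasets` split.

```python
from collections import defaultdict


def parse_measurement(value):
    """Split a raw `measurement` string into a tuple of floats.

    Assumption: '/' separates the components of a combined result
    (e.g. '12/0.125'). Returns None for empty or non-numeric values.
    """
    if value is None or not value.strip():
        return None
    try:
        return tuple(float(part) for part in value.split("/"))
    except ValueError:
        return None


def link_by_biosample(phenotype_rows, genotype_rows):
    """Attach the AMR element symbols found for each sample to its phenotype rows."""
    # Index genotype rows by sample so each phenotype row needs one lookup.
    symbols = defaultdict(set)
    for row in genotype_rows:
        symbols[row["BioSample_ID"]].add(row["amr_element_symbol"])

    linked = []
    for row in phenotype_rows:
        linked.append({
            **row,
            "measurement_values": parse_measurement(row.get("measurement")),
            "amr_elements": sorted(symbols.get(row["BioSample_ID"], ())),
        })
    return linked


# Hypothetical rows for illustration only; these values are not from the dataset.
phenotype_rows = [
    {"BioSample_ID": "SAMPLE_A", "measurement": "12/0.125", "measurement_sign": "=="},
    {"BioSample_ID": "SAMPLE_B", "measurement": "0.5", "measurement_sign": ">"},
]
genotype_rows = [
    {"BioSample_ID": "SAMPLE_A", "amr_element_symbol": "geneB"},
    {"BioSample_ID": "SAMPLE_A", "amr_element_symbol": "geneA"},
]

linked = link_by_biosample(phenotype_rows, genotype_rows)
```

Phenotype rows without a matching genotype entry keep an empty `amr_elements` list rather than being dropped, which mirrors a left join. For the full tables (the card lists the size category as 1M to 10M rows), a dataframe or SQL join on `BioSample_ID` is likely a better fit than Python loops.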
Provider:
ayates