BioRemPP Database: A Curated Compound-Centric Resource for Bioremediation Potential Profiling
Source: DataCite Commons. Updated 2026-05-06; indexed 2026-05-07.
Download link:
https://zenodo.org/doi/10.5281/zenodo.18905195
Description
Overview
The BioRemPP Database (Bioremediation Potential Profile Database) is a curated, integrated resource designed to support environmental bioremediation research by systematically linking chemical compounds, genes, enzymes, and regulatory frameworks. This database addresses a critical gap in bioremediation science: the absence of a unified, standardized resource connecting priority pollutants with their potential biodegradation pathways across multiple knowledge bases and regulatory contexts.
Scientific Rationale
Environmental contamination by xenobiotic compounds—including chlorinated solvents, polyaromatic hydrocarbons, pesticides, and heavy metals—poses significant ecological and public health challenges. While substantial knowledge exists regarding microbial biodegradation capabilities, this information remains fragmented across databases and regulatory frameworks. BioRemPP systematically integrates these sources into a unified, FAIR-compliant (Findable, Accessible, Interoperable, Reusable) framework.
Database Contents (v1.0.0)
This release contains:
10,869 database entries linking compounds to functional annotations
384 unique chemical compounds with standardized identifiers (CAS, ChEBI, SMILES)
1,541 KEGG Orthology (KO) identifiers for functional annotation
12 chemical compound classes for classification
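Because the entries rely on standardized identifiers, a quick format check can catch malformed values before any downstream analysis. The sketch below is not part of BioRemPP; the function names are hypothetical. It assumes the usual public identifier formats: KO identifiers written as "K" plus five digits, ChEBI identifiers written as "CHEBI:" plus digits, and CAS Registry Numbers with their standard check digit.

```python
import re

# Assumed identifier formats (general conventions, not taken from the BioRemPP schema).
KO_PATTERN = re.compile(r"K\d{5}")                 # e.g. K00001
CHEBI_PATTERN = re.compile(r"CHEBI:\d+")           # e.g. CHEBI:16716
CAS_PATTERN = re.compile(r"(\d{2,7})-(\d{2})-(\d)")  # e.g. 71-43-2


def is_valid_ko(value: str) -> bool:
    """Return True if the value looks like a KEGG Orthology identifier."""
    return KO_PATTERN.fullmatch(value) is not None


def is_valid_chebi(value: str) -> bool:
    """Return True if the value looks like a prefixed ChEBI identifier."""
    return CHEBI_PATTERN.fullmatch(value) is not None


def is_valid_cas(value: str) -> bool:
    """Check the CAS Registry Number layout and its check digit."""
    match = CAS_PATTERN.fullmatch(value)
    if match is None:
        return False
    body = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weights 1, 2, 3, ... are applied starting from the rightmost body digit.
    total = sum(
        weight * int(digit)
        for weight, digit in enumerate(reversed(body), start=1)
    )
    return total % 10 == check_digit
```

For example, `is_valid_cas("71-43-2")` passes the check-digit test, while `is_valid_cas("71-43-3")` does not.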
Data Sources and Integration
Data were curated from multiple authoritative sources:
Regulatory Frameworks:
ATSDR (Agency for Toxic Substances and Disease Registry) Substance Priority List
EPA (U.S. Environmental Protection Agency) National Priorities List
CONAMA (Brazilian National Environment Council) Regulations
IARC (International Agency for Research on Cancer) Classifications (Groups 1, 2A, 2B)
EU Water Framework Directive Priority Substances
Canadian Environmental Protection Act Priority Substances List
Functional Annotations:
KEGG (Kyoto Encyclopedia of Genes and Genomes) — KO identifiers, pathways, enzymes
HADEG (Hydrocarbon Aerobic Degradation Enzymes and Genes) — degradation-specific coverage
BlastKOALA (v3.0) and EggNOG-mapper (v2) — sequence-based functional assignments
Chemical Classification:
ChEBI (Chemical Entities of Biological Interest) — standardized identifiers, SMILES, compound classes
Toxicological Annotations:
ToxCSM — machine learning-based multi-endpoint toxicity predictions (mutagenicity, carcinogenicity, environmental hazard)
Integration Framework and External Database Contributions
BioRemPP is not a copy or aggregation of existing databases. Rather, it is an integration layer that establishes compound-centric relationships across independent external resources, each contributing distinct and complementary information. BioRemPP does not redistribute primary data from these sources; instead, it provides standardized cross-references and relational mappings that enable users to navigate between resources through a unified analytical framework.
The data architecture organizes compound-gene-enzyme relationships within a centralized core containing 1,541 unique KOs and 384 compounds. This core is linked to three external resources that expand functional and toxicological coverage through cross-references, not data duplication.
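The compound-centric core can be pictured as a mapping from each compound to the set of KOs associated with it, plus the inverse mapping from KOs back to compounds. The following is a minimal sketch under stated assumptions: the field names `compound_id` and `ko` and all identifier values are hypothetical placeholders, not the actual BioRemPP schema or real associations.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping, Set


def build_compound_index(rows: Iterable[Mapping[str, str]]) -> Dict[str, Set[str]]:
    """Group KO identifiers by compound (field names are assumptions)."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for row in rows:
        index[row["compound_id"]].add(row["ko"])
    return dict(index)


def invert_index(index: Mapping[str, Set[str]]) -> Dict[str, Set[str]]:
    """Build the reverse lookup: KO identifier to the compounds it is linked to."""
    inverse: Dict[str, Set[str]] = defaultdict(set)
    for compound, kos in index.items():
        for ko in kos:
            inverse[ko].add(compound)
    return dict(inverse)


# Placeholder rows for illustration only; they do not come from the database.
rows = [
    {"compound_id": "CPD_A", "ko": "KO_1"},
    {"compound_id": "CPD_A", "ko": "KO_2"},
    {"compound_id": "CPD_B", "ko": "KO_1"},
    {"compound_id": "CPD_A", "ko": "KO_1"},  # repeated pair collapses in the set
]

compound_to_kos = build_compound_index(rows)
ko_to_compounds = invert_index(compound_to_kos)
```

Keeping both directions makes it cheap to ask either "which KOs relate to this compound?" or "which compounds does this KO touch?" without rescanning the table.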
The following external databases contribute specific layers of information to BioRemPP:
| External Resource | Contribution to BioRemPP | Quantitative Coverage | Relationship Type |
| --- | --- | --- | --- |
| KEGG (Kyoto Encyclopedia of Genes and Genomes) | KO identifiers, gene symbols, enzyme classifications (EC), and pathway associations for functional annotation | 855 entries from 20 canonical xenobiotic metabolism pathways | Cross-reference via KO identifiers |
| HADEG (Hydrocarbon Aerobic Degradation Enzymes and Genes) | Extends degradation-specific gene coverage for hydrocarbons, polymers, and biosurfactant-related pathways | 867 entries across 71 sub-pathways | Cross-reference via KO identifiers |
| ChEBI (Chemical Entities of Biological Interest) | Standardized chemical identifiers, SMILES representations, and compound class assignments | 384 compounds with validated identifiers | Cross-reference via ChEBI IDs |
| ToxCSM | Machine learning-based toxicity predictions as annotation layers | 370 compounds with 31 toxicity endpoints (nuclear, stress-response, genomic, environmental, dose-related) | Cross-reference via CPD identifiers |
| Regulatory Frameworks (ATSDR, EPA, CONAMA, IARC, EU-WFD, CEPA-PSL) | Priority compound lists and hazard classifications defining the scope of environmentally relevant compounds | 9 international regulatory references | Compound inclusion criteria |
What BioRemPP adds:
Compound-centric relational structure — Links compounds to genes, enzymes, pathways, and regulatory status through standardized identifiers (CAS, ChEBI, KO, SMILES), analogous to toxicogenomic frameworks that prioritize chemical-gene interactions
Cross-database harmonization — Resolves identifier inconsistencies and synonym mappings across sources, ensuring interoperability between KEGG, ChEBI, HADEG, and ToxCSM
Regulatory context integration — Associates functional annotations with environmental priority status from multiple international frameworks, structuring the database around environmentally prioritized pollutants rather than pathway catalogs alone
Analytical framework — Provides structured tidy data tables (10,869 entries with 100% completeness across core fields) optimized for bioremediation potential profiling, functional coverage analysis, and sample comparison
Users seeking primary sequence data, pathway diagrams, or detailed toxicological reports should consult the original databases directly. BioRemPP facilitates this navigation by maintaining traceable cross-references to source identifiers.
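One simple way to think about functional coverage analysis is to ask, for each compound, what fraction of its associated KOs appears in a sample's annotated KO set. The sketch below illustrates that idea only. The coverage definition, the function name, and the placeholder identifiers are assumptions made for illustration and are not necessarily the metric used by BioRemPP or its web server.

```python
from typing import Dict, Iterable, Mapping, Set


def coverage_profile(
    compound_to_kos: Mapping[str, Set[str]],
    sample_kos: Iterable[str],
) -> Dict[str, float]:
    """Fraction of each compound's KOs that are present in the sample."""
    sample = set(sample_kos)
    profile: Dict[str, float] = {}
    for compound, kos in compound_to_kos.items():
        if not kos:
            continue  # skip compounds without any associated KO
        profile[compound] = len(kos & sample) / len(kos)
    return profile


# Placeholder mapping and sample annotation, for illustration only.
compound_to_kos = {
    "CPD_A": {"KO_1", "KO_2"},
    "CPD_B": {"KO_1", "KO_3", "KO_4", "KO_5"},
}
sample_kos = ["KO_1", "KO_2", "KO_5"]

profile = coverage_profile(compound_to_kos, sample_kos)
for compound, fraction in sorted(profile.items()):
    print(f"{compound}: {fraction:.2f}")
```

Computing the same profile for several samples and placing the results side by side gives a basic form of sample comparison.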
Reference Genomes
For demonstration and validation purposes, the database includes functional annotations from nine representative RefSeq genomes spanning principal bioremediation-relevant groups:
Bacteria: Acinetobacter baumannii, Enterobacter asburiae, Pseudomonas aeruginosa
Fungi: Aspergillus nidulans, Fusarium graminearum, Cryptococcus gattii
Microalgae/Cyanobacteria: Chlorella variabilis, Nannochloropsis gaditana, Synechocystis sp.
File Formats
All data tables are provided in CSV format with UTF-8 encoding. Detailed field descriptions and data dictionaries are included in the accompanying documentation (biorempp-schemas).
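Since the tables are plain UTF-8 CSV, they can be read with a standard CSV parser. The sketch below uses only the Python standard library to count entries, check that core fields are filled, and count distinct values per field. The column names `compound_id` and `ko` and the inline sample are hypothetical; the real field names are defined in the accompanying data dictionaries.

```python
import csv
import io
from typing import Dict, List, Sequence


def load_rows(handle) -> List[Dict[str, str]]:
    """Read all rows from an open text handle.

    For a file on disk, open it with open(path, encoding="utf-8", newline="").
    """
    return list(csv.DictReader(handle))


def summarize(rows: Sequence[Dict[str, str]], core_fields: Sequence[str]) -> dict:
    """Count entries, fully populated entries, and distinct values per core field."""
    def filled(row: Dict[str, str], field: str) -> bool:
        return bool((row.get(field) or "").strip())

    complete = sum(1 for row in rows if all(filled(row, f) for f in core_fields))
    unique = {
        field: len({row[field].strip() for row in rows if filled(row, field)})
        for field in core_fields
    }
    return {"entries": len(rows), "complete_entries": complete, "unique": unique}


# Inline placeholder data so the sketch runs without the real files.
SAMPLE_CSV = "compound_id,ko\nCPD_A,KO_1\nCPD_A,KO_2\nCPD_B,KO_1\nCPD_C,\n"

rows = load_rows(io.StringIO(SAMPLE_CSV))
summary = summarize(rows, ["compound_id", "ko"])
print(summary)
```

Run against the released tables with the documented column names, the same summary could be compared with the entry, compound, and KO counts reported for this release.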
Associated Web Server
The BioRemPP web server (https://bioinfo.imd.ufrn.br/biorempp/) provides interactive visualization and analysis tools for exploring this database across eight analytical modules (56 use cases) supporting hypothesis generation.
License
This dataset is released under Creative Commons Attribution 4.0 International (CC BY 4.0). Third-party data sources retain their original licenses.
Version History
v1.0.0: Initial release
Contact
For questions or feedback, please contact biorempp@gmail.com or submit an issue via the project repository: https://github.com/BioRemPP/biorempp_web/issues
Keywords
bioremediation, biodegradation, xenobiotics, environmental microbiology, functional genomics, KEGG, pollutants, regulatory compounds, compound-gene associations, toxicology, environmental biotechnology, metagenomics
Metadata Fields
| Field | Value |
| --- | --- |
| Resource Type | Dataset |
| Title | BioRemPP Database: A Curated Compound-Centric Resource for Bioremediation Potential Profiling |
| Version | 1.0.0 |
| License | Creative Commons Attribution 4.0 International (CC BY 4.0) |
| Language | English |
| Subjects | Environmental Sciences, Bioinformatics, Microbiology, Biotechnology |
Provider: Zenodo
Created: 2026-03-07



