
BioRemPP Database: A Curated Compound-Centric Resource for Bioremediation Potential Profiling

DataCite Commons; updated 2026-05-06; indexed 2026-05-07
Download link:
https://zenodo.org/doi/10.5281/zenodo.18905194
Resource description:
Description

Overview

The BioRemPP Database (Bioremediation Potential Profile Database) is a curated, integrated resource designed to support environmental bioremediation research by systematically linking chemical compounds, genes, enzymes, and regulatory frameworks. This database addresses a critical gap in bioremediation science: the absence of a unified, standardized resource connecting priority pollutants with their potential biodegradation pathways across multiple knowledge bases and regulatory contexts.

Scientific Rationale

Environmental contamination by xenobiotic compounds, including chlorinated solvents, polyaromatic hydrocarbons, pesticides, and heavy metals, poses significant ecological and public health challenges. While substantial knowledge exists regarding microbial biodegradation capabilities, this information remains fragmented across databases and regulatory frameworks. BioRemPP systematically integrates these sources into a unified, FAIR-compliant (Findable, Accessible, Interoperable, Reusable) framework.

Database Contents (v1.0.0)

This release contains:

- 10,869 database entries linking compounds to functional annotations
- 384 unique chemical compounds with standardized identifiers (CAS, ChEBI, SMILES)
- 1,541 KEGG Orthology (KO) identifiers for functional annotation
- 12 chemical compound classes for classification

Data Sources and Integration

Data were curated from multiple authoritative sources:

Regulatory Frameworks:

- ATSDR (Agency for Toxic Substances and Disease Registry) Substance Priority List
- EPA (U.S. Environmental Protection Agency) National Priorities List
- CONAMA (Brazilian National Environment Council) Regulations
- IARC (International Agency for Research on Cancer) Classifications (Groups 1, 2A, 2B)
- EU Water Framework Directive Priority Substances
- Canadian Environmental Protection Act Priority Substances List

Functional Annotations:

- KEGG (Kyoto Encyclopedia of Genes and Genomes): KO identifiers, pathways, enzymes
- HADEG (Hydrocarbon Aerobic Degradation Enzymes and Genes): degradation-specific coverage
- BlastKOALA (v3.0) and EggNOG-mapper (v2): sequence-based functional assignments

Chemical Classification:

- ChEBI (Chemical Entities of Biological Interest): standardized identifiers, SMILES, compound classes

Toxicological Annotations:

- ToxCSM: machine learning-based multi-endpoint toxicity predictions (mutagenicity, carcinogenicity, environmental hazard)

Integration Framework and External Database Contributions

BioRemPP is not a copy or aggregation of existing databases. Rather, it is an integration layer that establishes compound-centric relationships across independent external resources, each contributing distinct and complementary information. BioRemPP does not redistribute primary data from these sources; instead, it provides standardized cross-references and relational mappings that enable users to navigate between resources through a unified analytical framework.

The data architecture organizes compound-gene-enzyme relationships within a centralized core containing 1,541 unique KOs and 384 compounds. This core is linked to three external resources that expand functional and toxicological coverage through cross-references, not data duplication.

The following external databases contribute specific layers of information to BioRemPP:

| External Resource | Contribution to BioRemPP | Quantitative Coverage | Relationship Type |
| --- | --- | --- | --- |
| KEGG (Kyoto Encyclopedia of Genes and Genomes) | KO identifiers, gene symbols, enzyme classifications (EC), and pathway associations for functional annotation | 855 entries from 20 canonical xenobiotic metabolism pathways | Cross-reference via KO identifiers |
| HADEG (Hydrocarbon Aerobic Degradation Enzymes and Genes) | Extends degradation-specific gene coverage for hydrocarbons, polymers, and biosurfactant-related pathways | 867 entries across 71 sub-pathways | Cross-reference via KO identifiers |
| ChEBI (Chemical Entities of Biological Interest) | Standardized chemical identifiers, SMILES representations, and compound class assignments | 384 compounds with validated identifiers | Cross-reference via ChEBI IDs |
| ToxCSM | Machine learning-based toxicity predictions as annotation layers | 370 compounds with 31 toxicity endpoints (nuclear, stress-response, genomic, environmental, dose-related) | Cross-reference via CPD identifiers |
| Regulatory Frameworks (ATSDR, EPA, CONAMA, IARC, EU-WFD, CEPA-PSL) | Priority compound lists and hazard classifications defining the scope of environmentally relevant compounds | 9 international regulatory references | Compound inclusion criteria |

What BioRemPP adds:

- Compound-centric relational structure: links compounds to genes, enzymes, pathways, and regulatory status through standardized identifiers (CAS, ChEBI, KO, SMILES), analogous to toxicogenomic frameworks that prioritize chemical-gene interactions
- Cross-database harmonization: resolves identifier inconsistencies and synonym mappings across sources, ensuring interoperability between KEGG, ChEBI, HADEG, and ToxCSM
- Regulatory context integration: associates functional annotations with environmental priority status from multiple international frameworks, structuring the database around environmentally prioritized pollutants rather than pathway catalogs alone
- Analytical framework: provides structured tidy data tables (10,869 entries with 100% completeness across core fields) optimized for bioremediation potential profiling, functional coverage analysis, and sample comparison

Users seeking primary sequence data, pathway diagrams, or detailed toxicological reports should consult the original databases directly. BioRemPP facilitates this navigation by maintaining traceable cross-references to source identifiers.

Reference Genomes

For demonstration and validation purposes, the database includes functional annotations from nine representative RefSeq genomes spanning principal bioremediation-relevant groups:

- Bacteria: Acinetobacter baumannii, Enterobacter asburiae, Pseudomonas aeruginosa
- Fungi: Aspergillus nidulans, Fusarium graminearum, Cryptococcus gattii
- Microalgae/Cyanobacteria: Chlorella variabilis, Nannochloropsis gaditana, Synechocystis sp.

File Formats

All data tables are provided in CSV format with UTF-8 encoding. Detailed field descriptions and data dictionaries are included in the accompanying documentation (biorempp-schemas).

Associated Web Server

The BioRemPP web server (https://bioinfo.imd.ufrn.br/biorempp/) provides interactive visualization and analysis tools for exploring this database across eight analytical modules (56 use cases) supporting hypothesis generation.

License

This dataset is released under Creative Commons Attribution 4.0 International (CC BY 4.0). Third-party data sources retain their original licenses.

Version History

- v1.0.0: Initial release

Contact

For questions or feedback, please contact biorempp@gmail.com or submit issues via the project repository https://github.com/BioRemPP/biorempp_web/issues.

Keywords

bioremediation, biodegradation, xenobiotics, environmental microbiology, functional genomics, KEGG, pollutants, regulatory compounds, compound-gene associations, toxicology, environmental biotechnology, metagenomics

Metadata Fields

| Field | Value |
| --- | --- |
| Resource Type | Dataset |
| Title | BioRemPP Database: A Curated Compound-Centric Resource for Bioremediation Potential Profiling |
| Version | 1.0.0 |
| License | Creative Commons Attribution 4.0 International (CC BY 4.0) |
| Language | English |
| Subjects | Environmental Sciences, Bioinformatics, Microbiology, Biotechnology |
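
The description states that all tables are UTF-8 CSV files and gives release-level counts (entries, unique compounds, unique KOs). A minimal sketch of how a downloaded table could be checked against those counts follows. The column names `compound_id` and `ko` and the sample rows are hypothetical placeholders, not taken from the dataset; the real field names are defined in the biorempp-schemas documentation.

```python
import csv
import io

# Hypothetical column names (assumption): replace with the names given
# in the biorempp-schemas data dictionary.
COMPOUND_COLUMN = "compound_id"
KO_COLUMN = "ko"


def summarize(rows, compound_col=COMPOUND_COLUMN, ko_col=KO_COLUMN):
    """Count entries, unique compounds, and unique KOs in an iterable of dict rows."""
    entries = 0
    compounds = set()
    kos = set()
    for row in rows:
        entries += 1
        compounds.add(row[compound_col])
        kos.add(row[ko_col])
    return {"entries": entries, "compounds": len(compounds), "kos": len(kos)}


def summarize_csv(path, **kwargs):
    """Read a UTF-8 CSV file from disk and summarize it."""
    with open(path, newline="", encoding="utf-8") as handle:
        return summarize(csv.DictReader(handle), **kwargs)


# Made-up sample standing in for a real table; the identifiers are placeholders.
SAMPLE = """compound_id,ko
CPD_A,KO_X
CPD_A,KO_Y
CPD_B,KO_X
"""

sample_summary = summarize(csv.DictReader(io.StringIO(SAMPLE)))
print(sample_summary)
```

Run against the main v1.0.0 table with the correct column names, `summarize_csv` would be expected to report 10,869 entries, 384 compounds, and 1,541 KOs if the file matches the figures stated in the description.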
Provider:
Zenodo
Created:
2026-03-07