hukuang/Cerebus_v2
Source: Hugging Face. Updated 2026-04-01; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/hukuang/Cerebus_v2
Resource description:
---
license: cc-by-4.0
task_categories:
- text-generation
tags:
- genomics
- metagenomics
- microbiome
- biology
size_categories:
- 1B<n<10B
---
# Cerebus_v2: Microbial Genome Contig Database
Each row in the Cerebus_v2 dataset is a single contig (contiguous DNA sequence) from a microbial genome. Genomes are organized into three top-level datasets: **GTDB** (isolate genomes), **IMG/PR** (plasmid/phage), and **metagenomic** (metagenome-assembled genomes from diverse environments). Where available, each contig carries bin-level metadata: the `genome_id` it belongs to, its taxonomic classification (`taxonomy`), genome quality scores (`completeness` and `contamination`, from CheckM), and a `species_cluster` assignment.
## Schema
| Column | Type | Description |
|--------|------|-------------|
| `dataset` | string | Top-level dataset (GTDB, IMGPR, metagenomic) |
| `source` | string | Data source (e.g. GTDB_r220, NCBI, HumanGut_UMGS, MGnify_chicken_gut) |
| `file` | string | Original genome file name |
| `contig` | string | Contig header from the FASTA file |
| `sequence` | string | Nucleotide sequence |
| `genome_id` | string | Genome/bin identifier linking all contigs from the same genome |
| `taxonomy` | string | Taxonomic classification (format varies by source) |
| `completeness` | float | Genome completeness (%) from CheckM, where available |
| `contamination` | float | Genome contamination (%) from CheckM, where available |
| `species_cluster` | string | Species cluster representative, where available |
## Dataset Summary
| Dataset | Files | Size | Description |
|---------|-------|------|-------------|
| GTDB | 36 | 101 GB | GTDB r220 isolate genomes |
| IMG/PR | 1 | 4.8 GB | IMG/PR plasmid and phage sequences |
| Metagenomic | 234 | 636 GB | MAGs from NCBI, UHGG, UMGS, GEM, Youngblut, GPD, and MGnify biome catalogs |
| **Total** | **271** | **742 GB** | |
## Examples
### Metagenomic (Human Gut UMGS source)
| Column | Value |
|--------|-------|
| `dataset` | metagenomic |
| `source` | HumanGut_UMGS |
| `file` | DRR042264_bin.1.fa |
| `contig` | NODE_10_length_190864_cov_5.081646 |
| `genome_id` | DRR042264_bin.1 |
| `taxonomy` | k\_\_Bacteria;p\_\_Tenericutes;c\_\_Mollicutes;o\_\_Erysipelotrichales;f\_\_Erysipelotrichaceae;g\_\_Solobacterium |
| `completeness` | 91.98 |
| `contamination` | 1.65 |
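
The `taxonomy` and `contig` values in this example are structured strings, so they can be split into fields. The sketch below assumes the stored taxonomy uses plain double underscores (`k__Bacteria;...`; the backslashes above are Markdown escapes) and that the contig header follows the `NODE_<n>_length_<len>_cov_<cov>` pattern seen here. Both helper functions are hypothetical, and since the taxonomy format varies by source, anything that does not match is returned empty.

```python
import re

# Rank prefixes used in 'x__Name' lineage strings.
RANK_NAMES = {"d": "domain", "k": "kingdom", "p": "phylum", "c": "class",
              "o": "order", "f": "family", "g": "genus", "s": "species"}


def parse_taxonomy(taxonomy):
    """Split 'k__X;p__Y;...' into {rank: name}; return {} for other formats."""
    ranks = {}
    for part in taxonomy.split(";"):
        prefix, sep, name = part.strip().partition("__")
        if sep and prefix in RANK_NAMES and name:
            ranks[RANK_NAMES[prefix]] = name
    return ranks


HEADER_RE = re.compile(r"^NODE_(\d+)_length_(\d+)_cov_([\d.]+)")


def parse_node_header(contig):
    """Extract node index, length, and coverage, or None if the header differs."""
    m = HEADER_RE.match(contig)
    if m is None:
        return None
    return {"node": int(m.group(1)),
            "length": int(m.group(2)),
            "coverage": float(m.group(3))}


lineage = parse_taxonomy(
    "k__Bacteria;p__Tenericutes;c__Mollicutes;o__Erysipelotrichales;"
    "f__Erysipelotrichaceae;g__Solobacterium"
)
header = parse_node_header("NODE_10_length_190864_cov_5.081646")
```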
### Metagenomic (NCBI source)
| Column | Value |
|--------|-------|
| `dataset` | metagenomic |
| `source` | NCBI |
| `file` | GCF_000003135.1_ASM313v1_genomic.fna |
| `genome_id` | GCF_000003135.1 |
| `taxonomy` | Bifidobacterium longum subsp. longum ATCC 55813 |
### GTDB
| Column | Value |
|--------|-------|
| `dataset` | GTDB |
| `source` | GTDB_r220 |
| `file` | GCA_000008085.1_genomic.fna |
| `contig` | AE017199.1 Nanoarchaeum equitans Kin4-M chromosome, complete genome |
| `genome_id` | GCA_000008085.1_genomic.fna |
## Usage
```python
import pyarrow.compute as pc
import pyarrow.parquet as pq

# Read a single file
table = pq.read_table("GTDB/gtdb_0000.parquet")

# Filter by genome_id to get all contigs from one bin
mask = pc.equal(table["genome_id"], "GCA_000008085.1_genomic.fna")
genome = table.filter(mask)
```
## File Format
All files are stored as Parquet with zstd compression. The `genome_id` field links all contigs belonging to the same bin, allowing users to group contigs by genome and access quality and taxonomic annotations at the bin level.
Provided by:
hukuang



