
hukuang/Cerebus_v2

Hugging Face | Updated 2026-04-01 | Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/hukuang/Cerebus_v2
Resource description:
---
license: cc-by-4.0
task_categories:
- text-generation
tags:
- genomics
- metagenomics
- microbiome
- biology
size_categories:
- 1B<n<10B
---

# Cerebus_v2: Microbial Genome Contig Database

Each row in the Cerebus_v2 dataset represents a single contig (contiguous DNA sequence) from a microbial genome. Genomes are organized into three major datasets: **GTDB** (isolate genomes), **IMG/PR** (plasmid/phage), and **metagenomic** (metagenome-assembled genomes from diverse environments). Where available, each contig is annotated with bin-level metadata including the `genome_id` it belongs to, its taxonomic classification (`taxonomy`), genome quality scores (`completeness` and `contamination` from CheckM), and a `species_cluster` assignment.

## Schema

| Column | Type | Description |
|--------|------|-------------|
| `dataset` | string | Top-level dataset (GTDB, IMGPR, metagenomic) |
| `source` | string | Data source (e.g. GTDB_r220, NCBI, HumanGut_UMGS, MGnify_chicken_gut) |
| `file` | string | Original genome file name |
| `contig` | string | Contig header from the FASTA file |
| `sequence` | string | Nucleotide sequence |
| `genome_id` | string | Genome/bin identifier linking all contigs from the same genome |
| `taxonomy` | string | Taxonomic classification (format varies by source) |
| `completeness` | float | Genome completeness (%) from CheckM, where available |
| `contamination` | float | Genome contamination (%) from CheckM, where available |
| `species_cluster` | string | Species cluster representative, where available |

## Dataset Summary

| Dataset | Files | Size | Description |
|---------|-------|------|-------------|
| GTDB | 36 | 101 GB | GTDB r220 isolate genomes |
| IMG/PR | 1 | 4.8 GB | IMG/PR plasmid and phage sequences |
| Metagenomic | 234 | 636 GB | MAGs from NCBI, UHGG, UMGS, GEM, Youngblut, GPD, and MGnify biome catalogs |
| **Total** | **271** | **742 GB** | |

## Examples

### Metagenomic (Human Gut UMGS source)

| Column | Value |
|--------|-------|
| `dataset` | metagenomic |
| `source` | HumanGut_UMGS |
| `file` | DRR042264_bin.1.fa |
| `contig` | NODE_10_length_190864_cov_5.081646 |
| `genome_id` | DRR042264_bin.1 |
| `taxonomy` | k\_\_Bacteria;p\_\_Tenericutes;c\_\_Mollicutes;o\_\_Erysipelotrichales;f\_\_Erysipelotrichaceae;g\_\_Solobacterium |
| `completeness` | 91.98 |
| `contamination` | 1.65 |

### Metagenomic (NCBI source)

| Column | Value |
|--------|-------|
| `dataset` | metagenomic |
| `source` | NCBI |
| `file` | GCF_000003135.1_ASM313v1_genomic.fna |
| `genome_id` | GCF_000003135.1 |
| `taxonomy` | Bifidobacterium longum subsp. longum ATCC 55813 |

### GTDB

| Column | Value |
|--------|-------|
| `dataset` | GTDB |
| `source` | GTDB_r220 |
| `file` | GCA_000008085.1_genomic.fna |
| `contig` | AE017199.1 Nanoarchaeum equitans Kin4-M chromosome, complete genome |
| `genome_id` | GCA_000008085.1_genomic.fna |

## Usage

```python
import pyarrow.compute as pc
import pyarrow.parquet as pq

# Read a single file
table = pq.read_table("GTDB/gtdb_0000.parquet")

# Filter by genome_id to get all contigs from one bin
mask = pc.equal(table["genome_id"], "GCA_000008085.1_genomic.fna")
genome = table.filter(mask)
```

## File Format

All files are stored as Parquet with zstd compression. The `genome_id` field links all contigs belonging to the same bin, allowing users to group contigs by genome and access quality and taxonomic annotations at the bin level.
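
Grouping contigs by `genome_id`, as described above, can be sketched in plain Python. This is a minimal illustration, not part of the dataset's tooling. The function names `summarize_bins` and `high_quality` are hypothetical, and the 90% completeness and 5% contamination cutoffs are assumed for illustration only. The toy sequences are made up; only the identifiers and quality scores reuse values from the examples above. The sketch also assumes that every contig of a bin repeats the same bin-level scores.

```python
from collections import defaultdict

# Illustrative thresholds (an assumption, not defined by the dataset card).
MIN_COMPLETENESS = 90.0
MAX_CONTAMINATION = 5.0


def summarize_bins(rows):
    """Group contig rows by genome_id and return one summary dict per bin."""
    bins = defaultdict(
        lambda: {"n_contigs": 0, "total_bp": 0,
                 "completeness": None, "contamination": None}
    )
    for row in rows:
        summary = bins[row["genome_id"]]
        summary["n_contigs"] += 1
        summary["total_bp"] += len(row["sequence"])
        # Scores are assumed to be repeated on every contig of a bin,
        # so keep the first non-missing value.
        for key in ("completeness", "contamination"):
            if summary[key] is None and row.get(key) is not None:
                summary[key] = row[key]
    return dict(bins)


def high_quality(bins, min_completeness=MIN_COMPLETENESS,
                 max_contamination=MAX_CONTAMINATION):
    """Keep bins whose scores are present and pass both thresholds."""
    return {
        genome_id: s
        for genome_id, s in bins.items()
        if s["completeness"] is not None
        and s["contamination"] is not None
        and s["completeness"] >= min_completeness
        and s["contamination"] <= max_contamination
    }


# Toy rows: sequences are invented, identifiers come from the examples above.
rows = [
    {"genome_id": "DRR042264_bin.1", "sequence": "ACGTACGT",
     "completeness": 91.98, "contamination": 1.65},
    {"genome_id": "DRR042264_bin.1", "sequence": "GGCC",
     "completeness": 91.98, "contamination": 1.65},
    {"genome_id": "GCF_000003135.1", "sequence": "ATAT",
     "completeness": None, "contamination": None},
]

bins = summarize_bins(rows)
kept = high_quality(bins)
print(kept)
```

Rows in this shape can be produced from a Parquet file with `pq.read_table(path, columns=[...]).to_pylist()`. Because the `sequence` column dominates file size, this approach is only practical for small subsets; for whole files, a streaming or columnar aggregation would likely be a better fit.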
Provided by:
hukuang