neuralbioinfo/biom
Source: Hugging Face. Updated 2026-02-10; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/neuralbioinfo/biom
Description:
---
license: cc-by-nc-4.0
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
dataset_info:
  features:
  - name: sequence_id
    dtype: int64
  - name: cluster_id
    dtype: int64
  - name: rep_id
    dtype: int64
  - name: redundancy
    dtype: string
  - name: is_ambiguous
    dtype: string
  - name: is_skani_viral
    dtype: string
  - name: is_rs10
    dtype: string
  - name: is_checkv
    dtype: bool
  - name: fraction
    dtype: string
  - name: biome
    dtype: string
  - name: y
    dtype: int64
  - name: label
    dtype: string
  - name: sample_id
    dtype: string
  - name: fasta_id
    dtype: string
  - name: orientation
    dtype: string
  - name: pair_id
    dtype: string
  - name: Project ID
    dtype: string
  - name: SRA/ERR Run
    dtype: string
  - name: length_category
    dtype: string
  - name: length_category_encoded
    dtype: int64
  - name: Benchmark_exclusion
    dtype: int64
  - name: seq_len
    dtype: int64
  - name: sequence
    dtype: string
  splits:
  - name: train
    num_bytes: 4243077700
    num_examples: 1041053
  download_size: 1888413845
  dataset_size: 4243077700
---
# biom
## Overview
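This dataset contains 1,041,053 long DNA contig sequences derived from metagenomic samples. Each record carries metadata covering redundancy clustering, quality and viral-similarity flags, sample provenance, and sequence length, together with a classification label.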
## Configs and splits
- Config: default
- Split: train
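
A minimal loading sketch, assuming the Hugging Face `datasets` library is installed and the dataset is reachable under the repository ID `neuralbioinfo/biom` (taken from the page title). The front matter reports a download of about 1.9 GB and a dataset size of about 4.2 GB, so the sketch streams records rather than fetching everything up front. The helper names `open_biom_stream` and `peek` are illustrative and not part of the dataset.

```python
from itertools import islice


def open_biom_stream(repo_id="neuralbioinfo/biom"):
    """Open the train split lazily; each record arrives as a plain dict."""
    # Imported here so that `peek` stays usable without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(repo_id, name="default", split="train", streaming=True)


def peek(records, n=3):
    """Return the first n items of any iterable as a list."""
    return list(islice(records, n))
```

With network access, `peek(open_biom_stream())` would return the first three records, each keyed by the column names listed under `features`.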
## Column descriptions
| Column | Type | Description | Observed values / notes |
|-------|------|-------------|-------------------------|
| sequence_id | int64 | Unique integer identifier for each sequence record. | Sequential integers |
| cluster_id | int64 | Identifier of a redundancy cluster grouping highly similar sequences. | Integer cluster IDs |
| rep_id | int64 | Identifier of the representative sequence within a cluster. | Matches representative entries |
| redundancy | string | Redundancy status of the sequence relative to clustering. | cluster_representative |
| is_ambiguous | string | Indicates whether the sequence contains ambiguous bases or failed ambiguity checks. | ambiguous, not_ambiguous |
| is_skani_viral | string | Result of a viral similarity check (e.g. ANI/skani-based). | similar2v, not_similar2v |
| is_rs10 | string | Dataset-specific categorical flag. | not_rs10 |
| is_checkv | bool | Boolean flag related to CheckV processing or filtering. | false |
| fraction | string | High-level category of sequence origin. | microbial |
| biome | string | Environmental biome associated with the sample. | tomato soil |
| y | int64 | Numeric class label used for supervised learning. | 0 |
| label | string | Human-readable class label corresponding to y. | non_phage |
| sample_id | string | Sample or run identifier, typically an SRA-style accession. | e.g. SRR8487012 |
| fasta_id | string | Identifier from the source FASTA or assembly, often encoding node, length, and coverage. | SRR…_NODE_* |
| orientation | string | Orientation of the sequence in the original assembly. | forward |
| pair_id | string | Dataset-specific grouping identifier. | R7 |
| Project ID | string | BioProject accession associated with the sample. | PRJNA646779 |
| SRA/ERR Run | string | SRA or ENA run accession. | SRR8487012 |
| length_category | string | Length bin derived from sequence length. | 10k - 50k |
| length_category_encoded | int64 | Integer encoding of length_category. | 4 |
| Benchmark_exclusion | int64 | Flag indicating whether the record should be excluded from benchmarking. | 0 or 1 |
| seq_len | int64 | Length of the DNA sequence in base pairs. | ~29k–47k in samples |
| sequence | string | DNA sequence consisting of A, C, G, and T characters. Length matches seq_len. | Long nucleotide strings |
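
The constraints stated in the table can be checked per record. The sketch below works on plain dicts keyed by the column names above; it verifies that `seq_len` equals the length of `sequence` and that the sequence uses only A, C, G, and T. It also assumes that `Benchmark_exclusion == 1` marks a record to exclude, which the table does not state explicitly. The function names and the two toy records are illustrative.

```python
VALID_BASES = set("ACGT")


def check_record(rec):
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    seq = rec["sequence"]
    if len(seq) != rec["seq_len"]:
        problems.append("seq_len mismatch")
    if not set(seq) <= VALID_BASES:
        problems.append("non-ACGT characters")
    return problems


def keep_for_benchmark(rec):
    """Assumption: 1 means 'exclude from benchmarking', 0 means 'keep'."""
    return rec["Benchmark_exclusion"] == 0


# Toy records for illustration only; real sequences are far longer.
good = {"sequence": "ACGTAC", "seq_len": 6, "Benchmark_exclusion": 0}
bad = {"sequence": "ACGNN", "seq_len": 6, "Benchmark_exclusion": 1}
```

Here `check_record(good)` returns an empty list, while `check_record(bad)` reports both a length mismatch and non-ACGT characters.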
## Labels
The dataset includes both numeric and string labels:
- y = 0 corresponds to label = non_phage
Additional label values may exist outside the sampled records.
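
Because only one label pair was observed in the sampled records, it may be worth tallying the `(y, label)` combinations over the full split before training. A minimal sketch, assuming records are dicts with `y` and `label` keys; the helper names and the toy input are illustrative.

```python
from collections import Counter


def label_pairs(records):
    """Count each (y, label) combination seen in an iterable of records."""
    return Counter((rec["y"], rec["label"]) for rec in records)


def is_consistent(pairs):
    """True if every y maps to exactly one label and vice versa."""
    ys = {y for y, _ in pairs}
    labels = {label for _, label in pairs}
    return len(ys) == len(pairs) == len(labels)


# Toy input using only the label value documented above.
toy = [{"y": 0, "label": "non_phage"}, {"y": 0, "label": "non_phage"}]
counts = label_pairs(toy)
```

On the toy input, `counts` holds a single pair, `(0, "non_phage")`, with a count of 2, and `is_consistent(counts)` is true.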
Provider:
neuralbioinfo



