
neuralbioinfo/biom

Source: Hugging Face | Updated: 2026-02-10 | Indexed: 2026-03-29
Download link:
https://hf-mirror.com/datasets/neuralbioinfo/biom
Resource description:
---
license: cc-by-nc-4.0
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
dataset_info:
  features:
  - name: sequence_id
    dtype: int64
  - name: cluster_id
    dtype: int64
  - name: rep_id
    dtype: int64
  - name: redundancy
    dtype: string
  - name: is_ambiguous
    dtype: string
  - name: is_skani_viral
    dtype: string
  - name: is_rs10
    dtype: string
  - name: is_checkv
    dtype: bool
  - name: fraction
    dtype: string
  - name: biome
    dtype: string
  - name: y
    dtype: int64
  - name: label
    dtype: string
  - name: sample_id
    dtype: string
  - name: fasta_id
    dtype: string
  - name: orientation
    dtype: string
  - name: pair_id
    dtype: string
  - name: Project ID
    dtype: string
  - name: SRA/ERR Run
    dtype: string
  - name: length_category
    dtype: string
  - name: length_category_encoded
    dtype: int64
  - name: Benchmark_exclusion
    dtype: int64
  - name: seq_len
    dtype: int64
  - name: sequence
    dtype: string
  splits:
  - name: train
    num_bytes: 4243077700
    num_examples: 1041053
  download_size: 1888413845
  dataset_size: 4243077700
---

# biom

## Overview

This dataset contains long DNA contig sequences derived from metagenomic samples, together with extensive metadata and classification labels.

## Configs and splits

- Config: default
- Split: train

## Column descriptions

| Column | Type | Description | Observed values / notes |
|-------|------|-------------|-------------------------|
| sequence_id | int64 | Unique integer identifier for each sequence record. | Sequential integers |
| cluster_id | int64 | Identifier of a redundancy cluster grouping highly similar sequences. | Integer cluster IDs |
| rep_id | int64 | Identifier of the representative sequence within a cluster. | Matches representative entries |
| redundancy | string | Redundancy status of the sequence relative to clustering. | cluster_representative |
| is_ambiguous | string | Indicates whether the sequence contains ambiguous bases or failed ambiguity checks. | ambiguous, not_ambiguous |
| is_skani_viral | string | Result of a viral similarity check (e.g. ANI/skani-based). | similar2v, not_similar2v |
| is_rs10 | string | Dataset-specific categorical flag. | not_rs10 |
| is_checkv | bool | Boolean flag related to CheckV processing or filtering. | false |
| fraction | string | High-level category of sequence origin. | microbial |
| biome | string | Environmental biome associated with the sample. | tomato soil |
| y | int64 | Numeric class label used for supervised learning. | 0 |
| label | string | Human-readable class label corresponding to y. | non_phage |
| sample_id | string | Sample or run identifier, typically an SRA-style accession. | e.g. SRR8487012 |
| fasta_id | string | Identifier from the source FASTA or assembly, often encoding node, length, and coverage. | SRR…_NODE_* |
| orientation | string | Orientation of the sequence in the original assembly. | forward |
| pair_id | string | Dataset-specific grouping identifier. | R7 |
| Project ID | string | BioProject accession associated with the sample. | PRJNA646779 |
| SRA/ERR Run | string | SRA or ENA run accession. | SRR8487012 |
| length_category | string | Length bin derived from sequence length. | 10k - 50k |
| length_category_encoded | int64 | Integer encoding of length_category. | 4 |
| Benchmark_exclusion | int64 | Flag indicating whether the record should be excluded from benchmarking. | 0 or 1 |
| seq_len | int64 | Length of the DNA sequence in base pairs. | ~29k–47k in samples |
| sequence | string | DNA sequence consisting of A, C, G, and T characters. Length matches seq_len. | Long nucleotide strings |

## Labels

The dataset includes both numeric and string labels:

- y = 0 corresponds to label = non_phage

Additional label values may exist outside the sampled records.
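
The column descriptions above state two properties that are easy to check in code: `seq_len` should equal the length of `sequence`, and `Benchmark_exclusion` marks records to leave out of benchmarking. The following is a minimal sketch of both checks, not part of the dataset card. It assumes that `Benchmark_exclusion == 1` means "exclude" (the card only says the flag takes the values 0 or 1), the helper names `check_record`, `benchmark_rows`, and `stream_biom` are hypothetical, and `toy_rows` are made-up records for illustration, not real dataset entries.

```python
from typing import Dict, Iterable, Iterator, List

VALID_BASES = set("ACGT")  # alphabet stated in the card's `sequence` row


def check_record(rec: Dict) -> List[str]:
    """Return a list of problems found in one record (empty list = consistent)."""
    problems = []
    seq = rec["sequence"]
    if len(seq) != rec["seq_len"]:
        problems.append("seq_len does not match len(sequence)")
    if not set(seq.upper()) <= VALID_BASES:
        problems.append("sequence contains characters other than A, C, G, T")
    return problems


def benchmark_rows(rows: Iterable[Dict]) -> Iterator[Dict]:
    """Yield rows kept for benchmarking.

    Assumption: Benchmark_exclusion == 1 means "exclude this record".
    """
    for rec in rows:
        if rec["Benchmark_exclusion"] == 0:
            yield rec


def stream_biom():
    """Stream the train split without downloading it in full.

    Not called here: it needs network access and the Hugging Face
    `datasets` package, which is imported lazily for that reason.
    """
    from datasets import load_dataset

    return load_dataset("neuralbioinfo/biom", split="train", streaming=True)


# Made-up toy records that only mimic the schema; NOT real dataset rows.
toy_rows = [
    {"sequence_id": 0, "sequence": "ACGTACGT", "seq_len": 8, "Benchmark_exclusion": 0},
    {"sequence_id": 1, "sequence": "ACGTNN", "seq_len": 6, "Benchmark_exclusion": 0},
    {"sequence_id": 2, "sequence": "ACGT", "seq_len": 5, "Benchmark_exclusion": 1},
]

kept = list(benchmark_rows(toy_rows))
report = {rec["sequence_id"]: check_record(rec) for rec in toy_rows}
```

On real data, the same helpers could be applied to the iterator returned by `stream_biom()`, since streamed rows are plain dictionaries keyed by column name. Streaming is a reasonable default here given the download size of roughly 1.9 GB reported in the metadata.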
Provider:
neuralbioinfo