parkneurals/ecoli-essentiality-cub
Source: Hugging Face. Updated 2026-04-09; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/parkneurals/ecoli-essentiality-cub
Description:
---
license: cc-by-4.0
task_categories:
- tabular-classification
language:
- en
tags:
- bioinformatics
- genomics
- codon-usage-bias
- gene-essentiality
- e-coli
- machine-learning
pretty_name: E. coli Gene Essentiality — Codon Usage Bias Features
size_categories:
- 1K<n<10K
---
This dataset contains 1,503 synthetic *E. coli* K-12-like coding sequences,
each with 114 codon usage bias (CUB) features and a binary essentiality label.
Feature distributions are calibrated to published *E. coli* K-12 MG1655 statistics
(Sharp & Li 1987; Rocha 2004; Ikemura 1985).
## Label
| Value | Meaning | Count |
|-------|---------|-------|
| 1 | Essential | 303 (20.2%) |
| 0 | Non-essential | 1,200 (79.8%) |
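Because roughly one gene in five is essential, a classifier trained on the raw labels may favor the majority class. The sketch below derives inverse-frequency class weights from the counts in the table, using the same `n_samples / (n_classes * count)` heuristic that scikit-learn applies for `class_weight="balanced"`. The function name `balanced_class_weights` is illustrative and not part of the dataset.

```python
def balanced_class_weights(counts):
    """Return inverse-frequency weights: total / (n_classes * class_count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


# Counts taken from the label table above (1 = essential, 0 = non-essential).
label_counts = {1: 303, 0: 1200}
weights = balanced_class_weights(label_counts)

for label, weight in sorted(weights.items()):
    print(f"label={label} weight={weight:.3f}")
```

With these weights each class contributes the same total weight to a weighted loss, so the essential class is not drowned out by the larger non-essential class.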
## Feature Groups (114 total)
| Group | Count | Examples |
|-------|-------|---------|
| Primary CUB indices | 5 | `cai`, `enc`, `gc3s`, `fop`, `gravy` |
| GC by codon position | 4 | `gc1`, `gc2`, `gc3`, `gc_overall` |
| Gene structure | 1 | `gene_length_codons` |
| Dinucleotide biases (key) | 7 | `cpg_bias`, `ta_bias`, ... |
| RSCU per codon | 61 | `rscu_CTG`, `rscu_AAA`, ... |
| All dinucleotide biases | 16 | `dinuc_CG`, `dinuc_AT`, ... |
| Amino acid fractions | 20 | `aa_frac_A`, `aa_frac_L`, ... |
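To make two of the feature groups concrete, here is a minimal sketch of how per-codon RSCU values and GC content by codon position could be computed from a coding sequence. It uses the standard genetic code and the textbook RSCU definition (observed count divided by the mean count across the synonymous codons of that amino acid). The dataset's own feature code is not published in this card, so treat the details as assumptions: the helper names, the inclusion of the stop codon in the GC calculation, and the value `0.0` for codons whose amino acid is absent.

```python
from collections import Counter
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered over T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
# Number of synonymous codons per amino acid (stop codons excluded).
FAMILY_SIZE = Counter(aa for aa in CODON_TABLE.values() if aa != "*")


def split_codons(cds):
    """Split a sequence into complete codons, ignoring any trailing bases."""
    return [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]


def rscu(cds):
    """Return RSCU for the 61 sense codons, keyed like the dataset columns."""
    counts = Counter(
        c for c in split_codons(cds) if CODON_TABLE.get(c, "*") != "*"
    )
    aa_totals = Counter()
    for codon, n in counts.items():
        aa_totals[CODON_TABLE[codon]] += n
    values = {}
    for codon, aa in CODON_TABLE.items():
        if aa == "*":
            continue
        total = aa_totals[aa]
        # Assumption: codons of an absent amino acid get 0.0.
        values["rscu_" + codon] = (
            counts[codon] * FAMILY_SIZE[aa] / total if total else 0.0
        )
    return values


def gc_by_position(cds):
    """Return GC fraction at codon positions 1-3 and overall (non-empty CDS)."""
    codons = split_codons(cds)
    result = {}
    for pos in range(3):
        gc = sum(codon[pos] in "GC" for codon in codons)
        result[f"gc{pos + 1}"] = gc / len(codons)
    result["gc_overall"] = sum(base in "GC" for base in cds) / len(cds)
    return result


# Hypothetical toy sequence: ATG CTG CTG CTT AAA TAA
toy_cds = "ATGCTGCTGCTTAAATAA"
toy_rscu = rscu(toy_cds)
toy_gc = gc_by_position(toy_cds)
print(toy_rscu["rscu_CTG"], toy_gc["gc3"])
```

In the toy sequence, leucine appears three times across a six-codon family, so `rscu_CTG` (two occurrences) comes out well above 1, which is the signature of a preferred codon.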
## Columns
- `gene_id` — unique identifier (`ESS_XXXX` or `NON_XXXX`)
- `cds_sequence` — synthetic DNA coding sequence (starts ATG, ends TAA)
- `label` — binary label (1 = essential, 0 = non-essential)
- `label_str` — human-readable label string
- _114 feature columns_ — see table above
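The column descriptions above imply a few invariants that are easy to check after loading the data. The sketch below validates a single row represented as a plain dictionary. Several checks are assumptions rather than documented guarantees: the `XXXX` suffix format is not specified, so any word characters are accepted; the `ESS`/`NON` prefix is assumed to track the label; and the sequence length is assumed to be a multiple of three. The sample rows are hypothetical.

```python
import re

# Assumption: the XXXX suffix is unspecified, so any word characters are accepted.
GENE_ID_PATTERN = re.compile(r"^(ESS|NON)_\w+$")


def validate_row(row):
    """Return a list of problems found in one row (empty list means OK)."""
    problems = []

    match = GENE_ID_PATTERN.match(row["gene_id"])
    if not match:
        problems.append("gene_id does not match ESS_XXXX / NON_XXXX")
    elif (match.group(1) == "ESS") != (row["label"] == 1):
        # Assumption: the prefix mirrors the essentiality label.
        problems.append("gene_id prefix disagrees with label")

    if row["label"] not in (0, 1):
        problems.append("label is not 0 or 1")

    seq = row["cds_sequence"].upper()
    if not seq.startswith("ATG"):
        problems.append("sequence does not start with ATG")
    if not seq.endswith("TAA"):
        problems.append("sequence does not end with TAA")
    if len(seq) % 3 != 0:
        problems.append("sequence length is not a multiple of 3")
    if set(seq) - set("ACGT"):
        problems.append("sequence contains non-ACGT characters")

    return problems


# Hypothetical rows for illustration only.
good_row = {"gene_id": "ESS_0001", "cds_sequence": "ATGAAACTGTAA", "label": 1}
bad_row = {"gene_id": "NON_0002", "cds_sequence": "ATGAAACTGTA", "label": 1}

print(validate_row(good_row))
print(validate_row(bad_row))
```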
## Biological Calibration
| Parameter | Essential | Non-essential | Source |
|-----------|-----------|---------------|--------|
| CAI target (mean) | 0.72 | 0.55 | Rocha (2004) |
| CAI std | 0.07 | 0.12 | Sharp & Li (1987) |
| GC3s target (mean) | 0.53 | 0.49 | Rocha (2004) |
| Gene length (codons, log-mean) | ~230 | ~310 | NCBI K-12 proteome |
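The card does not describe the generator itself, so the following is only one plausible reading of the CAI rows: per-gene CAI targets drawn from a normal distribution with the tabulated mean and standard deviation, then clipped to the valid range [0, 1]. The normal distribution, the clipping, the seeds, and the name `sample_cai_targets` are all assumptions made for illustration.

```python
import random
import statistics

# (mean, std) pairs copied from the calibration table above.
CAI_TARGETS = {
    "essential": (0.72, 0.07),
    "non_essential": (0.55, 0.12),
}


def sample_cai_targets(group, n, seed=0):
    """Draw n CAI targets for a group, clipped to [0, 1].

    Assumption: a clipped normal distribution; the real generator may differ.
    """
    mean, std = CAI_TARGETS[group]
    rng = random.Random(seed)
    return [min(1.0, max(0.0, rng.gauss(mean, std))) for _ in range(n)]


# Group sizes match the label counts reported for the dataset.
essential_cai = sample_cai_targets("essential", 303, seed=1)
non_essential_cai = sample_cai_targets("non_essential", 1200, seed=2)

print(f"essential mean CAI:     {statistics.mean(essential_cai):.3f}")
print(f"non-essential mean CAI: {statistics.mean(non_essential_cai):.3f}")
```

Under this reading the two groups overlap substantially (the non-essential spread is wide), so CAI alone would separate the classes only partially.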
Provider: parkneurals



