Nhoodie/omni-dna-sad-mutation-dataset
Hugging Face · Updated 2026-04-07 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/Nhoodie/omni-dna-sad-mutation-dataset
Description:
---
license: apache-2.0
tags:
- dna
- genomics
- mutation
- synthetic
- sad
task_categories:
- text-generation
size_categories:
- 1K<n<10K
---
# Omni-DNA SAD Mutation Dataset
Synthetic and real DNA mutation pairs for training cross-domain horizontal gene transfer (HGT) mutation prediction models.
## Files
| File | Pairs | Source |
|------|-------|--------|
| `synthetic_expanded.jsonl` | 8,112 | ICI dual-model generation (Omni + HyenaDNA consensus) |
| `train.jsonl` | 3,317 | Real NCBI sequences |
| `test.jsonl` | 826 | Real NCBI sequences (held-out) |
## Format
Each line is a JSON object:
```json
{"parent": "ATGGCT...", "child": "ATAGCT..."}
```
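A minimal sketch of reading one of these files, assuming only the two keys shown above (`parent` and `child`). The helper name `load_pairs` is hypothetical and not part of the dataset.

```python
import json
from pathlib import Path


def load_pairs(path):
    """Read a JSONL file and return a list of (parent, child) tuples."""
    pairs = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:  # tolerate blank lines between records
                continue
            record = json.loads(line)
            pairs.append((record["parent"], record["child"]))
    return pairs
```

For example, `load_pairs("train.jsonl")` would return the real training pairs once the file has been downloaded locally.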
## Generation Method (Synthetic Data)
1. **Source**: 1,014 real DNA sequences from diverse species
2. **FDI (Focus-Doped Interleaving)**: Every **3 codons** (9 bp), a 1-codon (3 bp) gap is introduced
3. **Dual-model consensus**: Omni-DNA-20M and HyenaDNA tiny-1k independently predict gap nucleotides
4. **Consensus tagging**: Agreement = `consensus`, disagreement = `contested`
5. **8 generation passes** with different gap intervals (3-6) and seeds, then deduplicated
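The gap placement and consensus tagging steps can be sketched as follows. This is an illustration under stated assumptions, not the original generation code: it assumes a gap masks an existing codon in place (rather than inserting new bases), uses `N` as a placeholder character, and does not call either model. The function names are hypothetical.

```python
def gap_positions(seq_len, interval_codons=3, gap_codons=1):
    """Return 0-based nucleotide indices that fall inside gap codons.

    Assumption: after every `interval_codons` kept codons, the next
    `gap_codons` codons are treated as a gap (masked, not inserted).
    """
    period = (interval_codons + gap_codons) * 3  # bp per keep+gap cycle
    keep = interval_codons * 3                   # bp kept at the start of each cycle
    return [i for i in range(seq_len) if i % period >= keep]


def mask_gaps(seq, interval_codons=3, gap_codons=1, mask_char="N"):
    """Replace every gap nucleotide in `seq` with `mask_char`."""
    gaps = set(gap_positions(len(seq), interval_codons, gap_codons))
    return "".join(mask_char if i in gaps else base for i, base in enumerate(seq))


def tag_predictions(pred_a, pred_b):
    """Tag each gap nucleotide: 'consensus' if both models agree, else 'contested'."""
    if len(pred_a) != len(pred_b):
        raise ValueError("predictions must cover the same gap positions")
    return ["consensus" if a == b else "contested" for a, b in zip(pred_a, pred_b)]


# With the default 3-codon interval, a 24 bp sequence gets two 3 bp gaps.
print(mask_gaps("ATGGCTAAACCCGGGTTTACGTAG"))  # ATGGCTAAANNNGGGTTTACGNNN
```

Passing `interval_codons=4`, `5`, or `6` would correspond to the other gap intervals mentioned in step 5, again assuming the interval is counted in codons.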
### Key Statistics
| Metric | Synthetic | Real (Train) | Real (Test) |
|--------|-----------|-------------|-------------|
| Mean mutation rate | 17.8% | 4.2% | 3.9% |
| Mean sequence length | ~450 bp | ~350 bp | ~350 bp |
| Consensus rate | 2.0% | N/A | N/A |
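The card does not define how the mutation rate is computed. One plausible reading, sketched here as an assumption, is the fraction of positions at which `parent` and `child` differ, averaged over pairs. This only works for equal-length pairs; pairs with insertions or deletions would need an alignment step first. Both function names are hypothetical.

```python
def mutation_rate(parent, child):
    """Fraction of positions where parent and child differ (equal length assumed)."""
    if len(parent) != len(child):
        raise ValueError("sequences must have equal length; align them first")
    if not parent:
        return 0.0
    diffs = sum(1 for p, c in zip(parent, child) if p != c)
    return diffs / len(parent)


def mean_mutation_rate(pairs):
    """Average per-pair mutation rate over an iterable of (parent, child) tuples."""
    rates = [mutation_rate(p, c) for p, c in pairs]
    if not rates:
        raise ValueError("no pairs given")
    return sum(rates) / len(rates)
```

For the truncated example pair in the Format section, `mutation_rate("ATGGCT", "ATAGCT")` gives 1/6, since one of six bases differs.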
## Domains
Sequences are sourced from NCBI across diverse prokaryotic (bacterial and archaeal) species for cross-domain HGT analysis.
## SAD Coefficient
The synthetic-to-real exposure ratio used in training:
- **SAD Coefficient = 4.89** (81,120 synthetic exposures / 16,585 real exposures)
- This ratio is considered too high; a coefficient of ~1.5 is recommended for future runs
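The arithmetic behind the coefficient, and the synthetic exposure budget that a target of 1.5 would imply at the same number of real exposures, can be checked with a short sketch. The function names are hypothetical, and holding the real exposure count fixed is an assumption.

```python
import math


def sad_coefficient(synthetic_exposures, real_exposures):
    """Ratio of synthetic to real training exposures."""
    if real_exposures <= 0:
        raise ValueError("real_exposures must be positive")
    return synthetic_exposures / real_exposures


def synthetic_budget(real_exposures, target_coefficient):
    """Largest synthetic exposure count that stays at or below the target ratio."""
    return math.floor(real_exposures * target_coefficient)


print(f"{sad_coefficient(81_120, 16_585):.2f}")  # 4.89
print(synthetic_budget(16_585, 1.5))             # 24877
```

The exposure counts equal 10 × 8,112 and 5 × 3,317, which would be consistent with 10 passes over `synthetic_expanded.jsonl` and 5 passes over `train.jsonl`, although the card does not state this.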
## Citation
If using this dataset, please also cite:
- [Omni-DNA](https://huggingface.co/zehui127/Omni-DNA-20M)
- [HyenaDNA](https://huggingface.co/LongSafari/hyenadna-tiny-1k-seqlen-hf)
Provider:
Nhoodie



