Nhoodie/omni-dna-sad-mutation-dataset

Source: Hugging Face | Updated: 2026-04-07 | Indexed: 2026-04-12
Download link: https://hf-mirror.com/datasets/Nhoodie/omni-dna-sad-mutation-dataset
Resource description:
---
license: apache-2.0
tags:
- dna
- genomics
- mutation
- synthetic
- sad
task_categories:
- text-generation
size_categories:
- 1K<n<10K
---

# Omni-DNA SAD Mutation Dataset

Synthetic and real DNA mutation pairs for training cross-domain HGT mutation prediction models.

## Files

| File | Pairs | Source |
|------|-------|--------|
| `synthetic_expanded.jsonl` | 8,112 | ICI dual-model generation (Omni + HyenaDNA consensus) |
| `train.jsonl` | 3,317 | Real NCBI sequences |
| `test.jsonl` | 826 | Real NCBI sequences (held-out) |

## Format

Each line is a JSON object:

```json
{"parent": "ATGGCT...", "child": "ATAGCT..."}
```

## Generation Method (Synthetic Data)

1. **Source**: 1,014 real DNA sequences from diverse species
2. **FDI (Focus-Doped Interleaving)**: Every **3 codons** (9 bp), a 1-codon (3 bp) gap is introduced
3. **Dual-model consensus**: Omni-DNA-20M and HyenaDNA tiny-1k independently predict gap nucleotides
4. **Consensus tagging**: Agreement = `consensus`, disagreement = `contested`
5. **8 generation passes** with different gap intervals (3-6) and seeds, then deduplicated

### Key Statistics

| Metric | Synthetic | Real (Train) | Real (Test) |
|--------|-----------|--------------|-------------|
| Mean mutation rate | 17.8% | 4.2% | 3.9% |
| Mean sequence length | ~450 bp | ~350 bp | ~350 bp |
| Consensus rate | 2.0% | N/A | N/A |

## Domains

Sequences sourced from NCBI across diverse prokaryotic and archaeal species for cross-domain HGT analysis.

## SAD Coefficient

The synthetic-to-real exposure ratio used in training:

- **SAD Coefficient = 4.89** (81,120 synthetic exposures / 16,585 real exposures)
- This is noted as too high; a coefficient of ~1.5 is recommended for future runs

## Citation

If using this dataset, please also cite:

- [Omni-DNA](https://huggingface.co/zehui127/Omni-DNA-20M)
- [HyenaDNA](https://huggingface.co/LongSafari/hyenadna-tiny-1k-seqlen-hf)
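## Example: Reading Pairs and Checking the Statistics

The following is a minimal sketch, not part of the original dataset card, of how the JSONL pairs described under Format might be read and how the reported figures could be sanity-checked. The helper names `read_pairs`, `mutation_rate`, and `sad_coefficient` are hypothetical. The card does not define how its mutation rate was computed, so the per-position mismatch fraction used here is an assumption (substitutions only, no alignment).

```python
import json
from typing import Iterable, Iterator, Tuple


def read_pairs(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (parent, child) tuples from JSONL lines, skipping blank lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        yield record["parent"], record["child"]


def mutation_rate(parent: str, child: str) -> float:
    """Fraction of mismatched positions over the shared length.

    Assumption: positions are compared one-to-one without alignment;
    the dataset card does not state its own definition.
    """
    n = min(len(parent), len(child))
    if n == 0:
        return 0.0
    diffs = sum(p != c for p, c in zip(parent, child))
    return diffs / n


def sad_coefficient(synthetic_exposures: int, real_exposures: int) -> float:
    """Synthetic-to-real exposure ratio, as described in the card."""
    return synthetic_exposures / real_exposures


if __name__ == "__main__":
    # A single illustrative record in the documented format.
    sample = ['{"parent": "ATGGCT", "child": "ATAGCT"}']
    for parent, child in read_pairs(sample):
        print(f"{mutation_rate(parent, child):.3f}")  # 0.167 (1 of 6 positions differs)
    print(f"{sad_coefficient(81_120, 16_585):.2f}")  # 4.89
```

To process one of the listed files, an open file handle can be passed directly, for example `read_pairs(open("train.jsonl"))`, since iterating a text file yields its lines.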
Provider: Nhoodie