
forkjoin-ai/bitwise-genomes

Source: Hugging Face. Updated 2026-03-25. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/forkjoin-ai/bitwise-genomes
Description:
---
license: cc-by-4.0
task_categories:
- feature-extraction
- text-classification
tags:
- biology
- genomics
- dna
- cancer
- bioinformatics
- binary-encoding
- knot-theory
pretty_name: Bitwise Genome Datasets
size_categories:
- 10K<n<100K
---

# Bitwise Genome Datasets

DNA sequences encoded as 2-bit binary with topological annotations. **4x smaller than FASTA. Searchable at 32 bases per CPU cycle.**

## Format

```
A = 00, C = 01, G = 10, T = 11
4 bases per byte. The DNA IS the binary.
```

Each `.bw` file contains:

- Magic header `0x4257` ("BW")
- Sequence length (4 bytes, big-endian)
- Header string (variable length)
- Packed 2-bit bases

## Tools

### `bw` -- DNA ripgrep

Install the [Bitwise](https://github.com/forkjoin-ai/bitwise) CLI to search these datasets:

```bash
cargo install --path .  # from the bitwise repo

# Search for a pattern across a gene
bw grep GGTGGCGTAGGC cancer-genes/fasta/KRAS.fasta

# Count mutations between reference and tumor
bw count reference.fasta tumor.fasta

# Compression stats
bw stats cancer-genes/fasta/BRCA1.fasta
```

Search speed: **90 million bases per second** on a single CPU core.

### Aeon FlowFrame Protocol

These datasets stream natively as [Aeon FlowFrames](https://github.com/forkjoin-ai/aeon-flux):

```
stream_id = chromosome (1-25)
sequence  = genomic position
flags     = FORK | FOLD | VENT (structure type)
payload   = 2-bit packed bases
```

Wire = storage = memory. No serialization boundary.

### helix.repair

Search these datasets live at **[helix.repair](https://helix.repair)** -- a DNA topology search engine powered by Bitwise encoding and 402 Lean theorems.
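The 2-bit mapping in the Format section, and the idea of counting mutations by comparing packed data, can be sketched in a few lines. This is an illustrative sketch only, not the Bitwise implementation: the function names `packBases`, `unpackBases`, and `mismatchCount` are hypothetical, and two details are assumed because the dataset card does not state them: the first base of each group of four sits in the most significant bits of the byte, and a short final byte is zero-padded. Whether `bw count` uses this exact XOR approach is also not confirmed here.

```typescript
// Sketch of the 2-bit encoding: A=00, C=01, G=10, T=11, four bases per byte.
// Assumptions (not stated by the dataset card): the first base of each group
// occupies the two most significant bits, and a short final byte is zero-padded.
const CODE: Record<string, number> = { A: 0b00, C: 0b01, G: 0b10, T: 0b11 };
const BASES = "ACGT";

// Pack an ACGT string into bytes (hypothetical helper, not the library API).
function packBases(seq: string): Uint8Array {
  const out = new Uint8Array(Math.ceil(seq.length / 4));
  for (let i = 0; i < seq.length; i++) {
    const code = CODE[seq[i].toUpperCase()];
    if (code === undefined) throw new Error(`unsupported base: ${seq[i]}`);
    out[i >> 2] |= code << (6 - 2 * (i % 4)); // first base in the high bits (assumed)
  }
  return out;
}

// Recover the first `length` bases from packed bytes.
function unpackBases(packed: Uint8Array, length: number): string {
  let seq = "";
  for (let i = 0; i < length; i++) {
    seq += BASES[(packed[i >> 2] >> (6 - 2 * (i % 4))) & 0b11];
  }
  return seq;
}

// Count differing bases between two equal-length packed sequences using XOR.
function mismatchCount(a: Uint8Array, b: Uint8Array): number {
  let count = 0;
  for (let i = 0; i < Math.min(a.length, b.length); i++) {
    const diff = a[i] ^ b[i];
    // Fold each 2-bit slot onto its low bit: set if either bit differs.
    let slots = (diff | (diff >> 1)) & 0b01010101;
    while (slots !== 0) {
      count += slots & 1;
      slots >>= 1;
    }
  }
  return count;
}
```

Under these assumptions, `packBases("ACGT")` yields the single byte `0x1B`, and `mismatchCount(packBases("ATGCTAGC"), packBases("ATGATAGC"))` returns 1 because only the fourth base differs.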
## Datasets

### cancer-genes/

20 clinically important cancer genes from NCBI RefSeq:

| Gene | Accession | Bases | Bitwise Size | Function |
|------|-----------|-------|-------------|----------|
| TP53 | NM_000546.6 | 2,512 | 628 B | Tumor suppressor ("guardian of the genome") |
| BRCA1 | NM_007294.4 | 7,088 | 1,772 B | DNA repair (breast/ovarian cancer) |
| BRCA2 | NM_000059.4 | 11,954 | 2,989 B | DNA repair (breast/ovarian/prostate) |
| KRAS | NM_004985.5 | 5,306 | 1,327 B | GTPase (pancreatic/lung/colorectal) |
| EGFR | NM_005228.5 | 9,905 | 2,477 B | Growth factor receptor (lung cancer) |
| BRAF | NM_004333.6 | 6,459 | 1,615 B | Kinase (melanoma/colorectal) |
| PIK3CA | NM_006218.4 | 9,259 | 2,315 B | PI3K catalytic (breast/endometrial) |
| PTEN | NM_000314.8 | 8,515 | 2,129 B | Phosphatase (glioblastoma/prostate) |
| APC | NM_000038.6 | 10,704 | 2,676 B | Wnt regulator (colorectal) |
| RB1 | NM_000321.3 | 4,768 | 1,192 B | Retinoblastoma protein |
| MYC | NM_002467.6 | 3,721 | 931 B | Transcription factor (many cancers) |
| IDH1 | NM_005896.4 | 2,318 | 580 B | Isocitrate dehydrogenase (glioma) |
| VHL | NM_000551.4 | 4,414 | 1,104 B | Von Hippel-Lindau (renal cancer) |
| ALK | NM_004304.5 | 6,240 | 1,560 B | Receptor tyrosine kinase (lung/lymphoma) |
| HER2 | NM_004448.4 | 4,557 | 1,140 B | ERBB2 (breast cancer) |
| ATM | NM_000051.4 | 12,915 | 3,229 B | DNA damage response kinase |
| MGMT | NM_002412.5 | 4,678 | 1,170 B | DNA methyltransferase (glioblastoma) |
| TERT | NM_198253.3 | 4,039 | 1,010 B | Telomerase (many cancers) |
| JAK2 | NM_004972.4 | 7,023 | 1,756 B | Janus kinase (myeloproliferative) |
| FLT3 | NM_004119.3 | 3,826 | 957 B | FMS-like tyrosine kinase (AML) |

## Usage

### With `bw` CLI

```bash
# Install
cargo install --path .

# Search for a mutation hotspot
bw grep GGTGGCGTAGGC datasets/cancer-genes/fasta/KRAS.fasta

# Pack FASTA to Bitwise binary
bw pack datasets/cancer-genes/fasta/TP53.fasta > TP53.bw

# Count mutations between sequences
bw count ref.fasta tumor.fasta

# Compression stats
bw stats datasets/cancer-genes/fasta/BRCA1.fasta
```

### With WASM (JavaScript/TypeScript)

```typescript
import { pack_bases, search_packed, mutation_count } from 'bitwise';

const packed = pack_bases(new TextEncoder().encode('ATGCTAGCATGC'));
const needle = pack_bases(new TextEncoder().encode('TAGC'));
const matches = search_packed(packed, 12, needle, 4);
// matches = [4] -- found TAGC at position 4
```

## Theory

Every dataset is backed by mechanized Lean 4 theorems (zero sorry):

- `dna_is_folded_knot`: DNA IS a folded knot (PsycheGrindExtended Pass 17)
- `two_bit_four_per_byte`: 4 bases per byte by construction (Pass 39)
- `word_parallel_speedup`: 32x search speedup (Pass 39)
- `xor_detects_mutations`: XOR = mutation detection (Pass 39)
- `noncoding_is_void`: non-coding DNA IS the void boundary (Pass 43)
- `junk_not_junk`: "junk" DNA carries MORE information (Pass 43)
- `sigma_monotone_with_age`: σ IS a molecular clock (GenomicVoidArchaeology)
- `unwinding_theorem`: history reconstructible from void (GenomicVoidArchaeology)

402 theorems total. The math proves the encoding. The encoding enables the search. The search reveals the biology.

## Source

All sequences from NCBI RefSeq (public domain). Fetched via E-utilities API. Reproducible via `scripts/fetch-and-convert.sh`.

## Related

- [helix.repair](https://helix.repair) -- DNA topology search engine
- [Aunt Sandy](https://github.com/forkjoin-ai/aunt-sandy) -- Cancer genomics via Buleyean probability
- [Gnosis](https://github.com/forkjoin-ai/gnosis) -- Formal verification engine (402 Lean theorems)

## License

- Data: CC-BY-4.0 (sequences are public domain from NCBI)
- Code: MPL-2.0

## hg38 -- Full Human Reference Genome

**2.9GB FASTA → 736MB Bitwise binary. 25 chromosomes.**

| Chromosome | Bases | Bitwise Size |
|------------|-------|-------------|
| chr1 | 248,956,422 | 59 MB |
| chr2 | 242,193,529 | 58 MB |
| chr3 | 198,295,559 | 47 MB |
| chr4 | 190,214,555 | 45 MB |
| chr5 | 181,538,259 | 43 MB |
| chr6 | 170,805,979 | 41 MB |
| chr7 | 159,345,973 | 38 MB |
| chr8 | 145,138,636 | 35 MB |
| chr9 | 138,394,717 | 33 MB |
| chr10 | 133,797,422 | 32 MB |
| chr11 | 135,086,622 | 32 MB |
| chr12 | 133,275,309 | 32 MB |
| chr13 | 114,364,328 | 27 MB |
| chr14 | 107,043,718 | 26 MB |
| chr15 | 101,991,189 | 24 MB |
| chr16 | 90,338,345 | 22 MB |
| chr17 | 83,257,441 | 20 MB |
| chr18 | 80,373,285 | 19 MB |
| chr19 | 58,617,616 | 14 MB |
| chr20 | 64,444,167 | 15 MB |
| chr21 | 46,709,983 | 11 MB |
| chr22 | 50,818,468 | 12 MB |
| chrX | 156,040,895 | 37 MB |
| chrY | 57,227,415 | 14 MB |
| chrM | 16,569 | 4.1 KB |

**Too large for GitHub.** Reproduce locally:

```bash
# Download and convert (requires ~4GB disk)
bash scripts/fetch-and-convert-hg38.sh

# Or use Cloud Build
gcloud builds submit --config=cloudbuild-whole-genome.yaml --substitutions=_ASSEMBLY=hg38 .
```

Search speed: **90 million bases per second** on a single CPU core.
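The "Bitwise Size" columns follow from the 4-bases-per-byte packing: the payload needs the base count divided by four, rounded up. The sketch below checks that arithmetic against two table rows. It rests on assumptions not stated in the card: the listed sizes cover the packed payload only (no file header), and "MB" in the hg38 table is read as 2^20 bytes. The helper names `packedBytes` and `toMiB` are hypothetical.

```typescript
// Bytes needed for `bases` nucleotides at 2 bits each (payload only, no file header).
function packedBytes(bases: number): number {
  return Math.ceil(bases / 4);
}

// Express a byte count in units of 2^20 bytes, rounded to the nearest integer.
// Reading the table's "MB" as 2^20 bytes is an assumption.
function toMiB(bytes: number): number {
  return Math.round(bytes / (1024 * 1024));
}

const tp53Bytes = packedBytes(2512);      // TP53 row lists 628 B
const chr1Bytes = packedBytes(248956422); // 62,239,106 bytes
console.log(tp53Bytes);                   // 628
console.log(chr1Bytes, toMiB(chr1Bytes)); // 62239106 59 (chr1 row lists 59 MB)
```

The same calculation reproduces the other cancer-gene rows, for example 11,954 bases for BRCA2 gives 2,989 bytes.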
Provided by:
forkjoin-ai