metaXu264/crispr-cas-atlas-generator
Hugging Face · updated 2026-01-23 · indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/metaXu264/crispr-cas-atlas-generator
---
license: cc-by-nc-4.0
task_categories:
- text-generation
language:
- dna
pretty_name: CRISPR-Cas Atlas (GENERator-ready)
---
# CRISPR-Cas Atlas – GENERator-ready Dataset
This dataset is a derived, preprocessed version of the CRISPR-Cas Atlas, formatted for causal language model fine-tuning with GENERator.
## Source
Original dataset:
**CRISPR-Cas Atlas v1.0**
Ruffolo et al., *Design of highly functional genome editors by modeling the universe of CRISPR-Cas sequences*, bioRxiv (2024)
https://www.biorxiv.org/content/10.1101/2024.04.22.590591v1
## Processing
Each CRISPR-Cas operon was converted into a single DNA sequence by concatenating:
1. CRISPR repeat
2. CRISPR spacers
3. tracrRNA (RNA converted to DNA via U→T)
All sequences:
- Contain only A/C/G/T
- Are left-padded with `A` to ensure the sequence length is a multiple of 6
- Are compatible with GENERator’s 6-mer tokenizer
- Are suitable for causal language model fine-tuning
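The processing steps above can be sketched roughly as follows. This is a minimal illustration, not the script that produced the dataset: the function names (`rna_to_dna`, `left_pad`, `build_sequence`), the argument layout, joining spacers in order with no separator, and raising an error on non-A/C/G/T characters are all assumptions made for the example.

```python
from typing import Iterable

K = 6  # k-mer size of the tokenizer, as stated in this card


def rna_to_dna(seq: str) -> str:
    """Convert an RNA string to DNA by replacing U with T."""
    return seq.upper().replace("U", "T")


def left_pad(seq: str, k: int = K, pad: str = "A") -> str:
    """Left-pad with `pad` so that len(seq) is a multiple of k."""
    remainder = len(seq) % k
    if remainder == 0:
        return seq
    return pad * (k - remainder) + seq


def build_sequence(repeat: str, spacers: Iterable[str], tracr_rna: str) -> str:
    """Concatenate repeat, spacers, and tracrRNA (as DNA), then left-pad.

    Assumption: parts are joined in this order without separators.
    """
    parts = [repeat.upper(), *(s.upper() for s in spacers), rna_to_dna(tracr_rna)]
    seq = "".join(parts)
    # Assumption: anything outside A/C/G/T is treated as an error here;
    # the original pipeline may filter or handle such records differently.
    unexpected = set(seq) - set("ACGT")
    if unexpected:
        raise ValueError(f"non-ACGT characters found: {sorted(unexpected)}")
    return left_pad(seq)
```

For instance, a 7-nucleotide input would receive five leading `A` characters to reach length 12, so the padding never alters the 3' end of the record.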
## Dataset Format
Each line in the JSONL file contains a single field:
```json
{ "sequence": "ACGTACGT..." }
```
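A file in this format can be read and sanity-checked with the standard library alone. The sketch below is illustrative: the helper names are hypothetical, and `to_kmers` is a plain non-overlapping split used to show why the length must be a multiple of 6, not GENERator's actual tokenizer.

```python
import json
from typing import Iterable, Iterator, List


def iter_sequences(lines: Iterable[str]) -> Iterator[str]:
    """Yield the "sequence" field from each non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        yield json.loads(line)["sequence"]


def is_valid(seq: str, k: int = 6) -> bool:
    """Check the properties this card states: A/C/G/T only, length divisible by k."""
    return len(seq) > 0 and len(seq) % k == 0 and set(seq) <= set("ACGT")


def to_kmers(seq: str, k: int = 6) -> List[str]:
    """Split a sequence into consecutive, non-overlapping k-mers."""
    return [seq[i:i + k] for i in range(0, len(seq), k)]
```

Passing an open file handle to `iter_sequences` works because file objects iterate line by line; the actual file name depends on how the dataset was downloaded.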
## License
This dataset is released under the Creative Commons Attribution–NonCommercial 4.0 (CC BY-NC 4.0) license.
Commercial use is strictly prohibited.
## Citation
```bibtex
@article{profluent2024opencrispr,
title={Design of highly functional genome editors by modeling the universe of CRISPR-Cas sequences},
author={Ruffolo, Jeffrey A and Nayfach, Stephen and Gallagher, Joseph and Bhatnagar, Aadyot and Beazer, Joel and Hussain, Riffat and Russ, Jordan and Yip, Jennifer and Hill, Emily and Pacesa, Martin and others},
journal={bioRxiv},
pages={2024--04},
year={2024},
publisher={Cold Spring Harbor Laboratory}
}
```
Provider: metaXu264



