
metaXu264/crispr-cas-atlas-generator

Source: Hugging Face. Updated 2026-01-23. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/metaXu264/crispr-cas-atlas-generator
Resource description:
---
license: cc-by-nc-4.0
task_categories:
- text-generation
language:
- dna
pretty_name: CRISPR-Cas Atlas (GENERator-ready)
---

# CRISPR-Cas Atlas – GENERator-ready Dataset

This dataset is a derived, preprocessed version of the CRISPR-Cas Atlas, formatted for causal language model fine-tuning with GENERator.

## Source

Original dataset: **CRISPR-Cas Atlas v1.0**

Ruffolo et al., *Design of highly functional genome editors by modeling the universe of CRISPR-Cas sequences*, bioRxiv (2024)
https://www.biorxiv.org/content/10.1101/2024.04.22.590591v1

## Processing

Each CRISPR-Cas operon was converted into a single DNA sequence by concatenating:

1. CRISPR repeat
2. CRISPR spacers
3. tracrRNA (RNA converted to DNA via U→T)

All sequences:

- Contain only A/C/G/T
- Are left-padded with `A` to ensure the sequence length is a multiple of 6
- Are compatible with GENERator’s 6-mer tokenizer
- Are suitable for causal language model fine-tuning

## Dataset Format

Each line in the JSONL file contains a single field:

```json
{ "sequence": "ACGTACGT..." }
```

## License

This dataset is released under the Creative Commons Attribution–NonCommercial 4.0 (CC BY-NC 4.0) license. Commercial use is strictly prohibited.

## Citation

```bibtex
@article{profluent2024opencrispr,
  title={Design of highly functional genome editors by modeling the universe of CRISPR-Cas sequences},
  author={Ruffolo, Jeffrey A and Nayfach, Stephen and Gallagher, Joseph and Bhatnagar, Aadyot and Beazer, Joel and Hussain, Riffat and Russ, Jordan and Yip, Jennifer and Hill, Emily and Pacesa, Martin and others},
  journal={bioRxiv},
  pages={2024--04},
  year={2024},
  publisher={Cold Spring Harbor Laboratory}
}
```
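
The processing steps in the card (concatenate repeat, spacers, and tracrRNA; convert U→T; left-pad with `A` to a multiple of 6; write one `sequence` field per JSONL line) can be sketched in Python as follows. This is only an illustration under stated assumptions, not the dataset author's actual pipeline: the function names, the input layout (one repeat, a list of spacers, one tracrRNA per operon), and the example sequences are all hypothetical, and the card does not say how multiple spacers are ordered or joined.

```python
import json

K = 6  # 6-mer tokenizer width stated in the dataset card


def rna_to_dna(seq: str) -> str:
    """Convert an RNA string to DNA by replacing U with T."""
    return seq.upper().replace("U", "T")


def left_pad(seq: str, k: int = K, pad: str = "A") -> str:
    """Left-pad with `pad` so that len(seq) is a multiple of k."""
    remainder = len(seq) % k
    if remainder == 0:
        return seq
    return pad * (k - remainder) + seq


def build_sequence(repeat: str, spacers: list[str], tracr_rna: str) -> str:
    """Concatenate repeat + spacers + tracrRNA (as DNA), then pad.

    Assumption: spacers are joined in the order given, with no
    separator. The card does not specify this detail.
    """
    seq = repeat.upper() + "".join(s.upper() for s in spacers) + rna_to_dna(tracr_rna)
    invalid = set(seq) - set("ACGT")
    if invalid:
        raise ValueError(f"unexpected characters: {sorted(invalid)}")
    return left_pad(seq)


def to_jsonl_line(seq: str) -> str:
    """Serialize one record in the card's format: {"sequence": "..."}."""
    return json.dumps({"sequence": seq})


def split_kmers(seq: str, k: int = K) -> list[str]:
    """Split a padded sequence into non-overlapping k-mers."""
    return [seq[i:i + k] for i in range(0, len(seq), k)]


# Hypothetical toy operon (made-up sequences, for illustration only).
example = build_sequence(
    repeat="GTTTTAGAGC",
    spacers=["ACGTAC", "GGA"],
    tracr_rna="AAGGCUAGUCCG",
)
line = to_jsonl_line(example)
```

The toy input has 31 bases before padding, so five `A` characters are prepended to reach 36, which splits cleanly into six 6-mers. Padding on the left rather than the right keeps the 3' end of the concatenated sequence aligned to a token boundary.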

Provider:
metaXu264