
LucaGroup/LucaVirus-OpenVirus-Gene

Source: Hugging Face · Updated: 2026-01-01 · Indexed: 2026-03-29
Download link:
https://hf-mirror.com/datasets/LucaGroup/LucaVirus-OpenVirus-Gene
Resource description:
---
language:
- en
license: mit
tags:
- Biology
- Bioinformatics
- Virus
- Genomics
- Proteomics
- Nucleotide
- Protein
- Foundation Model
- LucaVirus
- LucaVirus-Gene
- AI4Bio
- AI4Science
- Nucleotide-Protein
task_categories:
- feature-extraction
size_categories:
- 10M<n<100M
---

# Dataset Card for LucaVirus-OpenVirus-Gene

## 1. Dataset Summary

**LucaVirus-OpenVirus-Gene** is a large-scale genomic dataset consisting exclusively of viral nucleotide sequences. It is a specialized subset of the **OpenVirus** corpus, curated specifically for pre-training the **LucaVirus-Gene** foundation model.

By focusing purely on viral genomes, this dataset provides a high-density corpus of **10.4 million** sequences, enabling models to capture the evolutionary patterns, regulatory motifs, and genomic architectures of DNA and RNA viruses.

## 2. Dataset Statistics

The dataset contains only nucleotide sequences (genomes, genes, and fragments):

| Feature | Count / Description |
| :--- | :--- |
| **Total Sequences** | 10.4 million |
| **Sequence Type** | Nucleotide (DNA/RNA) |
| **`obj_type` Identifier** | `gene` (exclusive) |
| **Primary Use** | Pre-training for LucaVirus-Gene |

## 3. Data Structure & Format

### 3.1 File Organization

The dataset is provided as a gzip-compressed **`.tar`** archive (`.tar.gz`). Upon extraction, the data is partitioned into three standard machine-learning subsets:

```text
LucaVirus-OpenVirus-Gene/dataset/v1.0/
├── train/    # Training set (primary corpus for genomic pre-training)
├── dev/      # Validation set (for model selection and tuning)
└── test/     # Test set (for final evaluation and benchmarking)
```

Each directory (`train`, `dev`, `test`) contains one or more **CSV files** with headers.
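The split layout described above can be checked after extraction with a short sketch like the one below. The helper name `list_split_files` and the extraction root are illustrative assumptions, not part of the dataset's own tooling; only the `train`/`dev`/`test` directory names and the `.csv` extension come from the card.

```python
from pathlib import Path

# Split names taken from the directory layout described in the card.
SPLITS = ("train", "dev", "test")


def list_split_files(root):
    """Map each split name to the sorted CSV paths found under root/<split>/.

    Hypothetical helper for illustration; not shipped with the dataset.
    """
    root = Path(root)
    files = {}
    for split in SPLITS:
        split_dir = root / split
        # A missing split directory yields an empty list rather than an error.
        files[split] = sorted(split_dir.glob("*.csv")) if split_dir.is_dir() else []
    return files


if __name__ == "__main__":
    # Assumed extraction location; adjust to wherever the archive was unpacked.
    root = "./LucaVirus-OpenVirus-Gene/dataset/v1.0"
    for split, paths in list_split_files(root).items():
        print(f"{split}: {len(paths)} CSV file(s)")
```

Returning an empty list for a missing split keeps the check usable on partial downloads, for example when only `train/` has been extracted.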
### 3.2 CSV Schema

All CSV files follow a consistent four-column schema:

| Column Name | Description | Details |
| :--- | :--- | :--- |
| **`obj_id`** | Sample ID | Unique identifier for each viral sequence. |
| **`obj_type`** | Sequence Type | Set to `gene` for all entries in this dataset (nucleotide). |
| **`obj_seq`** | Sequence Content | Raw nucleotide string (A, T(U), C, G, N). |
| **`obj_label`** | Label | Metadata, taxonomic information, or functional labels associated with the genome (annotation, biological knowledge). |

## 4. Intended Use

- **Genomic Foundation Modeling**: Building models such as **LucaVirus-Gene** that specialize in the "language of genomes."
- **Viral Evolution Studies**: Analyzing conserved nucleotide patterns across divergent viral lineages.
- **Regulatory Element Discovery**: Identifying viral gene boundaries, promoters, and other non-coding functional motifs.

## 5. Usage Example

You can extract the archive and load the genomic data using the following Python snippet:

```python
import os
import tarfile

import pandas as pd

# 1. Extract the genomic dataset (outer archive, then the nested dataset archive)
with tarfile.open("LucaVirus-OpenVirus-Gene.tar.gz", "r:gz") as tar:
    tar.extractall(path="./LucaVirus-OpenVirus-Gene")
with tarfile.open("LucaVirus-OpenVirus-Gene/dataset.tar.gz", "r:gz") as tar:
    tar.extractall(path="./LucaVirus-OpenVirus-Gene/dataset")

# 2. Load a sample from the training set
train_path = "./LucaVirus-OpenVirus-Gene/dataset/v1.0/train"
# Sort so that "the first CSV file" is the same on every run
csv_files = sorted(f for f in os.listdir(train_path) if f.endswith(".csv"))

if csv_files:
    # Load the first CSV file
    df = pd.read_csv(os.path.join(train_path, csv_files[0]))
    # Verify the sequence type (the schema sets it to `gene` for every row)
    print(df["obj_type"].unique())
    print(f"Loaded {len(df)} genomic sequences.")
    print(df[["obj_id", "obj_seq"]].head())
```

## 6. Related Resources

This dataset is a core component of the **LucaGroup** biological modeling ecosystem.
- **Full Corpus (Gene + Prot)**: [LucaVirus-OpenVirus-Gene-Prot](https://huggingface.co/datasets/LucaGroup/LucaVirus-OpenVirus-Gene-Prot)
- **Protein Subset**: [LucaVirus-OpenVirus-Prot](https://huggingface.co/datasets/LucaGroup/LucaVirus-OpenVirus-Prot)
- **Models**: Visit the [LucaVirus Collection](https://huggingface.co/collections/LucaGroup/lucavirus).

## 7. Citation

If you use this dataset in your research, please cite:

```bibtex
@article{lucavirus2025,
  title={Predicting the Evolutionary and Functional Landscapes of Viruses with a Unified Nucleotide-Protein Language Model: LucaVirus.},
  author={Pan, Yuan-Fei* and He, Yong*. et al.},
  journal={bioRxiv},
  year={2025},
  url={https://www.biorxiv.org/content/early/2025/06/20/2025.06.14.659722}
}
```

## 8. License

This dataset is released under the **MIT License**.

## 9. Contact

*For further information, visit the [LucaGroup GitHub](https://github.com/LucaOne), email Yong He (sanyuan.hy@alibaba-inc.com, heyongcsat@gmail.com), or contact the team via the Hugging Face organization profile.*

Provider:
LucaGroup