LucaGroup/LucaVirus-OpenVirus-Gene

Name: LucaGroup/LucaVirus-OpenVirus-Gene
Creator: LucaGroup
Published: 2026-01-01 06:44:06
License: No description provided

Source: Hugging Face. Updated 2026-01-01; indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/LucaGroup/LucaVirus-OpenVirus-Gene


Resource description:

---
language:
- en
license: mit
tags:
- Biology
- Bioinformatics
- Virus
- Genomics
- Proteomics
- Nucleotide
- Protein
- Foundation Model
- LucaVirus
- LucaVirus-Gene
- AI4Bio
- AI4Science
- Nucleotide-Protein
task_categories:
- feature-extraction
size_categories:
- 10M<n<100M
---

# Dataset Card for LucaVirus-OpenVirus-Gene

## 1. Dataset Summary

**LucaVirus-OpenVirus-Gene** is a large-scale genomic dataset consisting exclusively of viral nucleotide sequences. It is a specialized subset of the **OpenVirus** corpus, curated specifically for the pre-training of the **LucaVirus-Gene** foundation model.

By focusing purely on viral genomes, this dataset provides a high-density corpus of **10.4 million** sequences, enabling models to capture the intricate evolutionary patterns, regulatory motifs, and genomic architectures of DNA and RNA viruses.

## 2. Dataset Statistics

The dataset focuses solely on nucleotide sequences (genomes, genes, and fragments):

| Feature | Count / Description |
| :--- | :--- |
| **Total Sequences** | 10.4 Million |
| **Sequence Type** | Nucleotide (DNA/RNA) |
| **`obj_type` Identifier** | `gene` (Exclusive) |
| **Primary Use** | Pre-training for LucaVirus-Gene |

## 3. Data Structure & Format

### 3.1 File Organization

The dataset is provided as a compressed **`.tar`** archive. Upon extraction, the data is partitioned into three standard machine-learning subsets:

```text
LucaVirus-OpenVirus-Gene/dataset/v1.0/
├── train/   # Training set (primary corpus for genomic pre-training)
├── dev/     # Validation set (for model selection and tuning)
└── test/    # Test set (for final evaluation and benchmarking)
```

Each directory (`train`, `dev`, `test`) contains one or more **CSV files** with headers.
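As a minimal sketch of working with the split layout described above, the helper below collects the CSV files found in each split directory. The function name `list_split_files` and the root path passed to it are illustrative assumptions, not part of any LucaGroup tooling; point the root at wherever the archive was actually extracted.

```python
from pathlib import Path

# Split directory names as given in the file organization above.
SPLITS = ("train", "dev", "test")


def list_split_files(root):
    """Map each split name to a sorted list of the CSV paths inside it.

    A split directory that does not exist maps to an empty list.
    """
    root = Path(root)
    result = {}
    for split in SPLITS:
        split_dir = root / split
        if split_dir.is_dir():
            result[split] = sorted(split_dir.glob("*.csv"))
        else:
            result[split] = []
    return result


if __name__ == "__main__":
    # Assumed extraction location; adjust to your local path.
    files = list_split_files("./LucaVirus-OpenVirus-Gene/dataset/v1.0")
    for split, paths in files.items():
        print(f"{split}: {len(paths)} CSV file(s)")
```

Sorting the paths keeps the file order stable across runs, which helps when a pipeline resumes from a particular file.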
### 3.2 CSV Schema

All CSV files follow a consistent four-column schema:

| Column Name | Description | Details |
| :--- | :--- | :--- |
| **`obj_id`** | Sample ID | Unique identifier for each viral sequence. |
| **`obj_type`** | Sequence Type | Set to `gene` for all entries in this dataset (Nucleotide). |
| **`obj_seq`** | Sequence Content | Raw nucleotide string (A, T(U), C, G, N). |
| **`obj_label`** | Label | Metadata, taxonomic info, or functional labels associated with the genome (Annotation, Bio Knowledge). |

## 4. Intended Use

- **Genomic Foundation Modeling**: Building models like **LucaVirus-Gene** that specialize in the "language of genomes."
- **Viral Evolution Studies**: Analyzing conserved nucleotide patterns across divergent viral lineages.
- **Regulatory Element Discovery**: Identifying viral gene boundaries, promoters, and other non-coding functional motifs.

## 5. Usage Example

You can extract the archive and load the genomic data using the following Python snippet:

```python
import os
import tarfile

import pandas as pd

# 1. Extract the outer archive, then the nested dataset archive
with tarfile.open("LucaVirus-OpenVirus-Gene.tar.gz", "r:gz") as tar:
    tar.extractall(path="./LucaVirus-OpenVirus-Gene")
with tarfile.open("LucaVirus-OpenVirus-Gene/dataset.tar.gz", "r:gz") as tar:
    tar.extractall(path="./LucaVirus-OpenVirus-Gene/dataset")

# 2. Load a sample from the training set
train_path = "./LucaVirus-OpenVirus-Gene/dataset/v1.0/train"
# Sort so that "the first CSV file" is the same on every run
csv_files = sorted(f for f in os.listdir(train_path) if f.endswith('.csv'))

if csv_files:
    # Load the first CSV file
    df = pd.read_csv(os.path.join(train_path, csv_files[0]))
    # Report the row count and preview the IDs and sequences
    print(f"Loaded {len(df)} genomic sequences.")
    print(df[['obj_id', 'obj_seq']].head())
```

## 6. Related Resources

This dataset is a core component of the **LucaGroup** biological modeling ecosystem.
- **Full Corpus (Gene + Prot)**: [LucaVirus-OpenVirus-Gene-Prot](https://huggingface.co/datasets/LucaGroup/LucaVirus-OpenVirus-Gene-Prot)
- **Protein Subset**: [LucaVirus-OpenVirus-Prot](https://huggingface.co/datasets/LucaGroup/LucaVirus-OpenVirus-Prot)
- **Models**: Visit the [LucaVirus Collection](https://huggingface.co/collections/LucaGroup/lucavirus).

## 7. Citation

If you use this dataset in your research, please cite:

```bibtex
@article{lucavirus2025,
  title={Predicting the Evolutionary and Functional Landscapes of Viruses with a Unified Nucleotide-Protein Language Model: LucaVirus.},
  author={Pan, Yuan-Fei* and He, Yong*. et al.},
  journal={bioRxiv},
  year={2025},
  url={https://www.biorxiv.org/content/early/2025/06/20/2025.06.14.659722}
}
```

## 8. License

This dataset is released under the **MIT License**.

## 9. Contact

*For further information, please visit the [LucaGroup GitHub](https://github.com/LucaOne), email Yong He (sanyuan.hy@alibaba-inc.com, heyongcsat@gmail.com), or contact the team via the Hugging Face organization profile.*

