LucaGroup/LucaVirus-OpenVirus-Gene
Source: Hugging Face. Updated 2026-01-01; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/LucaGroup/LucaVirus-OpenVirus-Gene
---
language:
- en
license: mit
tags:
- Biology
- Bioinformatics
- Virus
- Genomics
- Proteomics
- Nucleotide
- Protein
- Foundation Model
- LucaVirus
- LucaVirus-Gene
- AI4Bio
- AI4Science
- Nucleotide-Protein
task_categories:
- feature-extraction
size_categories:
- 10M<n<100M
---
# Dataset Card for LucaVirus-OpenVirus-Gene
## 1. Dataset Summary
**LucaVirus-OpenVirus-Gene** is a large-scale genomic dataset consisting exclusively of viral nucleotide sequences. It is a specialized subset of the **OpenVirus** corpus, curated specifically for the pre-training of the **LucaVirus-Gene** foundation model.
By focusing purely on viral genomes, this dataset provides a high-density corpus of **10.4 million** sequences, enabling models to capture the intricate evolutionary patterns, regulatory motifs, and genomic architectures of DNA and RNA viruses.
## 2. Dataset Statistics
The dataset focuses solely on nucleotide sequences (genomes, genes, and fragments):
| Feature | Count / Description |
| :--- | :--- |
| **Total Sequences** | 10.4 Million |
| **Sequence Type** | Nucleotide (DNA/RNA) |
| **`obj_type` Identifier** | `gene` (Exclusive) |
| **Primary Use** | Pre-training for LucaVirus-Gene |
## 3. Data Structure & Format
### 3.1 File Organization
The dataset is provided as a compressed tar archive (**`.tar.gz`**, as used in the usage example in Section 5). Upon extraction, the data is partitioned into three standard machine-learning subsets:
```text
LucaVirus-OpenVirus-Gene/dataset/v1.0/
├── train/ # Training set (primary corpus for genomic pre-training)
├── dev/ # Validation set (for model selection and tuning)
└── test/ # Test set (for final evaluation and benchmarking)
```
Each directory (`train`, `dev`, `test`) contains one or more **CSV files** with headers.
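To check that an extracted copy matches this layout, the following minimal sketch lists the CSV files found in each split. The function name `list_split_files` and the `SPLITS` constant are illustrative and not part of the dataset; the root path in the example call assumes the extraction location used in Section 5.

```python
from pathlib import Path

# Split directory names as described in the layout above.
SPLITS = ("train", "dev", "test")


def list_split_files(root):
    """Map each split name to the sorted CSV paths found directly inside it."""
    root = Path(root)
    files = {}
    for split in SPLITS:
        split_dir = root / split
        # A missing split directory yields an empty list instead of an error.
        files[split] = sorted(split_dir.glob("*.csv")) if split_dir.is_dir() else []
    return files


if __name__ == "__main__":
    # Assumed extraction root; adjust to wherever the archive was unpacked.
    found = list_split_files("./LucaVirus-OpenVirus-Gene/dataset/v1.0")
    for split, paths in found.items():
        print(f"{split}: {len(paths)} CSV file(s)")
```

An empty list for a split usually means the archive was extracted to a different root than the one passed in.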
### 3.2 CSV Schema
All CSV files follow a consistent four-column schema:
| Column Name | Description | Details |
| :--- | :--- |:-------------------------------------------------------------------------------------------------------|
| **`obj_id`** | Sample ID | Unique identifier for each viral sequence. |
| **`obj_type`** | Sequence Type | Set to `gene` for all entries in this dataset (Nucleotide). |
| **`obj_seq`** | Sequence Content | Raw nucleotide string (A, T(U), C, G, N). |
| **`obj_label`** | Label | Metadata, taxonomic info, or functional labels associated with the genome (Annotation, Bio Knowledge). |
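The schema above can be spot-checked with a short script. This is a sketch under stated assumptions, not an official validator: it uses only the standard library, treats column order as unimportant, accepts lowercase bases, and flags any character outside the A, T(U), C, G, N alphabet listed in the table. The names `check_csv`, `EXPECTED_COLUMNS`, and `ALLOWED_BASES` are hypothetical.

```python
import csv

EXPECTED_COLUMNS = {"obj_id", "obj_type", "obj_seq", "obj_label"}
ALLOWED_BASES = set("ATUCGN")  # alphabet listed in the schema table

# Long viral genomes can exceed the csv module's default field size limit.
csv.field_size_limit(2**31 - 1)


def check_csv(path, max_rows=1000):
    """Return a list of problems found in the first max_rows data rows."""
    problems = []
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        if set(reader.fieldnames or []) != EXPECTED_COLUMNS:
            return [f"unexpected header: {reader.fieldnames}"]
        for row_number, row in enumerate(reader, start=1):
            if row_number > max_rows:
                break
            if row["obj_type"] != "gene":
                problems.append(f"row {row_number}: obj_type is {row['obj_type']!r}")
            # Short rows leave missing fields as None, so fall back to "".
            seq = row.get("obj_seq") or ""
            extra = set(seq.upper()) - ALLOWED_BASES
            if extra:
                problems.append(f"row {row_number}: unexpected characters {sorted(extra)}")
    return problems
```

An empty return value means the sampled rows match the documented schema. If the real files contain other IUPAC ambiguity codes, they will be reported here, which may or may not indicate an actual problem.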
## 4. Intended Use
- **Genomic Foundation Modeling**: Building models like **LucaVirus-Gene** that specialize in the "language of genomes."
- **Viral Evolution Studies**: Analyzing conserved nucleotide patterns across divergent viral lineages.
- **Regulatory Element Discovery**: Identifying viral gene boundaries, promoters, and other non-coding functional motifs.
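As a rough illustration of the pattern-analysis use case, the sketch below counts overlapping k-mers in an `obj_seq` string and finds k-mers shared by a group of sequences. It is a simple baseline, not the method used by LucaVirus-Gene; shared k-mers are only a crude proxy for conservation. The function names and the default `k=6` are assumptions made for the example.

```python
from collections import Counter


def kmer_counts(seq, k=6):
    """Count overlapping k-mers, skipping any window that contains N."""
    seq = seq.upper().replace("U", "T")  # treat RNA and DNA spellings alike
    counts = Counter()
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if "N" not in window:
            counts[window] += 1
    return counts


def shared_kmers(seqs, k=6):
    """Return the set of k-mers present in every sequence of the input list."""
    sets = [set(kmer_counts(s, k)) for s in seqs]
    return set.intersection(*sets) if sets else set()
```

For example, `shared_kmers(["ACGT", "TACG"], k=3)` returns `{"ACG"}`, the only 3-mer that occurs in both strings.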
## 5. Usage Example
You can extract the archive and load the genomic data using the following Python snippet:
```python
import tarfile
import pandas as pd
import os

# 1. Extract the outer archive, then the nested dataset archive
with tarfile.open("LucaVirus-OpenVirus-Gene.tar.gz", "r:gz") as tar:
    tar.extractall(path="./LucaVirus-OpenVirus-Gene")
with tarfile.open("LucaVirus-OpenVirus-Gene/dataset.tar.gz", "r:gz") as tar:
    tar.extractall(path="./LucaVirus-OpenVirus-Gene/dataset")

# 2. Load a sample from the training set
train_path = "./LucaVirus-OpenVirus-Gene/dataset/v1.0/train"
csv_files = sorted(f for f in os.listdir(train_path) if f.endswith(".csv"))
if csv_files:
    # Load the first CSV file
    df = pd.read_csv(os.path.join(train_path, csv_files[0]))
    # Verify the sequence type (expected to be "gene" for every row)
    print(df["obj_type"].unique())
    print(f"Loaded {len(df)} genomic sequences.")
    print(df[["obj_id", "obj_seq"]].head())
```
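With roughly 10.4 million sequences, loading an entire split into memory may not be practical. The sketch below streams records one row at a time using only the standard library and summarizes sequence lengths for a split. The helper names `iter_sequences` and `length_summary` are hypothetical, and the code assumes each CSV carries the header described in Section 3.2.

```python
import csv
from pathlib import Path

# Long viral genomes can exceed the csv module's default field size limit.
csv.field_size_limit(2**31 - 1)


def iter_sequences(split_dir):
    """Yield (obj_id, obj_seq) pairs from every CSV file in one split directory."""
    for csv_path in sorted(Path(split_dir).glob("*.csv")):
        with open(csv_path, newline="") as handle:
            for row in csv.DictReader(handle):
                yield row["obj_id"], row["obj_seq"]


def length_summary(split_dir):
    """Return (count, shortest, longest) over the sequence lengths in a split."""
    count, shortest, longest = 0, None, 0
    for _, seq in iter_sequences(split_dir):
        n = len(seq)
        count += 1
        shortest = n if shortest is None else min(shortest, n)
        longest = max(longest, n)
    return count, shortest, longest
```

Calling `length_summary("./LucaVirus-OpenVirus-Gene/dataset/v1.0/dev")` on the smaller validation split is a quick way to see the length range before choosing a maximum input length for a model.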
## 6. Related Resources
This dataset is a core component of the **LucaGroup** biological modeling ecosystem.
- **Full Corpus (Gene + Prot)**: [LucaVirus-OpenVirus-Gene-Prot](https://huggingface.co/datasets/LucaGroup/LucaVirus-OpenVirus-Gene-Prot)
- **Protein Subset**: [LucaVirus-OpenVirus-Prot](https://huggingface.co/datasets/LucaGroup/LucaVirus-OpenVirus-Prot)
- **Models**: Visit the [LucaVirus Collection](https://huggingface.co/collections/LucaGroup/lucavirus).
## 7. Citation
If you use this dataset in your research, please cite:
```bibtex
@article{lucavirus2025,
title={Predicting the Evolutionary and Functional Landscapes of Viruses with a Unified Nucleotide-Protein Language Model: LucaVirus},
author={Pan, Yuan-Fei and He, Yong and others},
journal={bioRxiv},
year={2025},
url={https://www.biorxiv.org/content/early/2025/06/20/2025.06.14.659722}
}
```
## 8. License
This dataset is released under the **MIT License**.
## 9. Contact
*For further information, visit the [LucaGroup GitHub](https://github.com/LucaOne), email Yong He (sanyuan.hy@alibaba-inc.com, heyongcsat@gmail.com), or contact the team via the Hugging Face organization profile.*
Provider: LucaGroup



