Download link:

https://modelscope.cn/datasets/lgq12697/nucleotide_transformer_downstream_tasks

Resource description:

# Dataset Card for `nucleotide_transformer_downstream_tasks`

The `nucleotide_transformer_downstream_tasks` dataset features the 18 downstream tasks presented in the [Nucleotide Transformer paper](https://www.biorxiv.org/content/10.1101/2023.01.11.523679v3). They consist of both binary and multi-class classification tasks that aim to provide a consistent genomics benchmark.

**We note that this is an updated version of the benchmark, released after the paper went through peer review. We strongly encourage moving to this version in place of the [older version](https://huggingface.co/datasets/InstaDeepAI/nucleotide_transformer_downstream_tasks).**

Key points about the updated datasets:

1. They keep similar tasks, numbers of classes, and train and test set sizes as the previous benchmark.
2. All the data now comes from human samples and from recent, high-quality datasets (see the description of each dataset source below).
3. Randomly generated negative samples have been replaced by existing chunks of genomes that do not contain the respective elements.
4. All datasets now have proper chromosome held-out test sets.

## Dataset Description

- **Repository:** [Nucleotide Transformer](https://github.com/instadeepai/nucleotide-transformer)
- **Paper:** [The Nucleotide Transformer: Building and Evaluating Robust Foundation Models for Human Genomics](https://www.biorxiv.org/content/10.1101/2023.01.11.523679v3)

### Dataset Summary

The different datasets are collected from the following resources:

- [ENCODE](https://www.encodeproject.org/): Histone ChIP-seq data for 10 histone marks in the K562 human cell line were obtained from ENCODE. We downloaded bed narrowPeak files with the following identifiers: `H3K4me3` (ENCFF706WUF), `H3K27ac` (ENCFF544LXB), `H3K27me3` (ENCFF323WOT), `H3K4me1` (ENCFF135ZLM), `H3K36me3` (ENCFF561OUZ), `H3K9me3` (ENCFF963GZJ), `H3K9ac` (ENCFF891CHI), `H3K4me2` (ENCFF749KLQ), `H4K20me1` (ENCFF909RKY), `H2AFZ` (ENCFF213OTI). For each dataset, we selected 1kb genomic sequences containing peaks as positive examples and all 1kb sequences not overlapping peaks as negative examples.
- [Screen](https://screen.wenglab.org/): Human enhancer elements were retrieved from ENCODE's SCREEN database. Distal and proximal enhancers were combined. Enhancers were split into tissue-specific and tissue-invariant based on the vocabulary from [Meuleman et al.](https://www.meuleman.org/research/dhsindex/). Enhancers overlapping regions classified as tissue-invariant were labeled tissue-invariant, while all other enhancers were labeled tissue-specific. We selected 400bp genomic sequences containing enhancers as positive examples and all 400bp sequences not overlapping enhancers as negative examples. We created a binary classification task for the presence of enhancer elements in the sequence (`Enhancer`) and a multi-class prediction task whose labels are tissue-specific enhancer, tissue-invariant enhancer, or none (`Enhancer types`).
- [Eukaryotic Promoter Database](https://epd.expasy.org/epd/): We downloaded all human promoters from the Eukaryotic Promoter Database, spanning 49bp upstream and 10bp downstream of transcription start sites ([file](https://epd.expasy.org/ftp/epdnew/H_sapiens/006/Hs_EPDnew_006_hg38.bed)). This resulted in 29,598 promoter regions, 3,065 of which were TATA-box promoters (using the motif annotation at [https://epd.expasy.org/ftp/epdnew/H_sapiens/006/db/promoter_motifs.txt](https://epd.expasy.org/ftp/epdnew/H_sapiens/006/db/promoter_motifs.txt)). We selected 300bp genomic sequences containing promoters as positive examples and all 300bp sequences not overlapping promoters as negative examples. These positive and negative examples were used to create three different binary classification tasks: presence of any promoter element (`Promoter all`), a promoter with a TATA-box motif (`Promoter TATA`), or a promoter without a TATA-box motif (`Promoter no-TATA`).
- [GENCODE](https://www.gencodegenes.org/human/release_44.html): We obtained all human annotated splice sites from the GENCODE V44 gene annotation. Annotations were filtered to exclude level 3 transcripts (automated annotation), so all training data was annotated by a human.

## Dataset Structure

| Task                  | Number of train sequences | Number of test sequences | Number of labels | Sequence length (bp) |
| --------------------- | ------------------------- | ------------------------ | ---------------- | -------------------- |
| promoter_all          | 30,000                    | 1,584                    | 2                | 300                  |
| promoter_tata         | 5,062                     | 212                      | 2                | 300                  |
| promoter_no_tata      | 30,000                    | 1,372                    | 2                | 300                  |
| enhancers             | 30,000                    | 3,000                    | 2                | 400                  |
| enhancers_types       | 30,000                    | 3,000                    | 3                | 400                  |
| splice_sites_all      | 30,000                    | 3,000                    | 3                | 600                  |
| splice_sites_acceptor | 30,000                    | 3,000                    | 2                | 600                  |
| splice_sites_donor    | 30,000                    | 3,000                    | 2                | 600                  |
| H2AFZ                 | 30,000                    | 3,000                    | 2                | 1,000                |
| H3K27ac               | 30,000                    | 1,616                    | 2                | 1,000                |
| H3K27me3              | 30,000                    | 3,000                    | 2                | 1,000                |
| H3K36me3              | 30,000                    | 3,000                    | 2                | 1,000                |
| H3K4me1               | 30,000                    | 3,000                    | 2                | 1,000                |
| H3K4me2               | 30,000                    | 2,138                    | 2                | 1,000                |
| H3K4me3               | 30,000                    | 776                      | 2                | 1,000                |
| H3K9ac                | 23,274                    | 1,004                    | 2                | 1,000                |
| H3K9me3               | 27,438                    | 850                      | 2                | 1,000                |
| H4K20me1              | 30,000                    | 2,270                    | 2                | 1,000                |
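The card describes two construction ideas shared by most tasks: fixed-length genomic windows are labeled positive or negative depending on whether they contain an annotated element, and whole chromosomes are held out for the test set. The sketch below illustrates both on toy data. It is not the authors' pipeline: the chromosome names, sizes, element coordinates, the non-overlapping tiling scheme, the use of simple overlap as the positive criterion, and the function names `label_windows` and `chromosome_split` are all assumptions made for illustration, and the card does not state which chromosomes were actually held out.

```python
from typing import Dict, List, Set, Tuple

# (chromosome, start, end), 0-based and half-open, as in BED files
Interval = Tuple[str, int, int]
Labeled = Tuple[Interval, int]


def overlaps(window: Interval, element: Interval) -> bool:
    """Return True if two half-open intervals on the same chromosome share a base."""
    return (
        window[0] == element[0]
        and window[1] < element[2]
        and element[1] < window[2]
    )


def label_windows(
    chrom_sizes: Dict[str, int], elements: List[Interval], window: int
) -> List[Labeled]:
    """Tile each chromosome into fixed-length windows (hypothetical scheme).

    A window gets label 1 if any annotated element overlaps it, else 0.
    """
    labeled: List[Labeled] = []
    for chrom, size in chrom_sizes.items():
        for start in range(0, size - window + 1, window):
            w = (chrom, start, start + window)
            label = int(any(overlaps(w, e) for e in elements))
            labeled.append((w, label))
    return labeled


def chromosome_split(
    labeled: List[Labeled], test_chroms: Set[str]
) -> Tuple[List[Labeled], List[Labeled]]:
    """Send every window on a held-out chromosome to the test set."""
    train = [x for x in labeled if x[0][0] not in test_chroms]
    test = [x for x in labeled if x[0][0] in test_chroms]
    return train, test


# Toy genome and toy promoter annotations (all values are made up).
chrom_sizes = {"chrA": 1200, "chrB": 900}
promoters = [("chrA", 350, 410), ("chrB", 0, 60)]

labeled = label_windows(chrom_sizes, promoters, window=300)
train, test = chromosome_split(labeled, test_chroms={"chrB"})

print(len(train), len(test))
```

Splitting by chromosome rather than by random sampling keeps windows from the same genomic neighborhood out of both sets at once, which is the property the "chromosome held-out test sets" in the card are meant to guarantee.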


Use cases: