The North Pacific Eukaryotic Gene Catalog: clustered nucleotide metatranscripts and read counts

NIAID Data Ecosystem, indexed 2026-05-02

Download link:

https://zenodo.org/record/10570448

Resource description:

This dataset continues the development of the NPEGC Trinity de novo metatranscriptome assemblies from the protein data repository of The North Pacific Eukaryotic Gene Catalog. The nucleotide sequences corresponding to the NPEGC cluster representatives are collected in these repository files:

NPac.G1PA.bf100.id99.nt.fasta.gz
NPac.G2PA.bf100.id99.nt.fasta.gz
NPac.G3PA.bf100.id99.nt.fasta.gz
NPac.G3PA_diel.bf100.id99.nt.fasta.gz
NPac.D1PA.bf100.id99.nt.fasta.gz

A full description of this data is published in Scientific Data, available here: The North Pacific Eukaryotic Gene Catalog of metatranscriptome assemblies and annotations. Please cite this publication if your research uses this data:

Groussman, R. D., Coesel, S. N., Durham, B. P., Schatz, M. J., & Armbrust, E. V. (2024). The North Pacific Eukaryotic Gene Catalog of metatranscriptome assemblies and annotations. Scientific Data, 11(1), 1161.

These nucleotide sequences were sourced from the Zenodo repository for raw assemblies: The North Pacific Eukaryotic Gene Catalog: Raw assemblies from Gradients 1, 2 and 3

Key processing steps are outlined below, with links to the detailed code in the main GitHub code repository: https://github.com/armbrustlab/NPac_euk_gene_catalog

The code used to build the kallisto indices and map the short reads against them is available in the code repository here: NPEGC.nt_kallisto_counts.sh

There are two main steps:

1. Generate the kallisto index on the sets of clustered nucleotide metatranscripts
2. Map the short reads from environmental samples back to the assembly index

Run this way, kallisto generates separate results files for each sample file. Even after compression, the tarballed kallisto output directories are prohibitively large in total (>50GB).
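The two kallisto steps listed above (index, then quantification) can be sketched as follows. This is a minimal illustration, not the authors' script: the actual commands and options are in NPEGC.nt_kallisto_counts.sh. The sketch assumes paired-end reads, and the sample name, read file names, and output paths are hypothetical. The functions only build the command lines; running them requires kallisto on the PATH.

```python
import shlex


def kallisto_index_cmd(fasta, index):
    """Step 1: command to build a kallisto index from a clustered nucleotide FASTA."""
    return ["kallisto", "index", "-i", str(index), str(fasta)]


def kallisto_quant_cmd(index, out_dir, reads_1, reads_2):
    """Step 2: command to map one sample's paired-end reads against the index."""
    return ["kallisto", "quant", "-i", str(index), "-o", str(out_dir),
            str(reads_1), str(reads_2)]


# The FASTA name comes from this repository; everything else is hypothetical.
index_cmd = kallisto_index_cmd("NPac.G1PA.bf100.id99.nt.fasta.gz", "G1PA.idx")
quant_cmd = kallisto_quant_cmd("G1PA.idx", "kallisto_out/sample_A",
                               "sample_A_1.fastq.gz", "sample_A_2.fastq.gz")

# Print the commands for inspection. To execute them, pass each list to
# subprocess.run(cmd, check=True).
print(shlex.join(index_cmd))
print(shlex.join(quant_cmd))
```

Each sample gets its own output directory, which is why the per-sample results add up to the large total noted above.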
We use the code in this template R script to join together the 'est_counts' estimated count values for the tens of millions of protein sequences in each project metatranscriptome, along with their lengths. The code in this template script was used for each project: aggregate_kallisto_counts.R

The output count files for each project are gzip-compressed and uploaded to the NPEGC nucleotide data repository here:

G1PA.raw.est_counts.csv.gz
G2PA.raw.est_counts.csv.gz
G3PA.raw.est_counts.csv.gz
G3PA_diel.raw.est_counts.csv.gz
D1PA.raw.est_counts.csv.gz
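The join described above can be illustrated with a short Python sketch. It is not a port of aggregate_kallisto_counts.R, which is the authoritative version. It assumes each sample has a standard kallisto abundance.tsv (tab-separated, with target_id, length, and est_counts columns). The output layout, one row per sequence with a length column and one count column per sample, is an assumption and may differ from the published CSV files.

```python
import csv
import gzip


def read_abundance(path):
    """Return {target_id: (length, est_counts)} from one kallisto abundance.tsv."""
    rows = {}
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            rows[row["target_id"]] = (int(row["length"]), float(row["est_counts"]))
    return rows


def aggregate_counts(sample_paths, out_path):
    """Join est_counts across samples into one gzip-compressed CSV.

    sample_paths maps a sample name to the path of its abundance.tsv.
    """
    samples = list(sample_paths)
    lengths = {}
    counts = {}
    for sample in samples:
        for target_id, (length, est) in read_abundance(sample_paths[sample]).items():
            lengths[target_id] = length
            counts.setdefault(target_id, {})[sample] = est
    with gzip.open(out_path, "wt", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["target_id", "length"] + samples)
        for target_id, length in lengths.items():
            # A sequence missing from a sample's table is written as 0.0.
            row = [counts[target_id].get(sample, 0.0) for sample in samples]
            writer.writerow([target_id, length] + row)
```

A call might look like aggregate_counts({"sample_A": "kallisto_out/sample_A/abundance.tsv"}, "counts.csv.gz"), with hypothetical sample names and paths. This version holds every table in memory, which is fine for illustration but may not scale to tens of millions of sequences per project.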

Created:

2025-01-22