Transcriptomics data for CCLE, NCI-60, and PDAC mouse data
Source: NIAID Data Ecosystem (indexed 2026-05-01)
Download link:
https://zenodo.org/record/10785228
Description:
The dataset provides counts, lengths, TPM, and FPKM values per gene and per transcript. All CCLE and NCI-60 cell lines are identified by their Cellosaurus IDs.
All data were re-processed with the nf-core/rnaseq pipeline (version 3.10.1) in the STAR/Salmon setting (STAR alignment followed by Salmon quantification). The human FASTQ files (CCLE, NCI-60) were processed against GRCh38 and the mouse FASTQ files against GRCm39.
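As a reminder of how the counts, lengths, and TPM values listed above relate: each count is divided by its feature length, and the resulting rates are rescaled so that they sum to one million. The sketch below illustrates only this definition. The function name `counts_to_tpm` and the headerless, tab-separated three-column input (ID, count, length) are assumptions, and the deposited values come from the pipeline itself (Salmon works with effective lengths), not from this code.

```{bash}
#!/bin/bash
# Hypothetical helper: compute TPM from a tab-separated "id, count, length" table
# without a header row. Illustrates the definition only; lengths must be non-zero.
counts_to_tpm() {
    awk -F'\t' '
        {
            id[NR] = $1
            rate[NR] = $2 / $3          # reads per base for this feature
            total += rate[NR]
        }
        END {
            # Rescale every rate so that all values sum to one million.
            for (i = 1; i <= NR; i++)
                printf "%s\t%.2f\n", id[i], rate[i] / total * 1e6
        }
    ' "$1"
}
```

Because TPM is a proportion, the values of one sample always sum to one million, which is not true of FPKM.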
CCLE (1019 cell lines)
Raw FASTQ files were downloaded from the NCBI SRA Run Selector (BioProject PRJNA523380) using the SRA Toolkit. The runs were first prefetched, and the FASTQ files were then generated with:
```{bash}
#!/bin/bash
# Convert each prefetched SRA run to FASTQ and compress the result.
while read -r run
do
    echo "$run"
    fasterq-dump "$run"
    echo gzipping
    gzip "$run"*.fastq
done < SRR_Acc_List_CCLE.txt
```
Then, the prefetch directories were deleted.
Afterwards, the FASTQ files were processed with the nf-core/rnaseq pipeline using this command:
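The prefetch and clean-up steps are not shown in the description. A minimal sketch of how they might look, assuming the SRA Toolkit's `prefetch` is on the PATH and that it leaves one directory per run accession in the working directory. The function names `prefetch_runs` and `remove_prefetch_dirs` are hypothetical and not taken from the authors' scripts.

```{bash}
#!/bin/bash
# Hypothetical helpers; not the authors' actual script.

# Download the .sra archive of every run listed in the accession file.
prefetch_runs() {
    local acc_list="$1"
    prefetch --option-file "$acc_list"
}

# Remove the per-run directories left behind by prefetch
# (assumed layout: one directory named after each accession).
remove_prefetch_dirs() {
    local acc_list="$1" run
    while read -r run
    do
        # Skip blank lines and accessions without a directory.
        if [ -n "$run" ] && [ -d "$run" ]; then
            rm -r -- "$run"
        fi
    done < "$acc_list"
}

# Possible usage around the fasterq-dump loop:
#   prefetch_runs SRR_Acc_List_CCLE.txt
#   ... fasterq-dump / gzip loop ...
#   remove_prefetch_dirs SRR_Acc_List_CCLE.txt
```

Deleting only the directories named in the accession list keeps the clean-up from touching the generated FASTQ files or any unrelated directory.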
```{bash}
nextflow run nf-core/rnaseq \
    --input CCLE_samplesheet.csv \
    --outdir CCLE/nf_core/ \
    --multiqc_title CCLE_star_salmon \
    -c CCLE_nextflow.config \
    -profile singularity,slurm \
    --fasta ensembl107_GRCh38/Homo_sapiens.GRCh38.dna.primary_assembly.fa \
    --gtf ensembl107_GRCh38/Homo_sapiens.GRCh38.107.gtf \
    -r 3.10.1
```
For the output data, SRR accession numbers were mapped back to the cell line names using the SRARunTable metadata. These cell line names were then mapped to Cellosaurus IDs.
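The first mapping step can be sketched as a lookup by column name in the run table. This is only an illustration: the function name `srr_to_cell_line` is hypothetical, the column names are supplied by the caller because the description does not state them, and the code assumes a comma-separated file with a header row and no quoted commas inside fields.

```{bash}
#!/bin/bash
# Hypothetical helper: print "run<TAB>cell line name" pairs from an SRA run table.
# Arguments: table file, name of the run column, name of the cell line column.
srr_to_cell_line() {
    local table="$1" run_col="$2" name_col="$3"
    awk -F',' -v rc="$run_col" -v nc="$name_col" '
        NR == 1 {
            # Locate both columns by their header names.
            for (i = 1; i <= NF; i++) {
                if ($i == rc) r = i
                if ($i == nc) n = i
            }
            if (!r || !n) {
                print "column not found" > "/dev/stderr"
                exit 1
            }
            next
        }
        { print $r "\t" $n }
    ' "$table"
}
```

Looking columns up by header name rather than by position keeps the sketch independent of the column order in a particular run table export.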
The metadata file contains the Cellosaurus ID, the SRR accession numbers, the cell line names, metadata from SRA (BioProject, BioSample, Experiment), and metadata from Cellosaurus (cell line name, synonyms, diseases, cross references, BTO ID, CLO ID, sex, category, organism, comments).
NCI-60 (60 cell lines)
As for CCLE, FASTQ files were downloaded from the NCBI SRA Run Selector (BioProject PRJNA433861) using the SRA Toolkit.
Afterwards, the FASTQ files were processed with the nf-core/rnaseq pipeline using the same command settings as above:
```{bash}
nextflow run nf-core/rnaseq \
    --input NCI60_samplesheet.csv \
    --outdir NCI60/nf_core/ \
    --multiqc_title NCI60_star_salmon \
    -c NCI60_nextflow.config \
    -profile singularity,slurm \
    --fasta ensembl107_GRCh38/Homo_sapiens.GRCh38.dna.primary_assembly.fa \
    --gtf ensembl107_GRCh38/Homo_sapiens.GRCh38.107.gtf \
    -r 3.10.1
```
For the output data, SRR accession numbers were mapped back to the cell line names using the SRARunTable metadata. These cell line names were then mapped to Cellosaurus IDs.
The metadata file contains the Cellosaurus ID, the SRR accession numbers, the cell line names, metadata from SRA (BioProject, BioSample, Experiment), and metadata from Cellosaurus (cell line name, synonyms, diseases, cross references, BTO ID, CLO ID, sex, category, organism, comments).
PDAC mouse data (401 samples)
These data were generated by the MRI (University Hospital rechts der Isar, Munich). The data generation strategy is described in PMC6097607. The read_1 files contain all the cDNA, while the read_2 files contain only UMIs. Hence, only the read_1 files were used.
The FASTQ files were processed with the nf-core/rnaseq pipeline using this command:
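Using only read_1 means each sample enters the pipeline as single-end. A minimal sketch of how such a samplesheet could be written, using the nf-core/rnaseq columns `sample,fastq_1,fastq_2,strandedness` with `fastq_2` left empty. The function name `make_single_end_samplesheet`, the file-name suffix, and the strandedness value are assumptions supplied by the caller, since the description states none of them.

```{bash}
#!/bin/bash
# Hypothetical helper: write a single-end nf-core/rnaseq samplesheet to stdout.
# Arguments: FASTQ directory, read-1 file suffix (assumed naming), strandedness value.
make_single_end_samplesheet() {
    local fastq_dir="$1" suffix="$2" strandedness="$3" fq sample
    echo "sample,fastq_1,fastq_2,strandedness"
    for fq in "$fastq_dir"/*"$suffix"
    do
        # If nothing matches, the glob stays literal; skip it.
        [ -e "$fq" ] || continue
        sample=$(basename "$fq" "$suffix")
        # fastq_2 is left empty, so the sample is treated as single-end.
        echo "${sample},${fq},,${strandedness}"
    done
}

# Possible usage (directory, suffix, and strandedness are placeholders):
#   make_single_end_samplesheet fastq/ _read_1.fastq.gz auto > MRI_PDAC_samplesheet.csv
```

Matching on the read-1 suffix alone keeps the UMI-only read_2 files out of the samplesheet without having to move or delete them.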
```{bash}
nextflow run nf-core/rnaseq \
    --input MRI_PDAC_samplesheet.csv \
    --outdir MRI_PDAC/nf_core/ \
    --multiqc_title MRI_star_salmon \
    -c MRI_PDAC_nextflow.config \
    -profile singularity,slurm \
    --fasta ensembl110_GRCm39/Mus_musculus.GRCm39.dna_sm.primary_assembly.fa.gz \
    --gtf ensembl110_GRCm39/Mus_musculus.GRCm39.110.gtf.gz \
    -r 3.10.1
```
The metadata file contains information about the experiments, oncogenes, genotypes, and morphology (epithelial/mesenchymal/fibroblast contamination).
Created: 2024-03-06



