Data from: Reference-free transcriptome assembly in non-model animals from next generation sequencing data

Source: DataONE. Updated: 2012-03-29. Indexed: 2024-06-27.

Download link:

https://search.dataone.org/view/null

Description:

Next-generation sequencing (NGS) technologies offer the opportunity for population genomic study of non-model organisms sampled in the wild. The transcriptome is a convenient and popular target for such purposes. However, designing genetic markers from NGS transcriptome data requires assembling gene-coding sequences out of short reads. This is a complex task owing to gene duplications, genetic polymorphism, alternative splicing and transcription noise. Typical assembly programs return thousands of predicted contigs, whose connection to the species' true gene content is unclear, and from which SNPs are difficult to define. Here, the transcriptomes of five diverse non-model animal species (hare, turtle, ant, oyster and tunicate) were assembled from newly generated 454 and Illumina sequence reads. In two species for which a reference genome is available, a new procedure was introduced to annotate each predicted contig as either a full-length cDNA, fragment, chimera, allele, paralogue, genomic sequence or other, based on the number of, and overlap between, BLAST hits to the appropriate reference. Analyses showed that (i) the highest quality assemblies are obtained when 454 and Illumina data are combined, (ii) typical de novo assemblies include a majority of irrelevant cDNA predictions and (iii) assemblies can be appropriately cleaned by filtering contigs based on length and coverage. We conclude that robust, reference-free assembly of thousands of genes from transcriptomic NGS data is possible, opening promising perspectives for transcriptome-based population genomics in animals. A Galaxy pipeline implementing our best-performing assembly strategy is provided.
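
The cleaning step mentioned in point (iii), filtering contigs by length and coverage, could look roughly like the Python sketch below. This is only an illustration and not the Galaxy pipeline provided with the dataset: the `Contig` class, the `filter_contigs` function and the threshold values (200 bases, 2x mean coverage) are hypothetical, and per-contig coverage is assumed to have been computed beforehand.

```python
from dataclasses import dataclass


@dataclass
class Contig:
    """A predicted contig (hypothetical structure, for illustration only)."""
    name: str
    sequence: str
    coverage: float  # mean read depth, assumed to be computed upstream


def filter_contigs(contigs, min_length=200, min_coverage=2.0):
    """Keep contigs that are long enough and sufficiently covered.

    The default thresholds are placeholders, not values taken from the study.
    """
    return [
        c for c in contigs
        if len(c.sequence) >= min_length and c.coverage >= min_coverage
    ]


# Toy input: three made-up contigs.
contigs = [
    Contig("contig_1", "ACGT" * 100, 12.5),  # 400 bases, well covered
    Contig("contig_2", "ACGT" * 20, 30.0),   # 80 bases, too short
    Contig("contig_3", "ACGT" * 150, 1.0),   # 600 bases, coverage too low
]

kept = filter_contigs(contigs)
print([c.name for c in kept])
```

With these placeholder thresholds only `contig_1` passes both criteria. Suitable cut-offs in practice depend on the sequencing depth and the assembler used.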

Created:

2012-03-29
