Supporting data for "Improving the annotation of the Heterorhabditis bacteriophora genome"

Source: DataCite Commons (updated 2025-05-26, indexed 2025-04-15)
Download link:
http://gigadb.org/dataset/100404
Description:
Genome assembly and annotation remains an exacting task. As the tools available for these tasks improve, it is useful to return to data produced with earlier instances to assess their credibility and correctness. The entomopathogenic nematode Heterorhabditis bacteriophora is widely used to control insect pests in horticulture. The genome sequence for this species was reported to encode an unusually high proportion of unique proteins and a paucity of secreted proteins compared to other related nematodes. We revisited the H. bacteriophora genome assembly and gene predictions to ask whether these unusual characteristics were biological or methodological in origin. We mapped an independent resequencing dataset to the genome and used the blobtools pipeline to identify potential contaminants. While present (0.2% of the genome span, 0.4% of predicted proteins), assembly contamination was not significant. Re-prediction of the gene set using BRAKER1 and published transcriptome data generated a predicted proteome that was very different from the published one. The new gene set had a much reduced complement of unique proteins, better completeness values that were in line with other related species' genomes, and an increased number of proteins predicted to be secreted. It is thus likely that methodological issues drove the apparent uniqueness of the initial H. bacteriophora genome annotation and that similar contamination and misannotation issues affect other published genome assemblies.

Provider:
GigaScience Database
Created:
2018-03-07