
Key datafiles for marine microbe gene transfer manuscript

Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://zenodo.org/record/14867578
Description:
Key datafiles for the manuscript titled "Co-occurrence drives horizontal gene transfer among marine prokaryotes".

Large files (uploaded separately, to make it easier to download the zipfile of smaller files):

  gene_to_cluster.tsv.gz - Mapping of gene ID to cluster ID (based on running CD-HIT to define gene clusters). Used for the focal "cluster-based" HGT identification method.
  metaG_hyperg_cooccur.allsamples.combined.prepped.tsv.gz - Massive table of genome pairs input to the logistic regression for the focal model result. Includes all variables; rows containing NA values (e.g., for cases where genomes could not be placed into the tree to get tip distance) were removed. This is the data used for the focal logistic regression model we analyzed.

Zipped folder structure (ocean_hgt_zenodo.zip):

cooccur/ - Folder containing co-occurrence results and related information. Unless otherwise stated, files correspond to the focal genome dataset.
  Tara_present_metadata_median_by_sample.tsv.gz - Median environmental values per genome (based on metaG_presence_tara.tsv.gz profiles).
  allsamples_present_metadata_median_by_sample.tsv.gz - Median environmental values per genome (based on metaG_presence_allsamples.tsv.gz profiles).
  geotraces_present_metadata_median_by_sample.tsv.gz - Median environmental values per genome (based on metaG_presence_geotraces.tsv.gz profiles).
  metaG_presence_allsamples.tsv.gz - Presence of genomes across all metagenomics samples.
  metaG_presence_geotraces.tsv.gz - Presence of genomes across GEOTRACES metagenomics samples (these and other sample subset files can be slightly different due to how genomes and samples were filtered based on their prevalence in these tables).
  metaG_presence_tara.tsv.gz - Presence of genomes across Tara metagenomics samples.
  metaG_rpkm_allsamples.tsv.gz - Reads Per Kilobase per Million mapped reads (RPKM) of genomes across all metagenomics samples.
  metaG_rpkm_geotraces.tsv.gz - RPKM of genomes across samples where they were called as present. Restricted to GEOTRACES samples only.
  metaG_rpkm_tara.tsv.gz - RPKM of genomes across samples where they were called as present. Restricted to Tara samples only.
  progenomes/ - Folder containing files with the same names as above, but for the proGenomes dataset.

hgt_cooccur_phylo/ - Folder containing results related to HGT vs. co-occurrence vs. phylogenetic distance (and other variables).
  lessfiltered_taxa_hgt/ - Folder with tables summarizing information for rare genome pairs identified as "less-filtered", which were driving the significant signal in the focal set.
    hyperg_clusterbased_lessfilt_hgt_prepped_input.tsv.gz - The subset of the input to the model for these taxa pairs.
    hyperg_clusterbased_lessfilt_hgt_taxonomy.tsv.gz - Their taxonomy, as shown in a supplementary table.
  model_summaries.tsv.gz - Summary of all models run, similar to the results shown in the manuscript, but with additional models and with columns indicating 95% confidence intervals and P-values.

mapfiles/
  Dmitrijeva2024_COG_category_HGT_summary.tsv.gz - Breakdown of which COG categories are expected to be depleted / enriched among HGT hits, based on a prior study's literature review.
  MAG_taxa_breakdown.tsv.gz - Taxonomic breakdown for focal genomes.
  progenomes_ncbi_taxonomy.tsv.gz - Taxonomic breakdown for the proGenomes subset, based on parsing NCBI taxonomy from TaxIDs.

phylogeny/
  focal_Aligned_SCGs.phy.treefile - Newick tree for focal genomes.
  progenomes_Aligned_SCGs.phy.treefile - Newick tree for proGenomes subset genomes.

putative_hgt/
  blast/ - BLASTn-based HGT output on the focal dataset.
    blast_hit_bedfiles/ - BED files with all BLASTn results presented (with lower-scoring hits overlapping these removed). Results are split by sequence identity cut-off and taxonomic level of comparison. Each region hit is displayed over two lines, with the same "name", but with a different subject and query indicated.
    hit_gene_counts_and_lengths.tsv.gz - Breakdown of numbers of genes and region length per hit.
    cross_level_tallies_norm.tsv.gz - Numbers of hits per taxonomic level.
  cluster/ - Cluster-based HGT results on the focal dataset.
    all_best_hits.tsv.gz - Putative HGT events based on best hits of clusters across genomes in different genera or above.
    clusterbased_COG_category_enrichment_no.unannot.tsv.gz - COG category enrichment results for clusters involved with HGT.
    clusterbased_hgt_tallies.tsv.gz - Numbers of hits between genome pairs at each identity cut-off.
    clusterbased_posthoc_COG_enrich.tsv.gz - Targeted COG enrichment results (on the subset of COG categories of interest at >= 99% identity).
    cross_level_tallies_norm.tsv.gz - Numbers of hits per taxonomic level.
    num_comparisons_per_inter.level.tsv.gz - Number of genomes compared per taxonomic level (in the focal genome database).
  progenomes_cluster/ - Cluster-based HGT results (for the proGenomes database analysis).
    gene_hgt_tallies.tsv.gz - Numbers of hits between genome pairs at each identity cut-off.
    putative_hgt_calls.tsv.gz - Breakdown of putative HGT hits.
  ranger/ - RANGER-DTL-based HGT results on the focal dataset.
    homer_rangerdtl_combined_summary.tsv.gz - Overall summary of HGT events per species.
    pairwise_hgt_counts.tsv.gz - HGT tally per taxa pair.

taxa_subsets/ - Lists of taxa IDs of interest.
  focal_genomes_freeliving_associated.tsv.gz - Genomes defined as associated with "free-living" samples in the focal dataset.
  focal_genomes_lessfiltered_associated.tsv.gz - Genomes defined as associated with "less-filtered" samples in the focal dataset.
  hgt_retained_taxa_missing_from_tree.txt.gz - Genome IDs that were retained for HGT and co-occurrence analyses, but could not be placed into the tree.
  progenomes_freeliving_associated.tsv.gz - Genomes defined as associated with "free-living" samples in the proGenomes subset dataset.
  progenomes_highqual_retained.txt.gz - Subset of proGenomes aquatic representative genomes retained as high-quality (non-metagenomic).
  progenomes_lessfiltered_associated.tsv.gz - Genomes defined as associated with "less-filtered" samples in the proGenomes subset dataset.
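
Most of the files above are gzipped tab-separated tables, and at least one is described as massive, so it may help to stream them rather than load them whole. The following is a minimal Python sketch, not part of the record itself. It assumes that the first line of each table is a header row and that missing values are written as the literal string NA; neither is confirmed by the description, and the function name summarize_tsv_gz is hypothetical.

```python
import csv
import gzip
from typing import List, Tuple


def summarize_tsv_gz(path: str, na_token: str = "NA") -> Tuple[List[str], int, int]:
    """Stream a gzipped TSV and return (header, n_data_rows, n_rows_with_na).

    Assumptions (not confirmed by the dataset description): the first line
    is a header row, and missing values appear as the literal `na_token`.
    """
    n_rows = 0
    n_rows_with_na = 0
    # "rt" opens the gzip stream in text mode so rows are parsed one at a
    # time and the whole table never has to fit in memory.
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        header = next(reader, [])
        for row in reader:
            n_rows += 1
            # Count rows in which any field is exactly the NA token.
            if na_token in row:
                n_rows_with_na += 1
    return header, n_rows, n_rows_with_na
```

Calling summarize_tsv_gz on metaG_hyperg_cooccur.allsamples.combined.prepped.tsv.gz would, under these assumptions, be one way to check the statement that rows containing NA values were removed: the third returned value should then be zero.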
Created:
2025-03-18