Datasets and software for article "Predicting coarse-grained representations of biogeochemical cycles from metabarcoding data"
Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://zenodo.org/record/14762346
Description:
Datasets for Tabigecy article
This repository contains additional information for the article "Predicting coarse-grained representations of biogeochemical cycles from metabarcoding data". This information allows the results presented in the article to be reproduced:
- article_data: this zip archive contains:
  - input files used for the article experiments:
    - bordenave_et_al_2013.tsv, bordenave_et_al_2013_abundance.csv and bordenave_et_al_2013_group.tsv for the Bordenave et al. dataset.
    - schwab_et_al_2022.tsv, schwab_et_al_2022_abundance.tsv and schwab_et_al_2022_sample_grouping.tsv for the Schwab et al. dataset.
  - output folders from Tabigecy run on these inputs:
    - output_bordenave: output folder for Bordenave et al. dataset.
    - output_schwab: output folder for Schwab et al. dataset.
  - scripts used to create plots:
    - create_pca.R to create PCA biplot and correlation plot from the results of both datasets.
    - bordenave_create_figure_article.py to create polar plots for the Bordenave et al. dataset.
    - schwab_create_figure_article.py to create polar plots for the Schwab et al. dataset.
  - several folders containing svg files for the article figures: bordenave_figure, experiment_figure, schwab_figure and workflow_figure.
  - original_data: original input files for both datasets.
- input_files_esmecata_precomputed_db.zip: six input files created by SPARQL queries on UniProt to extract all taxa associated with the species, genus, family, order, class and phylum ranks. They were created with a script available in the EsMeCaTa repository: esmecata/precomputed/create_input_precomputation.py. These files were used as input to esmecata proteomes to create the precomputed database.
- database_proteomes_folder.zip: compressed archive containing the proteomes retrieved by EsMeCaTa for species, genus, family, order, class and phylum to create the EsMeCaTa precomputed database version 1.0.0. It is the result of combining the different runs of esmecata proteomes on the 6 taxonomic ranks for all the associated taxa of UniProt. From this folder, the precomputed database has been created by means of the following commands:
Clustering of the proteomes:
esmecata clustering -i database_proteomes_folder -o database_output_clustering -c 32 --remove-tmp
Annotation of the consensus proteomes:
esmecata annotation -i database_output_clustering -o database_output_annotation -e /path/to/eggnog/database -c 32
Merging results from the three folders into the precomputed database:
esmecata_create_db from_runs -iproteomes database_proteomes_folder -iclustering database_output_clustering -iannotation database_output_annotation -o esmecata_precomputed_database --db-version "1.0" -c 10
- software.zip: compressed archive containing the code of the tools developed and used in the article:
  - bigecyhmm-0.1.5.zip: contains the code of bigecyhmm version 0.1.5 used in the article.
  - esmecata-0.6.0.zip: contains the code of esmecata version 0.6.0 used in the article.
  - tabigecy-0.1.1.zip: contains the code of tabigecy version 0.1.1 used in the article.
- taxdmp_2024-10-01.tar.gz: The version of the NCBI Taxonomy database used in the article. To use this version of the database with EsMeCaTa, you have to import it with ete3 using the following command:
python3 -c "from ete3 import NCBITaxa; ncbi = NCBITaxa(); ncbi.update_taxonomy_database('taxdmp_2024-10-01.tar.gz')"
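The three database-building commands listed above (clustering, annotation, merging) have to run in that order, because each step reads the output folder of the previous one. The following is a minimal Python sketch that assembles them as argument lists. The function name and its parameters are hypothetical helpers introduced for illustration; the executables and flags are copied from the commands above. The value passed to --db-version is left as a required argument because it appears truncated in the command above.

```python
import shlex


def esmecata_database_commands(proteomes_dir, clustering_dir, annotation_dir,
                               eggnog_db, output_db, db_version):
    """Return the three command lines (as argument lists) in execution order.

    Hypothetical helper: only the executables and flags come from the
    commands listed in this description.
    """
    return [
        # Step 1: cluster the proteomes.
        ["esmecata", "clustering", "-i", proteomes_dir,
         "-o", clustering_dir, "-c", "32", "--remove-tmp"],
        # Step 2: annotate the consensus proteomes (reads step 1 output).
        ["esmecata", "annotation", "-i", clustering_dir,
         "-o", annotation_dir, "-e", eggnog_db, "-c", "32"],
        # Step 3: merge the three folders into the precomputed database.
        ["esmecata_create_db", "from_runs",
         "-iproteomes", proteomes_dir,
         "-iclustering", clustering_dir,
         "-iannotation", annotation_dir,
         "-o", output_db, "--db-version", db_version, "-c", "10"],
    ]


if __name__ == "__main__":
    commands = esmecata_database_commands(
        proteomes_dir="database_proteomes_folder",
        clustering_dir="database_output_clustering",
        annotation_dir="database_output_annotation",
        eggnog_db="/path/to/eggnog/database",
        output_db="esmecata_precomputed_database",
        db_version="1.0",  # as far as it is visible in the command above
    )
    for command in commands:
        # Dry run: print each command instead of executing it.
        print(shlex.join(command))
```

To execute the steps rather than print them, each list could be passed to subprocess.run(command, check=True), which stops the chain as soon as one step fails.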
Perform experiments
Analyses performed in the article can be reproduced by running tabigecy on the input files of the article_data archive.
To do so, pass the EsMeCaTa input file (either bordenave_et_al_2013.tsv or schwab_et_al_2022.tsv) to the --infile parameter and the abundance file (either bordenave_et_al_2013_abundance.csv or schwab_et_al_2022_abundance.tsv) to the --inAbundfile parameter. The precomputed database is required and is given with the --precomputedDB parameter. The database can be downloaded from Zenodo.
Command for the Bordenave et al. dataset:
nextflow run ArnaudBelcour/tabigecy --infile bordenave_et_al_2013.tsv --inAbundfile bordenave_et_al_2013_abundance.csv --precomputedDB /path/to/esmecata_database.zip --outputFolder output_bordenave --coreBigecyhmm xx
Command for the Schwab et al. dataset:
nextflow run ArnaudBelcour/tabigecy --infile schwab_et_al_2022.tsv --inAbundfile schwab_et_al_2022_abundance.tsv --precomputedDB /path/to/esmecata_database.zip --outputFolder output_schwab --coreBigecyhmm xx
To decrease the runtime of the workflow, it is advised to assign several cores by replacing xx in `--coreBigecyhmm xx` with the number of cores. With 5 cores, the runtime of the workflow is around 13 minutes.
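Since the two commands differ only in their file names and output folder, they can be generated from a small table. The sketch below is one way to do that in Python; the DATASETS dictionary and the tabigecy_command function are hypothetical names introduced for illustration, while the file names and Nextflow parameters are those given above.

```python
import shlex

# File names and output folders as listed for the article_data archive.
DATASETS = {
    "bordenave": {
        "infile": "bordenave_et_al_2013.tsv",
        "abundance": "bordenave_et_al_2013_abundance.csv",
        "output": "output_bordenave",
    },
    "schwab": {
        "infile": "schwab_et_al_2022.tsv",
        "abundance": "schwab_et_al_2022_abundance.tsv",
        "output": "output_schwab",
    },
}


def tabigecy_command(dataset, precomputed_db, cores):
    """Build the nextflow argument list for one dataset (hypothetical helper)."""
    files = DATASETS[dataset]
    return [
        "nextflow", "run", "ArnaudBelcour/tabigecy",
        "--infile", files["infile"],
        "--inAbundfile", files["abundance"],
        "--precomputedDB", precomputed_db,
        "--outputFolder", files["output"],
        "--coreBigecyhmm", str(cores),
    ]


if __name__ == "__main__":
    for name in DATASETS:
        # 5 cores is the setting for which a runtime is quoted above.
        command = tabigecy_command(name, "/path/to/esmecata_database.zip", 5)
        print(shlex.join(command))
```

The script only prints the commands; running them still requires Nextflow and the precomputed database to be available locally.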
To create the polar plots, run the two Python scripts from the folder that contains the input files and output folders (inside the article_data folder):
python3 bordenave_create_figure_article.py
python3 schwab_create_figure_article.py
To create the PCA and correlation plots, run the R script in the same location:
Rscript create_pca.R
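Because the plotting scripts are meant to be run from inside the article_data folder, a quick check of the working directory can save a failed run. The sketch below lists which of the files and folders named in the archive description are absent from a given folder. The EXPECTED_ENTRIES list and the missing_entries function are hypothetical; which of these entries each script actually reads is not stated here, so the list is an assumption based on the archive contents.

```python
from pathlib import Path

# Entries named in the description of the article_data archive.
# Assumption: the plotting scripts expect them side by side in one folder.
EXPECTED_ENTRIES = [
    "bordenave_et_al_2013.tsv",
    "bordenave_et_al_2013_abundance.csv",
    "bordenave_et_al_2013_group.tsv",
    "schwab_et_al_2022.tsv",
    "schwab_et_al_2022_abundance.tsv",
    "schwab_et_al_2022_sample_grouping.tsv",
    "output_bordenave",
    "output_schwab",
    "create_pca.R",
    "bordenave_create_figure_article.py",
    "schwab_create_figure_article.py",
]


def missing_entries(folder):
    """Return the expected files or folders that do not exist in `folder`."""
    root = Path(folder)
    return [name for name in EXPECTED_ENTRIES if not (root / name).exists()]


if __name__ == "__main__":
    missing = missing_entries(".")
    if missing:
        print("Missing from the current folder:")
        for name in missing:
            print("  " + name)
    else:
        print("All expected entries are present.")
```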
Metadata
The experiments were performed with the following tool versions:
| Tool | Version |
|---|---|
| Java (OpenJDK) | 11.0.22 |
| Nextflow | 24.10.3 |
| Tabigecy | 0.1.1 |
| Python | 3.12.2 |
| EsMeCaTa | 0.6.0 |
| EsMeCaTa precomputed database | 1.0.0 |
| ete3 | 3.1.3 |
| biopython | 1.83 |
| bigecyhmm | 0.1.5 |
| pandas | 1.5.3 |
| plotly | 5.19.0 |
| matplotlib | 3.9.2 |
| seaborn | 0.13.2 |
| kaleido | 0.2.1 |
| pyhmmer | 0.10.8 |
| pillow | 10.1.0 |
| R | 4.4.1 |
| factoextra | 1.0.7 |
| ade4 | 1.7-22 |
| corrplot | 0.94 |
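To compare a local Python environment against the versions listed above, the installed package metadata can be queried with the standard library. The following is a minimal sketch, not part of the article's code: EXPECTED_VERSIONS, installed_version and version_report are hypothetical names, and it assumes that each Python tool is installed under the distribution name shown in the list (for example esmecata and bigecyhmm). Java, Nextflow, R and the R packages are not covered.

```python
from importlib.metadata import PackageNotFoundError, version

# Python packages and versions copied from the list above.
# Assumption: distribution names match the tool names in lowercase.
EXPECTED_VERSIONS = {
    "esmecata": "0.6.0",
    "ete3": "3.1.3",
    "biopython": "1.83",
    "bigecyhmm": "0.1.5",
    "pandas": "1.5.3",
    "plotly": "5.19.0",
    "matplotlib": "3.9.2",
    "seaborn": "0.13.2",
    "kaleido": "0.2.1",
    "pyhmmer": "0.10.8",
    "pillow": "10.1.0",
}


def installed_version(package):
    """Return the installed version string, or None if not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def version_report(expected, lookup=installed_version):
    """Map each package to a (status, expected, found) tuple.

    status is "ok", "differs" or "missing".
    """
    report = {}
    for package, wanted in expected.items():
        found = lookup(package)
        if found is None:
            status = "missing"
        elif found == wanted:
            status = "ok"
        else:
            status = "differs"
        report[package] = (status, wanted, found)
    return report


if __name__ == "__main__":
    for package, (status, wanted, found) in version_report(EXPECTED_VERSIONS).items():
        print(f"{package}: {status} (expected {wanted}, found {found})")
```

A version that differs does not necessarily prevent the workflow from running; the report only shows where the environment departs from the one used for the article.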
Creation date:
2025-01-30



