
AlphaFind v2: Evaluation data, results and reproducibility protocol

Source: NIAID Data Ecosystem, indexed 2026-05-10
Download link:
https://figshare.com/articles/dataset/AlphaFind_v2_Evaluation_data_results_and_reproducibility_protocol/31802743
Description:
This dataset contains the evaluation dataset, code and results reported in the publication: AlphaFind v2: Similarity Search in AlphaFold DB and TED Domains across Structural Contexts (https://doi.org/10.64898/2026.03.10.710735).

The evaluation uses the multi-domain protein selection from https://doi.org/10.6084/m9.figshare.30546650 (afdb-benchmark/af-cath-multi-domain-list.tsv) and downloads the corresponding data from the AlphaFold DB (https://alphafold.ebi.ac.uk/) and TED DB (https://ted.cathdb.info/) API services. The dataset is then saved as afdb-structures/ (2050 multidomain protein chains from AlphaFold DB) and afdb-structures-domains/ (4420 TED domains extracted from the 2050 multidomain proteins).

Contents

alphafind-evaluation-data.zip
├── afdb-benchmark/
├── afdb-structures/
└── afdb-structures-domains/

results.zip
├── foldseek_results/                 # Raw FoldSeek API responses
│   └── afdb50_AF-{UNIPROT_ID}-F1-model_v6.json
│
├── foldseek_results_tmscores/        # FoldSeek results with TM-scores
│   └── foldseek_results_tmscores_{UNIPROT_ID}.csv
│
├── alphafindv1_results/              # AlphaFind v1 search results
│   └── {UNIPROT_ID}_chainA_limit{K}.json
│
├── alphafindv2_results/              # AlphaFind v2 chain search results
│   └── {UNIPROT_ID}_chains_k{K}.json
│
├── alphafindv2_domains_results/      # AlphaFind v2 domain search results
│   └── {UNIPROT_ID}_TED{NN}_chains_k{K}.json
│
├── merizo_results/                   # Merizo domain search results
│   ├── AF-{UNIPROT_ID}-F1-model_v4_TED{NN}_results.json
│   └── AF-{UNIPROT_ID}-F1-model_v4_TED{NN}_search.tsv
│
├── figures/                          # Comparison plots
│   ├── chains_comparison_tm.pdf      # Chains TM-score boxplot
│   ├── chains_comparison_tm.png
│   ├── domains_comparison_tm.pdf     # Domains TM-score boxplot
│   └── domains_comparison_tm.png
│
├── *_with_timing.csv                 # Timing data for each method
├── foldseek-nresults.csv             # Result counts per query
└── *_downloads.csv                   # Download logs

statistical_tests.zip
├── aggregate_results_chains_statistics.csv
├── aggregate_results_chains_with_stats.py
├── aggregate_results_domains_statistics.csv
└── statistical_tests.md

alphafind-v2-evaluation-scripts.zip
├── README.md
├── aggregate_results.py
├── aggregate_results_chains_with_stats.py
├── compute-tms.py
├── count-foldseek-results.py
├── download_data.py
├── eval-alphafindv1.py
├── eval-alphafindv2.py
├── eval-alphafindv2_domains.py
├── eval-foldseek.py
├── eval-merizo.py
├── extract-foldseek.py
├── find_domain_outliers.py
├── plot_domains_comparison.py
├── plot_input_statistics.py
├── plot_results_comparison.py
├── requirements.txt
└── visualize_results.py

figures.zip
├── chains_comparison.pdf
├── chains_comparison_time.pdf
├── chains_comparison_tm.pdf
├── domains_comparison_time.pdf
├── domains_comparison_tm.pdf
├── cath_domains_per_chain.pdf
├── cath_unique_families_per_chain.pdf
├── chain_atoms_histogram.pdf
├── chain_residues_histogram.pdf
├── domain_atoms_histogram.pdf
└── domain_residues_histogram.pdf

How to reproduce

The instructions are also in the README.md of alphafind-v2-evaluation-scripts.zip.

Prerequisites

- Python (originally run on Python 3.10.16)
- USalign, for TM-score computation
- make, for USalign compilation

    git clone https://github.com/pylelab/USalign.git
    cd USalign && make

Python dependencies

    pip install numpy pandas scipy requests tqdm matplotlib

Download the data

    python download_data.py

This downloads:

- Protein chain PDB files to afdb-structures/
- Domain PDB files to afdb-structures-domains/

Alternatively, you can use the included alphafind-evaluation-data.zip and just move the subdirectories into the main directory structure:

    cd alphafind-evaluation-data/ && mv * ../.

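Once the data has been downloaded or unpacked, it may be worth confirming that both directories are complete before starting the searches. The following is a minimal sketch and not part of the published scripts: the function names are hypothetical, and it assumes the structure files carry a `.pdb` extension and that the script is run from the main directory. The expected counts (2050 chains, 4420 domains) are the ones stated in the description.

```python
from pathlib import Path

# Expected file counts, taken from the dataset description.
EXPECTED_COUNTS = {
    "afdb-structures": 2050,          # multidomain protein chains
    "afdb-structures-domains": 4420,  # TED domains
}


def count_structures(root, pattern="*.pdb"):
    """Return {directory name: number of files matching `pattern`}.

    The "*.pdb" pattern is an assumption; adjust it if the files
    use a different extension. A missing directory counts as 0.
    """
    root = Path(root)
    counts = {}
    for name in EXPECTED_COUNTS:
        directory = root / name
        if directory.is_dir():
            counts[name] = len(list(directory.glob(pattern)))
        else:
            counts[name] = 0
    return counts


def find_mismatches(counts):
    """Return {name: (found, expected)} for directories with an unexpected count."""
    return {
        name: (counts.get(name, 0), expected)
        for name, expected in EXPECTED_COUNTS.items()
        if counts.get(name, 0) != expected
    }


if __name__ == "__main__":
    mismatches = find_mismatches(count_structures("."))
    if not mismatches:
        print("All structure directories have the expected number of files.")
    for name, (found, expected) in mismatches.items():
        print(f"{name}: found {found} files, expected {expected}")
```

A mismatch does not necessarily mean the download failed; it may only indicate that the extension assumption does not hold for this dataset.
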
Run FoldSeek Search

Run FoldSeek Server API search first (required to determine result counts for other methods):

    python eval-foldseek.py

- Searches against `afdb-50` database
- Results saved to `results/foldseek_results/`
- Timing saved to `results/foldseek_results_with_timing.csv`

Prepare FoldSeek results for TM-Score computation

    python extract-foldseek.py

Extracts results from foldseek evaluation to individual CSV files in results/foldseek_results_tmscores/

Compute TM-Scores

    python compute-tms.py --input-dir results/foldseek_results_tmscores

Uses USalign to compute TM-scores for FoldSeek results. The timing is not included in search time.

Count the FoldSeek results

    python count-foldseek-results.py

Creates foldseek-nresults.csv used to match result counts in AlphaFind queries.

Run AlphaFind v1 Search

    python eval-alphafindv1.py

- Queries specify UniProt ID, chain (A), and limit matching FoldSeek result counts
- Results saved to `results/alphafindv1_results/`
- TM-scores returned directly by API

Run AlphaFind v2 Search

For chains:

    python eval-alphafindv2.py

For domains:

    python eval-alphafindv2_domains.py

- Queries use `k` parameter matching baseline result counts
- Two timing metrics recorded:
  - Approximate time: until initial results collected
  - TM-score time: until exact TM-scores computed
- Results saved to results/alphafindv2_results/ or results/alphafindv2_domains_results/

Run Merizo Search (Domains)

    python eval-merizo.py

- Searches against TED database
- Results saved to `results/merizo_results/`
- TM-scores returned directly (columns: `q_tm`, `t_tm`, `max_tm`)

Aggregate Results with Statistical Testing

This step produces final summary tables and p-values.

    cd results/ && python ../aggregate_results_chains_with_stats.py

Outputs:

- aggregate_results_chains.csv - Chain performance summary
- aggregate_results_chains_statistics.csv - Chain statistical tests
- aggregate_results_domains.csv - Domain performance summary
- aggregate_results_domains_statistics.csv - Domain statistical tests

Generate Visualization Plots

    python plot_input_statistics.py

From results/ directory:

    cd results/
    python ../plot_results_comparison.py   # Chains boxplot
    python ../plot_domains_comparison.py   # Domains boxplot

Outputs:

- input_statistics/*.pdf - Input data histograms
- results/figures/chains_comparison_tm.pdf - Chains TM-score comparison
- results/figures/domains_comparison_tm.pdf - Domains TM-score comparison
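
Because the order of the steps matters (the FoldSeek search and result count must precede the AlphaFind runs), the sequence above could be captured in a small driver. This is only a sketch under assumptions, not part of the published scripts: `STEPS` and `run_pipeline` are hypothetical names, the script is assumed to sit in the main directory next to the evaluation scripts, USalign is assumed to be compiled already, and the download step can be dropped if alphafind-evaluation-data.zip was unpacked instead. By default it only prints the commands.

```python
import subprocess
import sys

# (script and arguments, working directory), in the documented order.
STEPS = [
    (["download_data.py"], "."),
    (["eval-foldseek.py"], "."),
    (["extract-foldseek.py"], "."),
    (["compute-tms.py", "--input-dir", "results/foldseek_results_tmscores"], "."),
    (["count-foldseek-results.py"], "."),
    (["eval-alphafindv1.py"], "."),
    (["eval-alphafindv2.py"], "."),
    (["eval-alphafindv2_domains.py"], "."),
    (["eval-merizo.py"], "."),
    (["../aggregate_results_chains_with_stats.py"], "results"),
    (["plot_input_statistics.py"], "."),
    (["../plot_results_comparison.py"], "results"),
    (["../plot_domains_comparison.py"], "results"),
]


def run_pipeline(steps=STEPS, dry_run=True):
    """Print each command in order and, unless dry_run is set, execute it.

    Uses the current interpreter (sys.executable) in place of a bare
    `python`. Execution stops at the first failing step because
    check=True raises CalledProcessError.
    Returns the list of (command, working directory) pairs.
    """
    commands = []
    for args, cwd in steps:
        command = [sys.executable, *args]
        commands.append((command, cwd))
        print(f"[{cwd}] {' '.join(command)}")
        if not dry_run:
            subprocess.run(command, cwd=cwd, check=True)
    return commands


if __name__ == "__main__":
    # Dry run: list the commands without executing anything.
    run_pipeline(dry_run=True)
```

Calling `run_pipeline(dry_run=False)` would execute the steps for real; since the searches query remote API services, running the steps individually may be preferable when a single step needs to be retried.
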
Created: 2026-03-25