
SAIR

ModelScope Community | Updated 2026-01-06 | Indexed 2025-11-29
Download link:
https://modelscope.cn/datasets/SandboxAQ/SAIR
Description:
Announcing SAIR
===============

***Structurally-Augmented IC50 Repository***

![](https://cdn.prod.website-files.com/622a3cfaa89636b753810f04/6851adb90e980253b2ece115_SAIR-models.png)

**In collaboration with Nvidia**

# The Largest Publicly Available Binding Affinity Dataset with Cofolded 3D Structures

**SAIR (Structurally Augmented IC50 Repository)** is the largest public dataset of protein-ligand 3D structures paired with binding potency measurements. SAIR contains over one million protein-ligand complexes (1,048,857 unique pairs) and a total of 5.2 million 3D structures, curated from the ChEMBL and BindingDB databases and cofolded using the Boltz-1x model.

- **2.5 TB** Of Publicly Available Data
- **>5 Million** Cofolded 3D Structures
- **>1 Million** Unique Protein-Ligand Pairs

By providing structure-activity data at this unprecedented scale, we aim to enable researchers to train and evaluate new AI models for drug discovery, bridging the historical gap between molecular structure and drug potency prediction.

# **Build with SAIR**

SAIR is offered under a CC BY 4.0 license and is now available on Hugging Face. The data are completely **free for commercial and non-commercial use**.

SAIR can be used as a baseline for benchmarking biofoundation models, or for training and fine-tuning new models that predict binding affinity. We would love to hear about other ways you plan to use this dataset.

# **How to Cite**

If you use this work, please cite:

```latex
@article{SANDBOXAQ-SAIR2025,
  author = {Lemos, Pablo and Beckwith, Zane and Bandi, Sasaank and van Damme, Maarten and Crivelli-Decker, Jordan and Shields, Benjamin J. and Merth, Thomas and Jha, Punit K. and De Mitri, Nicola and Callahan, Tiffany J. and Nish, AJ and Abruzzo, Paul and Salomon-Ferrer, Romelia and Ganahl, Martin},
  title = {SAIR: Enabling Deep Learning for Protein-Ligand Interactions with a Synthetic Structural Dataset},
  elocation-id = {2025.06.17.660168},
  year = {2025},
  doi = {10.1101/2025.06.17.660168}
}
```

# **Bridging a Gap in AI-Driven Drug Design**

Binding affinity prediction is central to drug discovery: it tells us how strongly a candidate molecule (ligand) binds to a target protein, which is key for designing effective drugs. In theory, a ligand's binding affinity is determined by the 3D interactions in the protein-ligand complex. However, deep learning models that *use* 3D structures have been limited by the scarcity of suitable training data. Very few protein-ligand complexes have both a resolved 3D structure and a measured potency (IC50, Ki, etc.), so most AI approaches have had to rely on indirect data such as sequences or 2D chemical structures.

One way to overcome this limitation is to generate synthetic training data using predicted structures. Recent advances in protein structure prediction (e.g. AlphaFold) mean we can computationally model protein-ligand complexes and use those models for learning. Initial efforts like the PLINDER dataset demonstrated the promise of this approach. SAIR was created to dramatically expand on this idea by providing a massive repository of computationally folded protein-ligand structures *with* corresponding experimental affinity values. Our goal is to fill the data gap and catalyze more accurate and robust ML models for binding affinity prediction.
# **More Information**

- [Scientific Manuscript](https://go.sandboxaq.com/rs/175-UKR-711/images/sair_paper.pdf)
- [Data License](https://storage.cloud.google.com/sandboxaq-sair/LICENSE.txt?authuser=0)
- [Blogpost](https://www.sandboxaq.com/post/sair-the-structurally-augmented-ic50-repository)

# **Contact us about SAIR**

Here at SandboxAQ, we're releasing SAIR to our customers and the world as a first step toward revamping drug discovery. Expect new datasets, AI models, and transformative solutions to follow, across the drug development pipeline. If you're interested in learning more about SAIR, or in seeing how it or models trained on it might be expanded to include targets of special interest to your business, we'd love to hear from you. Contact us at **[SAIR@sandboxaq.com](mailto:SAIR@sandboxaq.com)**.

# **Downloading the dataset**

The following is an example of how you can download the data (both the .parquet file and the structure files) via Python:

```python
import os
import tarfile
from typing import Optional

from huggingface_hub import list_repo_files, hf_hub_download
from tqdm import tqdm
import pandas as pd


def load_sair_parquet(destination_dir: str) -> Optional[pd.DataFrame]:
    """
    Downloads the sair.parquet file from the SandboxAQ/SAIR dataset and loads it
    into a pandas DataFrame.

    Args:
        destination_dir (str): The local path where the parquet file will be
            downloaded. The directory will be created if it doesn't exist.

    Returns:
        Optional[pd.DataFrame]: A pandas DataFrame containing the data from the
            sair.parquet file, or None if the download or load fails.
    """
    # --- 1. Setup and Repository Configuration ---
    repo_id = "SandboxAQ/SAIR"
    parquet_filename = "sair.parquet"

    print(f"Targeting repository: {repo_id}")
    print(f"Targeting file: {parquet_filename}")
    print(f"Destination directory: {destination_dir}")

    # Create the destination directory if it doesn't already exist
    os.makedirs(destination_dir, exist_ok=True)
    print("Ensured destination directory exists.")

    # --- 2. Download the Parquet file from the Hugging Face Hub ---
    download_path = os.path.join(destination_dir, parquet_filename)

    print(f"\nDownloading '{parquet_filename}'...")
    try:
        # Use hf_hub_download to get the file
        hf_hub_download(
            repo_id=repo_id,
            filename=parquet_filename,
            repo_type="dataset",
            local_dir=destination_dir,
            local_dir_use_symlinks=False,
        )
        print(f"Successfully downloaded to '{download_path}'")
    except Exception as e:
        print(f"An error occurred while downloading '{parquet_filename}': {e}")
        return None

    # --- 3. Load the Parquet file into a pandas DataFrame ---
    try:
        print(f"Loading '{parquet_filename}' into a pandas DataFrame...")
        df = pd.read_parquet(download_path)
        print("Successfully loaded DataFrame.")
        return df
    except Exception as e:
        print(f"Failed to load parquet file '{download_path}': {e}")
        return None


def download_and_extract_sair_structures(
    destination_dir: str,
    file_subset: Optional[list[str]] = None,
    cleanup: bool = True,
):
    """
    Downloads and extracts .tar.gz files from the SandboxAQ/SAIR dataset on Hugging Face.

    This function connects to the specified Hugging Face repository, identifies all
    .tar.gz files within the 'structures_compressed' directory, and downloads and
    extracts them to a local destination. It can download either all files or a
    specified subset.

    Args:
        destination_dir (str): The local path where the files will be downloaded
            and extracted. The directory will be created if it doesn't exist.
        file_subset (list[str], optional): A list of specific .tar.gz filenames to
            download. If None, all .tar.gz files in the directory will be
            downloaded. Defaults to None.
        cleanup (bool, optional): If True, the downloaded .tar.gz archive will be
            deleted once it has been processed, whether or not extraction
            succeeded. Defaults to True.

    Raises:
        ValueError: If any of the files specified in file_subset are not found in
            the repository.
    """
    # --- 1. Setup and Repository Configuration ---
    repo_id = "SandboxAQ/SAIR"
    repo_folder = "structures_compressed"

    print(f"Targeting repository: {repo_id}")
    print(f"Destination directory: {destination_dir}")

    # Create the destination directory if it doesn't already exist
    os.makedirs(destination_dir, exist_ok=True)
    print("Ensured destination directory exists.")

    # --- 2. Get the list of relevant files from the Hugging Face Hub ---
    try:
        all_files = list_repo_files(repo_id, repo_type="dataset")
        # Filter for files within the specified folder that are tar.gz archives
        repo_tars = [
            f.split('/')[-1] for f in all_files
            if f.startswith(repo_folder + '/') and f.endswith(".tar.gz")
        ]
        print(f"Found {len(repo_tars)} total .tar.gz files in '{repo_folder}'.")
    except Exception as e:
        print(f"Error: Could not list files from repository '{repo_id}'. Please check the name and your connection.")
        print(f"Details: {e}")
        return

    # --- 3. Determine which files to download ---
    if file_subset:
        # Validate that all requested files actually exist in the repository
        invalid_files = set(file_subset) - set(repo_tars)
        if invalid_files:
            raise ValueError(f"The following requested files were not found in the repository: {list(invalid_files)}")

        files_to_download = file_subset
        print(f"A subset of {len(files_to_download)} files was specified for download.")
    else:
        files_to_download = repo_tars
        print("No subset specified. All .tar.gz files will be downloaded.")

    # --- 4. Download and Extract each file ---
    for filename in tqdm(files_to_download, desc="Processing files"):
        # Construct the full path within the repository
        repo_filepath = f"{repo_folder}/{filename}"

        download_path = os.path.join(destination_dir, repo_filepath)

        print(f"\nDownloading '{filename}'...")
        try:
            # Download the file from the Hub
            hf_hub_download(
                repo_id=repo_id,
                filename=repo_filepath,
                repo_type="dataset",
                local_dir=destination_dir,
                local_dir_use_symlinks=False,
            )
            print(f"Successfully downloaded to '{download_path}'")

            # Extract the downloaded .tar.gz file
            print(f"Extracting '{filename}'...")
            with tarfile.open(download_path, "r:gz") as tar:
                tar.extractall(path=destination_dir)
            print(f"Successfully extracted contents to '{destination_dir}'")

        except Exception as e:
            print(f"An error occurred while processing '{filename}': {e}")
            continue

        finally:
            # Clean up the downloaded archive if the flag is set and the file exists
            if cleanup and os.path.exists(download_path):
                os.remove(download_path)
                print(f"Cleaned up (deleted) '{download_path}'")

    print("\nOperation completed.")


if __name__ == '__main__':
    # --- Download the parquet dataset ---

    # Define a destination for the data
    output_directory = "./sair_data"

    # Call the function to download and load the data
    sair_df = load_sair_parquet(destination_dir=output_directory)

    # Check if the DataFrame was loaded successfully
    if sair_df is not None:
        print("\n--- DataFrame Info ---")
        sair_df.info()

        print("\n--- DataFrame Head ---")
        print(sair_df.head())

    # --- Download a specific subset of structure tarballs ---
    print("--- Running Scenario 2: Download a specific subset ---")
    # Define the specific files you want to download
    # Replace this with None to download *all* structures
    # (remember, this is >100 files of ~10GB each!)
    subset_to_get = [
        "sair_structures_1006049_to_1016517.tar.gz",
        "sair_structures_100623_to_111511.tar.gz",
    ]
    download_and_extract_sair_structures(destination_dir=output_directory, file_subset=subset_to_get)
```
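Because the full set of structure archives is very large, it can help to work out which archives you need before downloading anything. The sketch below is not part of the official SAIR tooling. It assumes, purely from the two example filenames above, that archives are named `sair_structures_<start>_to_<end>.tar.gz` and that the two numbers form an inclusive range of entry identifiers; neither point is confirmed by the dataset documentation, and the helper names `parse_archive_range` and `find_archives_for_ids` are hypothetical.

```python
import re
from typing import Optional

# ASSUMPTION: filename pattern inferred from the example archive names only.
_ARCHIVE_RE = re.compile(r"^sair_structures_(\d+)_to_(\d+)\.tar\.gz$")


def parse_archive_range(filename: str) -> Optional[tuple[int, int]]:
    """Return (start, end) parsed from an archive name, or None if it does not match."""
    match = _ARCHIVE_RE.match(filename)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def find_archives_for_ids(archive_names: list[str], entry_ids: list[int]) -> list[str]:
    """Select archives whose range contains at least one requested ID.

    ASSUMPTION: the range in the filename is inclusive on both ends.
    """
    selected = []
    for name in archive_names:
        bounds = parse_archive_range(name)
        if bounds is None:
            continue  # skip names that do not follow the assumed pattern
        start, end = bounds
        if any(start <= entry_id <= end for entry_id in entry_ids):
            selected.append(name)
    return selected


if __name__ == "__main__":
    names = [
        "sair_structures_1006049_to_1016517.tar.gz",
        "sair_structures_100623_to_111511.tar.gz",
    ]
    # 100700 is an arbitrary illustrative ID, not a value taken from the dataset.
    print(find_archives_for_ids(names, [100700]))
```

If the assumption holds, the returned list can be passed as `file_subset` to `download_and_extract_sair_structures`, with the archive names coming from the same `list_repo_files` listing that function already builds. Check the naming convention against the repository before relying on it.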

Provider:
maas
Created:
2025-09-04