SAIR
Source: ModelScope community (updated 2026-01-06, indexed 2025-11-29)
Download link: https://modelscope.cn/datasets/SandboxAQ/SAIR
Announcing SAIR
===============
***Structurally-Augmented IC50 Repository***

**In collaboration with Nvidia**
# The Largest Publicly Available Binding Affinity Dataset with Cofolded 3D Structures
**SAIR (Structurally Augmented IC50 Repository)** is the largest public
dataset of protein--ligand 3D structures paired with binding potency
measurements. SAIR contains over one million protein--ligand complexes
(1,048,857 unique pairs) and a total of 5.2 million 3D structures,
curated from the ChEMBL and BindingDB databases and cofolded using the
Boltz-1x model.
- **2.5 TB** Of Publicly Available Data
- **\>5 Million** Cofolded 3D Structures
- **\>1 Million** Unique Protein-Ligand Pairs
By providing this unprecedented scale of structure--activity data, we
aim to enable researchers to train and evaluate new AI models for drug
discovery by bridging the historical gap between molecular structure and
drug potency prediction.
# **Build with SAIR**
SAIR is offered under a CC BY 4.0 license and is now available on
Hugging Face. The data are completely **free for commercial and
non-commercial use**.
SAIR can be used as a baseline for benchmarking biofoundation models or
for training and/or fine-tuning new models for predicting binding
affinity. We would love to hear from you about other ideas you have to
utilize this dataset.
# **How to Cite**
If you use this work, please cite:
```latex
@article{SANDBOXAQ-SAIR2025,
author = {Lemos, Pablo and Beckwith, Zane and Bandi, Sasaank and van
Damme, Maarten and Crivelli-Decker, Jordan and Shields, Benjamin J. and
Merth, Thomas and Jha, Punit K. and De Mitri, Nicola and Callahan,
Tiffany J. and Nish, AJ and Abruzzo, Paul and Salomon-Ferrer, Romelia
and Ganahl, Martin},
title = {SAIR: Enabling Deep Learning for Protein-Ligand Interactions
with a Synthetic Structural Dataset},
elocation-id = {2025.06.17.660168},
year = {2025},
doi = {10.1101/2025.06.17.660168}}
```
# **Bridging a Gap in AI-Driven Drug Design**
Binding affinity prediction is central to drug discovery: it tells us
how strongly a candidate molecule (ligand) binds to a target protein,
which is key for designing effective drugs. In theory, a ligand's
binding affinity is determined by the 3D interactions in the
protein--ligand complex. However, deep learning models that *use* 3D
structures have been limited by a lack of suitable training data. Very few
protein--ligand complexes have both a resolved 3D structure and a
measured potency (IC50, Ki, etc.), so most AI approaches have had to
rely on indirect data like sequences or 2D chemical structures.
One way to overcome this limitation is to generate synthetic training
data using predicted structures. Recent advances in protein structure
prediction (e.g. AlphaFold) mean we can computationally model
protein--ligand complexes and use those for learning. Initial efforts
like the PLINDER dataset demonstrated the promise of this approach. SAIR
was created to dramatically expand on this idea -- providing a massive
repository of computationally folded protein--ligand structures *with*
corresponding experimental affinity values. Our goal is to fill the data
gap and catalyze more accurate and robust ML models for binding affinity
prediction.
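Because the labels paired with each structure are potency measurements such as IC50, regression models are commonly trained on a log-scaled target rather than on raw concentrations. The sketch below shows the standard pIC50 conversion, assuming the input is in nanomolar; the function name is illustrative, and no SAIR column names or units are assumed here, so check the dataset's own documentation before applying it.

```python
import math


def ic50_nm_to_pic50(ic50_nm: float) -> float:
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in mol/L)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be a positive concentration")
    # 1 nM = 1e-9 mol/L, so -log10(x * 1e-9) = 9 - log10(x)
    return 9.0 - math.log10(ic50_nm)


print(ic50_nm_to_pic50(1.0))     # 1 nM
print(ic50_nm_to_pic50(1000.0))  # 1 uM
```

Higher pIC50 values correspond to more potent compounds, which is often a more convenient scale for a loss function than concentrations spanning several orders of magnitude.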
# **More Information**
- [Scientific Manuscript](https://go.sandboxaq.com/rs/175-UKR-711/images/sair_paper.pdf)
- [Data License](https://storage.cloud.google.com/sandboxaq-sair/LICENSE.txt?authuser=0)
- [Blogpost](https://www.sandboxaq.com/post/sair-the-structurally-augmented-ic50-repository)
# **Contact us about SAIR**
Here at SandboxAQ, we're releasing SAIR to our customers and the world
as just a start on revamping drug discovery. Expect new datasets, AI
models, and transformative solutions to follow, across the drug
development pipeline. If you're interested in learning more about SAIR, or
to see how it or models trained upon it might be expanded to include
targets of special interest to your business, we'd love to hear from
you. Contact us at
**[SAIR@sandboxaq.com](mailto:SAIR@sandboxaq.com)**.
# **Downloading the dataset**
The following is an example of how you can download the data
(both the .parquet file and the structure files)
via Python:
```python
import os
import tarfile
from typing import Optional

import pandas as pd
from huggingface_hub import hf_hub_download, list_repo_files
from tqdm import tqdm


def load_sair_parquet(destination_dir: str) -> Optional[pd.DataFrame]:
    """
    Downloads the sair.parquet file from the SandboxAQ/SAIR dataset and loads it
    into a pandas DataFrame.

    Args:
        destination_dir (str): The local path where the parquet file will be
                               downloaded. The directory will be created if it
                               doesn't exist.

    Returns:
        pd.DataFrame: A pandas DataFrame containing the data from the
                      sair.parquet file, or None if the download or load fails.
    """
    # --- 1. Setup and Repository Configuration ---
    repo_id = "SandboxAQ/SAIR"
    parquet_filename = "sair.parquet"

    print(f"Targeting repository: {repo_id}")
    print(f"Targeting file: {parquet_filename}")
    print(f"Destination directory: {destination_dir}")

    # Create the destination directory if it doesn't already exist
    os.makedirs(destination_dir, exist_ok=True)
    print("Ensured destination directory exists.")

    # --- 2. Download the Parquet file from the Hugging Face Hub ---
    download_path = os.path.join(destination_dir, parquet_filename)
    print(f"\nDownloading '{parquet_filename}'...")
    try:
        # Use hf_hub_download to get the file
        hf_hub_download(
            repo_id=repo_id,
            filename=parquet_filename,
            repo_type="dataset",
            local_dir=destination_dir,
            local_dir_use_symlinks=False,
        )
        print(f"Successfully downloaded to '{download_path}'")
    except Exception as e:
        print(f"An error occurred while downloading '{parquet_filename}': {e}")
        return None

    # --- 3. Load the Parquet file into a pandas DataFrame ---
    try:
        print(f"Loading '{parquet_filename}' into a pandas DataFrame...")
        df = pd.read_parquet(download_path)
        print("Successfully loaded DataFrame.")
        return df
    except Exception as e:
        print(f"Failed to load parquet file '{download_path}': {e}")
        return None


def download_and_extract_sair_structures(
    destination_dir: str,
    file_subset: Optional[list[str]] = None,
    cleanup: bool = True,
):
    """
    Downloads and extracts .tar.gz files from the SandboxAQ/SAIR dataset on Hugging Face.

    This function connects to the specified Hugging Face repository, identifies all
    .tar.gz files within the 'structures_compressed' directory, and downloads
    and extracts them to a local destination. It can download either all files
    or a specified subset.

    Args:
        destination_dir (str): The local path where the files will be downloaded
                               and extracted. The directory will be created if it
                               doesn't exist.
        file_subset (list[str], optional): A list of specific .tar.gz filenames
                                           to download. If None, all .tar.gz files
                                           in the directory will be downloaded.
                                           Defaults to None.
        cleanup (bool, optional): If True, the downloaded .tar.gz archive is
                                  deleted once it has been processed, whether or
                                  not extraction succeeded. Defaults to True.

    Raises:
        ValueError: If any of the files specified in file_subset are not found
                    in the repository.
    """
    # --- 1. Setup and Repository Configuration ---
    repo_id = "SandboxAQ/SAIR"
    repo_folder = "structures_compressed"

    print(f"Targeting repository: {repo_id}")
    print(f"Destination directory: {destination_dir}")

    # Create the destination directory if it doesn't already exist
    os.makedirs(destination_dir, exist_ok=True)
    print("Ensured destination directory exists.")

    # --- 2. Get the list of relevant files from the Hugging Face Hub ---
    try:
        all_files = list_repo_files(repo_id, repo_type="dataset")
        # Filter for files within the specified folder that are tar.gz archives
        repo_tars = [
            f.split('/')[-1] for f in all_files
            if f.startswith(repo_folder + '/') and f.endswith(".tar.gz")
        ]
        print(f"Found {len(repo_tars)} total .tar.gz files in '{repo_folder}'.")
    except Exception as e:
        print(f"Error: Could not list files from repository '{repo_id}'. Please check the name and your connection.")
        print(f"Details: {e}")
        return

    # --- 3. Determine which files to download ---
    if file_subset:
        # Validate that all requested files actually exist in the repository
        invalid_files = set(file_subset) - set(repo_tars)
        if invalid_files:
            raise ValueError(f"The following requested files were not found in the repository: {list(invalid_files)}")
        files_to_download = file_subset
        print(f"A subset of {len(files_to_download)} files was specified for download.")
    else:
        files_to_download = repo_tars
        print("No subset specified. All .tar.gz files will be downloaded.")

    # --- 4. Download and Extract each file ---
    for filename in tqdm(files_to_download, desc="Processing files"):
        # Construct the full path within the repository
        repo_filepath = f"{repo_folder}/{filename}"
        download_path = os.path.join(destination_dir, repo_filepath)

        print(f"\nDownloading '{filename}'...")
        try:
            # Download the file from the Hub
            hf_hub_download(
                repo_id=repo_id,
                filename=repo_filepath,
                repo_type="dataset",
                local_dir=destination_dir,
                local_dir_use_symlinks=False,
            )
            print(f"Successfully downloaded to '{download_path}'")

            # Extract the downloaded .tar.gz file
            print(f"Extracting '{filename}'...")
            with tarfile.open(download_path, "r:gz") as tar:
                tar.extractall(path=destination_dir)
            print(f"Successfully extracted contents to '{destination_dir}'")
        except Exception as e:
            print(f"An error occurred while processing '{filename}': {e}")
            continue
        finally:
            # Clean up the downloaded archive if the flag is set and the file exists
            if cleanup and os.path.exists(download_path):
                os.remove(download_path)
                print(f"Cleaned up (deleted) '{download_path}'")

    print("\nOperation completed.")


if __name__ == '__main__':
    # --- Download the parquet dataset ---
    # Define a destination for the data
    output_directory = "./sair_data"

    # Call the function to download and load the data
    sair_df = load_sair_parquet(destination_dir=output_directory)

    # Check if the DataFrame was loaded successfully
    if sair_df is not None:
        print("\n--- DataFrame Info ---")
        sair_df.info()
        print("\n--- DataFrame Head ---")
        print(sair_df.head())

    # --- Download a specific subset of structure tarballs ---
    print("--- Running Scenario 2: Download a specific subset ---")
    # Define the specific files you want to download
    # Replace this with None to download *all* structures
    # (remember, this is >100 files of ~10GB each!)
    subset_to_get = [
        "sair_structures_1006049_to_1016517.tar.gz",
        "sair_structures_100623_to_111511.tar.gz",
    ]
    download_and_extract_sair_structures(destination_dir=output_directory, file_subset=subset_to_get)
```
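Since each archive is roughly 10 GB, it can help to work out which tarballs are needed before downloading anything. The sketch below is a hypothetical helper, not part of the SAIR tooling: it assumes the two numbers in a filename such as `sair_structures_100623_to_111511.tar.gz` are the inclusive lower and upper bounds of the entry IDs stored in that archive. That reading is inferred from the example filenames only, so verify it against the dataset before relying on it.

```python
import re
from typing import Optional

# Assumed naming pattern, inferred from the example filenames in the script above.
_ARCHIVE_RE = re.compile(r"^sair_structures_(\d+)_to_(\d+)\.tar\.gz$")


def parse_archive_range(filename: str) -> Optional[tuple[int, int]]:
    """Return the (low, high) numbers embedded in an archive name, or None."""
    match = _ARCHIVE_RE.match(filename)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def archives_for_ids(filenames: list[str], wanted_ids: list[int]) -> list[str]:
    """Select archives whose assumed ID range contains at least one wanted ID."""
    selected = []
    for name in filenames:
        bounds = parse_archive_range(name)
        if bounds is None:
            continue  # skip names that do not follow the assumed pattern
        low, high = bounds
        if any(low <= entry_id <= high for entry_id in wanted_ids):
            selected.append(name)
    return selected


example_names = [
    "sair_structures_1006049_to_1016517.tar.gz",
    "sair_structures_100623_to_111511.tar.gz",
]
print(archives_for_ids(example_names, [105000]))
# ['sair_structures_100623_to_111511.tar.gz']
```

The returned list can be passed as `file_subset` to `download_and_extract_sair_structures`, with the full list of names taken from `list_repo_files` as in the script above.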
Provider: maas
Created: 2025-09-04



