ToyMix, LargeMix, UltraLarge
Source: arXiv | Updated: 2023-10-18 | Indexed: 2024-06-21
Download links:
https://zenodo.org/record/8372621, https://zenodo.org/record/8370548
Description:
This study introduces three extensive, carefully curated multi-label datasets covering nearly 100 million molecules and over 3000 sparsely defined tasks, for a total of more than 13 billion individual labels, making them the largest datasets of their kind to date. The datasets are designed for supervised training of foundation models and combine labels for quantum and biological properties, obtained through simulations and wet-lab experiments respectively. The labels are also multi-level, spanning both node-level and graph-level tasks. This label diversity supports effective transfer learning and makes it possible to build foundation models that generalize better across a wide range of downstream molecular modeling tasks.
Provided by:
Mila - Québec AI Institute
Created:
2023-10-06



