ChemData700K

Name: ChemData700K
Creator: AI4Chem
Published: 2026-05-17 04:30:53
License: No description available

Source: OpenDataLab | Updated: 2026-05-17 | Indexed: 2024-05-09

Download link:

https://opendatalab.org.cn/AI4Chem/ChemData700K


Resource description:


ChemData700K is an instruction-tuning dataset for improving the chemistry capabilities of large language models (LLMs). It covers nine core chemistry tasks and contains 730K high-quality question-answer pairs. ChemData was constructed by the Shanghai Artificial Intelligence Laboratory, a member of the Large Model Corpus Data Alliance, to support the fine-tuning of chemical language models and help unlock their full chemical potential.

## Dataset Source

Effective chemical language models depend on diverse, high-quality data. The research team therefore collected a large volume of chemical data from well-known online databases, including PubChem, ChEMBL, ChEBI, ZINC, USPTO, ORDerly, ChemXiv, LibreTexts Chemistry, Wikipedia, and Wikidata, among others, and built the ChemData dataset from these sources.

## Dataset Composition

ChemData contains 7,000,000 question-answer pairs for instruction tuning. It covers a broad range of specialized chemistry knowledge and targets three types of tasks: Molecules, Reactions, and other Domain-specific tasks.

● Molecules: Name Conversion, Caption2Mol, Mol2Caption, and Molecular Property Prediction. These tasks aim to improve the language model's understanding of chemical molecules.
● Reactions: Retrosynthesis, Product Prediction, Yield Prediction, Temperature Prediction, and Solvent Prediction, which together cover multiple aspects of chemical reactions.
● Other Domain-specific tasks: all remaining data that cannot be clearly classified. These data broaden the chemical language model's understanding of the chemistry domain as a whole.
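The task grouping described above can be written down as a small lookup table. The following is only a minimal sketch: the dictionary layout and the names `TASK_TAXONOMY`, `DOMAIN_SPECIFIC`, `category_of`, and `count_named_tasks` are illustrative assumptions and do not reflect the dataset's actual schema or field names. It also assumes that the "nine core tasks" are the four molecule tasks plus the five reaction tasks named in the description.

```python
# Task taxonomy as listed in the dataset description.
# The structure and all identifiers here are hypothetical, chosen for
# illustration; they are not taken from the dataset files themselves.
TASK_TAXONOMY = {
    "Molecules": [
        "Name Conversion",
        "Caption2Mol",
        "Mol2Caption",
        "Molecular Property Prediction",
    ],
    "Reactions": [
        "Retrosynthesis",
        "Product Prediction",
        "Yield Prediction",
        "Temperature Prediction",
        "Solvent Prediction",
    ],
}

# Catch-all category for data that fits neither named group.
DOMAIN_SPECIFIC = "Domain-specific"


def category_of(task: str) -> str:
    """Return the category of a named task, or the catch-all category."""
    for category, tasks in TASK_TAXONOMY.items():
        if task in tasks:
            return category
    return DOMAIN_SPECIFIC


def count_named_tasks() -> int:
    """Count the explicitly named tasks across all named categories."""
    return sum(len(tasks) for tasks in TASK_TAXONOMY.values())
```

For example, `category_of("Retrosynthesis")` returns `"Reactions"`, while any task name not listed falls back to `"Domain-specific"`, mirroring how the description treats unclassified data.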

Provider:

AI4Chem

Created:

2024-04-23

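Collected summary

Dataset introduction

Background and challenges

Background overview

ChemData700K is a large-scale instruction-tuning dataset of 730K high-quality question-answer pairs, built to improve the performance of large language models in chemistry. It covers nine core chemistry tasks across molecules, reactions, and other domain-specific tasks, with data drawn from multiple well-known chemistry databases to ensure diversity and accuracy. The dataset was released by the Shanghai Artificial Intelligence Laboratory to support the fine-tuning of chemical language models and unlock their chemical potential.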

The above content was collected and summarized by 遇见数据集.
