DCAIKU Catalysis Database
National Basic Science Data Center (国家基础学科公共科学数据中心), indexed 2024-03-05
Download link:
https://www.nbsdc.cn/general/dataDetail?id=64edc645bb16e07753c3426f&type=1
Resource description:
The DACAIKU Catalysis Database is collected with a self-developed distributed chemical data acquisition system. Built on a distributed framework, the system is an efficient, scalable approach to chemical data collection, and it uses node monitoring to keep the collection process stable and reliable. It has four main components: data collection nodes, a central service node, a task scheduler, and an object storage system. Each data collection node is deployed on a separate physical server and can collect data from designated data sources.
1. The data collection workflow is as follows:
(1) A data collection task is submitted on the central service node.
(2) After analyzing the task, the task scheduler uses the Hash Load Balance algorithm to distribute collection tasks across the data collection nodes. The central service node monitors task execution on each collection node, and each node collects data using multi-threaded parallelism.
(3) The collected data is stored in the object storage system, from which it can be batch-downloaded through the central service node.
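The dispatch and collection steps above can be sketched roughly as follows. The description names the Hash Load Balance algorithm but gives no details, so this is only a minimal hash-modulo illustration, not the system's actual scheduler. The node names, task IDs, and the functions `assign_node`, `dispatch`, `collect`, and `run_node` are all hypothetical.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def assign_node(task_id: str, nodes: list) -> str:
    """Pick a node by hashing the task ID (stable across runs, unlike built-in hash())."""
    digest = hashlib.sha256(task_id.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


def dispatch(task_ids: list, nodes: list) -> dict:
    """Group task IDs by the node each one hashes to."""
    plan = {node: [] for node in nodes}
    for task_id in task_ids:
        plan[assign_node(task_id, nodes)].append(task_id)
    return plan


def collect(task_id: str) -> dict:
    """Hypothetical placeholder for fetching data from a designated source."""
    return {"task": task_id, "status": "done"}


def run_node(tasks: list, max_workers: int = 4) -> list:
    """Run one node's tasks on a thread pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(collect, tasks))


if __name__ == "__main__":
    nodes = ["node-a", "node-b", "node-c"]        # assumed node names
    tasks = [f"task-{i}" for i in range(10)]      # assumed task IDs
    for node, assigned in dispatch(tasks, nodes).items():
        print(node, len(run_node(assigned)))
```

A plain modulo over the node list reshuffles most tasks whenever a node is added or removed. A production scheduler might use consistent hashing instead, but the source does not say which variant is used.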
2. Data collection is usually accompanied by data cleaning, which the system supports in two ways. Data can be batch-downloaded through the download interface and cleaned locally, or cleaning can be run through the system's data cleaning task interface. Once the basic collection tasks were complete, Python data cleaning scripts were written for the collected data and submitted as cleaning tasks through this interface. The data on each collection node was then cleaned, verified by calculation, and filtered according to requirements to produce the final result dataset.
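A cleaning script of the kind described might look like the sketch below. The source does not publish its scripts or schema, so the field names (`reaction_id`, `catalyst`, `yield_percent`), the 0 to 100 range check, and the sample records are all assumptions made up for illustration. The script shows only the general pattern of verifying, filtering, and de-duplicating records.

```python
import json

# Hypothetical schema: the real dataset's fields are not given in the description.
REQUIRED_FIELDS = ("reaction_id", "catalyst", "yield_percent")


def is_valid(record: dict) -> bool:
    """Verification step: required fields are non-empty and the yield is within 0-100."""
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or str(value).strip() == "":
            return False
    try:
        yield_value = float(record["yield_percent"])
    except (TypeError, ValueError):
        return False
    return 0.0 <= yield_value <= 100.0


def clean(records: list) -> list:
    """Drop invalid records, normalize whitespace, and keep the first record per reaction_id."""
    seen = set()
    cleaned = []
    for record in records:
        if not is_valid(record):
            continue
        key = str(record["reaction_id"]).strip()
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({
            "reaction_id": key,
            "catalyst": str(record["catalyst"]).strip(),
            "yield_percent": float(record["yield_percent"]),
        })
    return cleaned


if __name__ == "__main__":
    # Made-up sample records, not taken from the database.
    raw = [
        {"reaction_id": "r1", "catalyst": " Pd/C ", "yield_percent": "87.5"},
        {"reaction_id": "r1", "catalyst": "Pd/C", "yield_percent": 87.5},  # duplicate ID
        {"reaction_id": "r2", "catalyst": "", "yield_percent": 40},        # empty catalyst
        {"reaction_id": "r3", "catalyst": "TiO2", "yield_percent": 140},   # out of range
    ]
    print(json.dumps(clean(raw), indent=2))
```

The same functions could be run locally on downloaded files or packaged as a task for the cleaning interface. How tasks are submitted to that interface is not specified in the description, so it is left out here.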
Provided by:
University of Science and Technology of China
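Collected summary
Dataset introduction

Background and challenges
Background overview
The DCAIKU Catalysis Database is built on a distributed chemical data acquisition system. It totals 20.23 GB across 1006 files, mainly in JSON and DOCX formats. The database relies on multi-threaded parallel collection and data cleaning interfaces to keep data collection stable and reliable, and it is suited to research and applications in chemistry.
The summary above was collected and generated by 遇见数据集.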



