TBHubbard Dataset

Source: DataCite Commons. Updated 2025-03-19; indexed 2025-04-15.

Download link:

https://dataverse.harvard.edu/citation?persistentId=doi:10.7910/DVN/ZKLRLF

Description:

<h1 id="readme">README</h1>
<p>The <strong>TBHubbard dataset</strong> is a collection of tight-binding descriptions of metal-organic frameworks (MOFs). The structures are derived from the QMOF database; first-principles calculations are performed to obtain the electronic density, which is then projected onto a localized atomic basis set using PAOFLOW. The collection is divided into two sub-sets: <code>tight_binding_model</code> and <code>extended_hubbard_model</code>.</p>
<h2 id="downloading-the-dataset">Downloading the dataset</h2>
<p>To facilitate the download of this large dataset, we provide the <code>2-DOWNLOAD.sh</code> script. Instead of downloading the compressed files from this page one by one, we suggest you download the script first and use it to obtain all the other files.</p>
<pre><code class="language-Shell">$ ./2-DOWNLOAD.sh -h
Usage: ./2-DOWNLOAD.sh -s {TB, EH} -c {aria2c, wget, curl} -d destination/
</code></pre>
<ul>
<li>Use the <code>-c</code> option to select a download client: <a href="https://aria2.github.io/">aria2c</a>, <a href="https://www.gnu.org/software/wget/">wget</a>, or <a href="https://curl.se/">curl</a>. We recommend <code>aria2c</code>, which supports multiple parallel connections for faster downloads.</li>
<li>To download only the Tight-Binding (<code>TB</code>) or the Extended Hubbard (<code>EH</code>) sub-set, use the <code>-s</code> option.</li>
<li>By default, the script downloads all files to a destination folder named <code>TBHubbard/</code>.
You can change the destination folder using the <code>-d</code> option.</li>
</ul>
<p>After downloading the files, use the following commands to decompress them:</p>
<pre><code class="language-Shell">cat tight_binding_model.tar.bz2.part-* | tar -I pbzip2 -xvf -
cat extended_hubbard_model.tar.bz2.part-* | tar -I pbzip2 -xvf -
</code></pre>
<h2 id="sub-sets">Sub-sets</h2>
<h3 id="tight-binding-model">Tight-Binding Model</h3>
<p>The <strong>Tight-Binding Model</strong> sub-set provides electronic structure data for 10,435 metal-organic frameworks (MOFs). The electronic density of each MOF is projected onto a localized atomic basis set, generating a tight-binding lattice Hamiltonian that can be used to study the electronic properties and interactions within the MOF structures. Additionally, Smooth Overlap of Atomic Positions (SOAP) descriptors are computed for 20,375 MOFs, enriching the dataset with topological information about the local atomic environments.</p>
<ul>
<li><p><strong><code>tb_dft/</code></strong>: This directory contains the Quantum ESPRESSO (QE) calculations used for the tight-binding projections. It includes all relevant input and output files, the tight-binding Hamiltonian, and detailed results from the PAOFLOW projections (e.g., <code>arry.pkl</code>, <code>paoflow.out</code>). The <code>bader.out</code>, <code>ACF.dat</code>, and other related files describe the charge distribution and electronic structure. SCF calculation outputs such as <code>rho.cube</code> and <code>scf</code> files are also included for further analysis of the electronic density.
For detailed instructions on the tight-binding projection workflow, please refer to <a href="tight_binding_model/README.md"><strong><code>tight_binding_model/README.md</code></strong></a>.</p>
</li>
<li><p><strong><code>soap_of_mofs/</code></strong>: This folder contains the SOAP descriptors, which characterize the local atomic environments within the MOFs. They come in two variants, <strong>SOAP-3 Å</strong> and <strong>SOAP-5 Å</strong>, which capture the atomic structure at different length scales and so offer both detailed and broader topological information. Each file is named by appending the suffix <code>_soap.npz</code> to the MOF name and contains these descriptors. For further information on computing SOAP descriptors, please refer to <a href="tight_binding_model/scripts/compute_soap-descriptors/README.md"><strong><code>tight_binding_model/scripts/compute_soap-descriptors/README.md</code></strong></a>.</p>
</li>
<li><p><strong><code>scripts/</code></strong>: A collection of helper tools for visualizing the data and generating inputs for further analysis. These scripts make it easier to manipulate, visualize, and use the tight-binding and SOAP data in subsequent computational studies and modeling. To learn how to compute tight-binding embeddings from QE, see <a href="tight_binding_model/scripts/compute_tight-binding_embeddings_from_qe/README.md"><strong><code>tight_binding_model/scripts/compute_tight-binding_embeddings_from_qe/README.md</code></strong></a>.
To set up and run SCF calculations with QE, follow the instructions in <a href="tight_binding_model/scripts/setup_qe_scf/README.md"><strong><code>tight_binding_model/scripts/setup_qe_scf/README.md</code></strong></a>.</p>
</li>
</ul>
<p>For more detailed guidance, please refer to the appropriate <code>README.md</code> files in each directory.</p>
<h3 id="extended-hubbard-model">Extended Hubbard Model</h3>
<p>This sub-set contains electronic structure calculations for 242 MOFs. The electronic density is projected onto a localized atomic basis set, providing a tight-binding lattice Hamiltonian for each MOF. A set of 428 calculations is also provided for the self-consistent computation of the Hubbard parameters U and V of 242 MOFs. The set is divided according to the manifold chosen for U and V: d and s orbitals correspond to <code>ds_perturbations</code>, and d and p orbitals correspond to <code>dp_perturbations</code>. Together, the tight-binding projection and the Hubbard parameters define the Extended Hubbard model lattice Hamiltonian.</p>
<ul>
<li><strong><code>dp_perturbations</code></strong> and <strong><code>ds_perturbations</code></strong>: QE calculations with the tight-binding projection, including input and output files, as well as the tight-binding Hamiltonian (<code>arry.pkl</code>).
The inputs and outputs of the Hubbard parameter computation (<code>hp.p</code>) are also provided for dp and ds perturbations in each corresponding folder (<code>Hubbard_parameters_full.dat</code> and <code>Hubbard_parameters_nn.dat</code>).</li>
<li><strong><code>extend_hubb_data.json</code></strong>: Tabulated properties containing the main QE input and output information for each MOF, divided into dp and ds perturbations.</li>
<li><strong><code>scripts/</code></strong>: Helper tools for visualizing the main properties and generating inputs.</li>
</ul>
<h2 id="license">License</h2>
<p>All dataset files are distributed under the <a href="https://cdla.dev/permissive-2-0/"><code>CDLA-Permissive-2.0</code></a> license, while the source code files are distributed under the <a href="https://opensource.org/license/bsd-3-clause"><code>BSD-3-Clause</code></a> license.</p>
<p><strong>Copyright (c) 2025, International Business Machines. All rights reserved.</strong></p>
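
The decompression step described under "Downloading the dataset" can also be wrapped in a small helper. The following is only a sketch under stated assumptions: the function name `extract_parts` and its destination argument are illustrative and not part of the dataset's scripts, and it falls back to plain `bzip2` when `pbzip2` is not installed. It pipes the decompressor into `tar` rather than using `tar -I`, so it does not depend on GNU tar.

```shell
# Hypothetical helper (not shipped with the dataset): concatenate split
# archive parts, decompress the stream, and unpack it into a directory.
extract_parts() {
    prefix="$1"        # e.g. tight_binding_model.tar.bz2.part-
    dest="${2:-.}"     # destination directory (illustrative argument)

    # Fail early if no part files match the prefix.
    if ! ls "${prefix}"* >/dev/null 2>&1; then
        echo "no files match ${prefix}*" >&2
        return 1
    fi
    mkdir -p "$dest" || return 1

    # Prefer the parallel pbzip2 used in the README; otherwise use bzip2.
    if command -v pbzip2 >/dev/null 2>&1; then
        decompress=pbzip2
    else
        decompress=bzip2
    fi

    # The shell expands the glob in sorted order, so the parts are
    # concatenated in sequence, as in the README's own command.
    cat "${prefix}"* | "$decompress" -dc | tar -xf - -C "$dest"
}
```

For example, `extract_parts tight_binding_model.tar.bz2.part- TBHubbard/` would unpack the Tight-Binding sub-set into `TBHubbard/`, assuming the part files are in the current directory.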

Provider:

Harvard Dataverse

Created:

2025-03-07
