five

islamrokon/Test

收藏
hugging_face2023-11-11 更新2024-03-04 收录
下载链接:
https://hf-mirror.com/datasets/islamrokon/Test
下载链接
链接失效反馈
资源简介:
--- configs: - config_name: default data_files: - split: train path: data/train-* - split: test path: data/test-* dataset_info: features: - name: question dtype: string - name: answer dtype: string - name: input_ids sequence: int32 - name: attention_mask sequence: int32 - name: labels sequence: int64 splits: - name: train num_bytes: 17012.625 num_examples: 14 - name: test num_bytes: 2430.375 num_examples: 2 download_size: 17101 dataset_size: 19443.0 --- # Dataset Card for "Test" [More Information needed](https://github.com/huggingface/datasets/blob/main/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
提供机构:
islamrokon
原始信息汇总

数据集概述

配置

  • 默认配置
    • 训练数据
      • 路径:data/train-*
    • 测试数据
      • 路径:data/test-*

数据特征

  • 问题
    • 数据类型:字符串
  • 答案
    • 数据类型:字符串
  • 输入ID
    • 数据类型:整数序列(int32)
  • 注意力掩码
    • 数据类型:整数序列(int32)
  • 标签
    • 数据类型:整数序列(int64)

数据分割

  • 训练集
    • 字节数:17012.625
    • 样本数:14
  • 测试集
    • 字节数:2430.375
    • 样本数:2

数据大小

  • 下载大小:17101字节
  • 数据集大小:19443.0字节
用户留言
有没有相关的论文或文献参考?
这个数据集是基于什么背景创建的?
数据集的作者是谁?
能帮我联系到这个数据集的作者吗?
这个数据集如何下载?
点击留言
数据主题
具身智能
数据集  4099个
机构  8个
大模型
数据集  439个
机构  10个
无人机
数据集  37个
机构  6个
指令微调
数据集  36个
机构  6个
蛋白质结构
数据集  50个
机构  8个
空间智能
数据集  21个
机构  5个
5,000+
优质数据集
54 个
任务类型
进入经典数据集
热门数据集

Global Firepower Index (GFI)

Global Firepower Index (GFI) 是一个评估全球各国军事力量的综合指数。该指数考虑了超过50个因素,包括军事预算、人口、陆地面积、海军力量、空军力量、自然资源、后勤能力、地理位置等。数据集提供了每个国家的详细评分和排名,帮助分析和比较各国的军事实力。

www.globalfirepower.com 收录

中国沿海地面沉降数据数据库(2015,2020)

数据来源https://doi.org/10.1038/s41467-022-34525-w。由于缺少详细公开的中国沿海地区地面升降数据,本数据集是通过系统的文献综述,对中国沿海地区的地面升降进行了详细整理,建立了中国沿海城市地面沉降数据库,包括中国沿海城市不同时段沉降速率以及测量手段。所用文献主要来源于CNKI文献数据库、Web of Science,以及灰色文献(Grey literature),如报告、规划等。该数据可用于相对海平面变化研究,沿海极值水位和海岸洪水危险性研究,为沿海地面沉降防控以及沿海适应性措施规划等提供参考和依据。

国家青藏高原科学数据中心 收录

GenshinVoice

GenshinVoice是一个包含原神游戏中所有语音文件及其对应文字文本的数据集。数据集直接从游戏中提取,包含多种语言版本,用于学习和研究目的。

github 收录

aqcat25

<h1 align="center" style="font-size: 36px;"> <span style="color: #FFD700;">AQCat25 Dataset:</span> Unlocking spin-aware, high-fidelity machine learning potentials for heterogeneous catalysis </h1> ![datset_schematic](https://cdn-uploads.huggingface.co/production/uploads/67256b7931376d3bacb18de0/W1Orc_AmSgRez5iKH0qjC.jpeg) This repository contains the **AQCat25 dataset**. AQCat25-EV2 models can be accessed [here](https://huggingface.co/SandboxAQ/aqcat25-ev2). The AQCat25 dataset provides a large and diverse collection of **13.5 million** DFT calculation trajectories, encompassing approximately 5K materials and 47K intermediate-catalyst systems. It is designed to complement existing large-scale datasets by providing calculations at **higher fidelity** and including critical **spin-polarized** systems, which are essential for accurately modeling many industrially relevant catalysts. Please see our [website](https://www.sandboxaq.com/aqcat25) and [paper](https://cdn.prod.website-files.com/622a3cfaa89636b753810f04/68ffc1e7c907b6088573ba8c_AQCat25.pdf) for more details about the impact of the dataset and [models](https://huggingface.co/SandboxAQ/aqcat25-ev2). ## 1. AQCat25 Dataset Details This repository uses a hybrid approach, providing lightweight, queryable Parquet files for each split alongside compressed archives (`.tar.gz`) of the raw ASE database files. More details can be found below. ### Queryable Metadata (Parquet Files) A set of Parquet files provides a "table of contents" for the dataset. They can be loaded directly with the `datasets` library for fast browsing and filtering. Each file contains the following columns: | Column Name | Data Type | Description | Example | | :--- | :--- | :--- | :--- | | `frame_id` | string | **Unique ID for this dataset**. Formatted as `database_name::index`. | `data.0015.aselmdb::42` | | `adsorption_energy`| float | **Key Target**. The calculated adsorption energy in eV. | -1.542 | | `total_energy` | float | The raw total energy of the adslab system from DFT (in eV). | -567.123 | | `fmax` | float | The maximum force magnitude on any single atom in eV/Å. | 0.028 | | `is_spin_off` | boolean | `True` if the system is non-magnetic (VASP ISPIN=1). | `false` | | `mag` | float | The total magnetization of the system (µB). | 32.619 | | `slab_id` | string | Identifier for the clean slab structure. | `mp-1216478_001_2_False` | | `adsorbate` | string | SMILES or chemical formula of the adsorbate. | `*NH2N(CH3)2` | | `is_rerun` | boolean | `True` if the calculation is a continuation. | `false` | | `is_md` | boolean | `True` if the frame is from a molecular dynamics run. | `false` | | `sid` | string | The original system ID from the source data. | `vadslabboth_82` | | `fid` | integer | The original frame index (step number) from the source VASP calculation. | 0 | --- #### Understanding `frame_id` and `fid` | Field | Purpose | Example | | :--- | :--- | :--- | | `fid` | **Original Frame Index**: This is the step number from the original VASP relaxation (`ionic_steps`). It tells you where the frame came from in its source simulation. | `4` (the 5th frame of a specific VASP run) | | `frame_id` | **Unique Dataset Pointer**: This is a new ID created for this specific dataset. It tells you exactly which file (`data.0015.aselmdb`) and which row (`101`) to look in to find the full atomic structure. | `data.0015.aselmdb::101` | --- ## Downloadable Data Archives The full, raw data for each split is available for download in compressed `.tar.gz` archives. The table below provides direct download links. The queryable Parquet files for each split can be loaded directly using the `datasets` library as shown in the "Example Usage" section. The data currently available for download (totaling ~11.1M frames, as listed in the table below) is the initial dataset version (v1.0) released on September 10, 2025. The 13.5M frame count mentioned in our paper and the introduction includes additional data used to rebalance non-magnetic element systems and add a low-fidelity spin-on dataset. These new data splits will be added to this repository soon. | Split Name | Structures | Archive Size | Download Link | | :--- | :--- | :--- | :--- | | ***In-Domain (ID)*** | | | | | Train | `7,386,750` | `23.8 GB` | [`train_id.tar.gz`](https://huggingface.co/datasets/SandboxAQ/aqcat25-dataset/resolve/main/train_id.tar.gz) | | Validation | `254,498` | `825 MB` | [`val_id.tar.gz`](https://huggingface.co/datasets/SandboxAQ/aqcat25-dataset/resolve/main/val_id.tar.gz) | | Test | `260,647` | `850 MB` | [`test_id.tar.gz`](https://huggingface.co/datasets/SandboxAQ/aqcat25-dataset/resolve/main/test_id.tar.gz) | | Slabs | `898,530` | `2.56 GB` | [`id_slabs.tar.gz`](https://huggingface.co/datasets/SandboxAQ/aqcat25-dataset/resolve/main/id_slabs.tar.gz) | | ***Out-of-Distribution (OOD) Validation*** | | | | | OOD Ads (Val) | `577,368` | `1.74 GB` | [`val_ood_ads.tar.gz`](https://huggingface.co/datasets/SandboxAQ/aqcat25-dataset/resolve/main/val_ood_ads.tar.gz) | | OOD Materials (Val) | `317,642` | `963 MB` | [`val_ood_mat.tar.gz`](https://huggingface.co/datasets/SandboxAQ/aqcat25-dataset/resolve/main/val_ood_mat.tar.gz) | | OOD Both (Val) | `294,824` | `880 MB` | [`val_ood_both.tar.gz`](https://huggingface.co/datasets/SandboxAQ/aqcat25-dataset/resolve/main/val_ood_both.tar.gz) | | OOD Slabs (Val) | `28,971` | `83 MB` | [`val_ood_slabs.tar.gz`](https://huggingface.co/datasets/SandboxAQ/aqcat25-dataset/resolve/main/val_ood_slabs.tar.gz) | | ***Out-of-Distribution (OOD) Test*** | | | | | OOD Ads (Test) | `346,738` | `1.05 GB` | [`test_ood_ads.tar.gz`](https://huggingface.co/datasets/SandboxAQ/aqcat25-dataset/resolve/main/test_ood_ads.tar.gz) | | OOD Materials (Test) | `315,931` | `993 MB` | [`test_ood_mat.tar.gz`](https://huggingface.co/datasets/SandboxAQ/aqcat25-dataset/resolve/main/test_ood_mat.tar.gz) | | OOD Both (Test) | `355,504` | `1.1 GB` | [`test_ood_both.tar.gz`](https://huggingface.co/datasets/SandboxAQ/aqcat25-dataset/resolve/main/test_ood_both.tar.gz) | | OOD Slabs (Test) | `35,936` | `109 MB` | [`test_ood_slabs.tar.gz`](https://huggingface.co/datasets/SandboxAQ/aqcat25-dataset/resolve/main/test_ood_slabs.tar.gz) | --- ## 2. Dataset Usage Guide This guide outlines the recommended workflow for accessing and querying the AQCat25 dataset. ### 2.1 Initial Setup Before you begin, you need to install the necessary libraries and authenticate with Hugging Face. This is a one-time setup. ```bash pip install datasets pandas ase tqdm requests huggingface_hub ase-db-backends ``` **1. Create a Hugging Face Account:** If you don't have one, create an account at [huggingface.co](https://huggingface.co/join). **2. Create an Access Token:** Navigate to your **Settings -> Access Tokens** page or click [here](https://huggingface.co/settings/tokens). Create a new token with at least **`read`** permissions. Copy this token to your clipboard. **3. Log in via the Command Line:** Open your terminal and run the following command: ```bash hf auth login ``` ### 2.2 Get the Helper Scripts You may copy the scripts directly from this repository, or download them by running the following in your local python environment: ```python from huggingface_hub import snapshot_download snapshot_download( repo_id="SandboxAQ/aqcat25", repo_type="dataset", allow_patterns=["scripts/*", "README.md"], local_dir="./aqcat25" ) ``` This will create a local folder named aqcat25 containing the scripts/ directory. ### 2.3 Download Desired Dataset Splits Data splits may be downloaded directly via the Hugging Face UI, or via the `download_split.py` script (found in `aqcat25/scripts/`). ```bash python aqcat25/scripts/download_split.py --split val_id ``` This will download `val_id.tar.gz` and extract it to a new folder named `aqcat_data/val_id/`. ### 2.4 Query the Dataset Use the `query_aqcat.py` script to filter the dataset and extract the specific atomic structures you need. It first queries the metadata on the Hub and then extracts the full structures from your locally downloaded files. **Example 1: Find all CO and OH structures in the test set:** ```bash python aqcat25/scripts/query_aqcat.py \ --split test_id \ --adsorbates "*CO" "*OH" \ --data-root ./aqcat_data/test_id ``` **Example 2: Find structures on metal slabs with low adsorption energy:** ```bash python aqcat25/scripts/query_aqcat.py \ --split val_ood_both \ --max-energy -2.0 \ --material-type nonmetal \ --magnetism magnetic \ --data-root ./aqcat_data/val_ood_both \ --output-file low_energy_metals.extxyz ``` **Example 3: Find CO on slabs containing both Ni AND Se with adsorption energy between -2.5 and -1.5 eV with a miller index of 011** ```bash python aqcat25/scripts/query_aqcat.py \ --split val_ood_ads \ --adsorbates "*COCH2OH" \ --min-energy -2.5 \ --max-energy -1.5 \ --contains-elements "Ni" "Se" \ --element-filter-mode all \ --facet 011 \ --data-root ./aqcat_data/val_ood_ads \ --output-file COCH2OH_on_ni_and_se.extxyz ``` --- ## 3. How to Cite If you use the AQCat25 dataset or the models in your research, please cite the following paper: ``` Omar Allam, Brook Wander, & Aayush R. Singh. (2025). AQCat25: Unlocking spin-aware, high-fidelity machine learning potentials for heterogeneous catalysis. arXiv preprint arXiv:XXXX.XXXXX. ``` ### BibTeX Entry ```bibtex @article{allam2025aqcat25, title={{AQCat25: Unlocking spin-aware, high-fidelity machine learning potentials for heterogeneous catalysis}}, author={Allam, Omar and Wander, Brook and Singh, Aayush R}, journal={arXiv preprint arXiv:2510.22938}, year={2025}, eprint={2510.22938}, archivePrefix={arXiv}, primaryClass={cond-mat.mtrl-sci} } ```

魔搭社区 收录

TongueDx Dataset

TongueDx数据集是一个专为远程舌诊研究设计的综合性舌象图像数据集,由香港理工大学和新加坡管理大学的研究团队创建。该数据集包含5109张图像,涵盖了多种环境条件下的舌象,图像通过智能手机和笔记本电脑摄像头采集,具有较高的多样性和代表性。数据集不仅包含舌象图像,还提供了详细的舌面属性标注,如舌色、舌苔厚度等,并附有受试者的年龄、性别等人口统计信息。数据集的创建过程包括图像采集、舌象分割、标准化处理和多标签标注,旨在解决远程医疗中舌诊图像质量不一致的问题。该数据集的应用领域主要集中在远程医疗和中医诊断,旨在通过自动化技术提高舌诊的准确性和可靠性。

arXiv 收录