jessicafry/TIDMAD

Source: Hugging Face. Updated 2026-04-01; indexed 2024-06-12.
Download link:
https://hf-mirror.com/datasets/jessicafry/TIDMAD
Resource description:
---
license: cc-by-4.0
DOI: 10.5281/zenodo.11458076
TIDMAD: Time Series Dataset for Discovering Dark Matter with AI Denoising
---

# TIDMAD: Time Series Dataset for Discovering Dark Matter with AI Denoising

<div style="border: 1px solid; padding: 10px;">

**DOI:** [10.5281/zenodo.11458076](https://doi.org/10.5281/zenodo.11458076)

**Please see our GitHub [https://github.com/jessicafry/TIDMAD](https://github.com/jessicafry/TIDMAD) for download scripts and benchmark procedures.**

</div>

## Download Data

The TIDMAD dataset can be downloaded using the `download_data.py` script provided in this GitHub repository. This script is a command-line utility for downloading the TIDMAD dataset, which is hosted on Hugging Face and at the San Diego Supercomputer Center (SDSC) via the Pelican object store. The script supports flexible partial downloads, automatic source fallback, and a dry-run mode for inspecting commands before executing them.

### Requirements

- **Python 3.7+**
- [`huggingface_hub`](https://github.com/huggingface/huggingface_hub) Python package (for Hugging Face downloads)
- [`pelican`](https://pelicanplatform.org/) CLI (for SDSC downloads)

---

### Installation

1. **Clone or download** this repository, then navigate to the directory containing `download_data.py`.
2. **Install the Python dependency:**

   ```bash
   pip install huggingface_hub
   ```

3. **(Optional) Install Pelican** if you plan to use SDSC as a download source: follow the [Pelican installation guide](https://pelicanplatform.org/get-pelican) for your platform.
4. **Verify your setup:**

   ```bash
   python download_data.py -h
   ```

### Usage

```bash
python download_data.py [OPTIONS]
```

Running the script without any arguments downloads the **full dataset** (20 training + 20 validation + 208 science files) to the current working directory using Hugging Face. You will be prompted to confirm before any files are downloaded.
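A partial download can also be scripted from Python. The following is a minimal sketch, not part of the repository: the helper name `build_download_command` is hypothetical, it assumes `download_data.py` sits in the current working directory, and it only uses flags and value ranges listed in the Arguments Reference of this README.

```python
import sys
from typing import List


def build_download_command(
    output_dir: str,
    train_files: int = 20,
    validation_files: int = 20,
    science_files: int = 208,
    source: str = "auto",
    weak: bool = False,
    skip_downloaded: bool = False,
    dry_run: bool = False,
) -> List[str]:
    """Hypothetical helper: assemble an argv list for download_data.py."""
    # Ranges and choices mirror the README's Arguments Reference.
    if not 0 <= train_files <= 20:
        raise ValueError("train_files must be between 0 and 20")
    if not 0 <= validation_files <= 20:
        raise ValueError("validation_files must be between 0 and 20")
    if not 0 <= science_files <= 208:
        raise ValueError("science_files must be between 0 and 208")
    if source not in ("auto", "hf", "sdsc"):
        raise ValueError("source must be 'auto', 'hf', or 'sdsc'")

    cmd = [
        sys.executable, "download_data.py",
        "-o", output_dir,  # the README states this directory must already exist
        "-t", str(train_files),
        "-v", str(validation_files),
        "-s", str(science_files),
        "--source", source,
    ]
    if weak:
        cmd.append("--weak")
    if skip_downloaded:
        cmd.append("--skip_downloaded")
    if dry_run:
        cmd.append("--print")  # dry run: print commands without executing them
    return cmd


# Example: preview a small download of 2 training and 2 validation files.
cmd = build_download_command(
    "/data/tidmad", train_files=2, validation_files=2, science_files=0, dry_run=True
)
print(" ".join(cmd[1:]))
```

Passing the returned list to `subprocess.run(cmd, check=True)` would launch the script. Validating the counts before launching keeps a typo from surfacing only after the confirmation prompt.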
### Arguments Reference

| Argument | Short | Type | Default | Description |
|---|---|---|---|---|
| `--output_dir` | `-o` | `str` | Current directory | Directory where downloaded files will be saved. Must already exist. |
| `--train_files` | `-t` | `int` (0–20) | `20` | Number of training files to download. |
| `--validation_files` | `-v` | `int` (0–20) | `20` | Number of validation files to download. |
| `--science_files` | `-s` | `int` (0–208) | `208` | Number of science files to download. |
| `--source` | n/a | `auto`\|`hf`\|`sdsc` | `auto` | Download source (see [Download Sources](#download-sources)). |
| `--force` | `-f` | flag | `False` | Skip the size summary and confirmation prompt. |
| `--skip_downloaded` | `-sk` | flag | `False` | Skip files that already exist in `--output_dir`. |
| `--weak` | `-w` | flag | `False` | Download the weak signal variant of training/validation files (see [Weak Signal Variant](#weak-signal-variant)). |
| `--print` | `-p` | flag | `False` | Print the equivalent download commands without executing them (dry run). |

### Download Sources

The script supports three download source modes, controlled by `--source`:

#### `auto` (default)

Attempts to download each file from **Hugging Face first**. If the Hugging Face download fails for any reason, it automatically retries using **SDSC via Pelican**. This is the recommended mode for most users.

#### `hf`

Downloads exclusively from the [Hugging Face dataset repository](https://huggingface.co/datasets/jessicafry/TIDMAD). No fallback is attempted. Requires the `huggingface_hub` package.

#### `sdsc`

Downloads exclusively from SDSC using the Pelican object store. Requires the `pelican` CLI to be installed and configured.

### Output File Naming

All files are saved with zero-padded four-digit indices:

```
abra_training_0000.h5
...
abra_training_0019.h5
abra_validation_0000.h5
...
abra_validation_0019.h5
abra_science_0000.h5
...
abra_science_0207.h5
```

All files are placed flat in `--output_dir`; no subdirectories are created.

### Dry-Run Mode

Use `-p` / `--print` to preview the exact commands that would be executed, without downloading anything. This is useful for scripting, auditing, or manual downloading.

### Resuming Interrupted Downloads

If a download is interrupted, use `--skip_downloaded` (`-sk`) on the next run. The script will check for existing files in `--output_dir` and skip any that are already present:

```bash
python download_data.py -o /data/tidmad --skip_downloaded
```

> **Note:** The script checks for file existence only; it does not validate file integrity or size. If you suspect a file was partially downloaded, delete it manually before re-running.

### Weak Signal Variant

Training and validation files come in two signal strength variants:

- **Standard** (default): Files indexed `0000`–`0019`. The injected signal is at full amplitude.
- **Weak signal** (`--weak`): Files indexed `0020`–`0039`. The injected signal is **1/5 the amplitude** of the standard variant, intended for more challenging detection benchmarks.

Use `--weak` to download the weak signal versions instead of the standard ones:

```bash
python download_data.py --weak -s 0
```

The weak signal flag only affects training and validation files. Science files are unaffected.

## Dataset Composition:

The dataset includes 248 files (288 if the weak signal version is included), all in HDF5 format. Dataset composition is specified in `TIDMAD_croissant.json`. The dataset is partitioned into three subsets:

1. Training Dataset: `abra_training_00{##}.h5` where ## varies from 00 to 19
   * Each training `.h5` file has the following format:<br><img width="274" alt="Screenshot 2024-06-02 at 6 49 50 PM" src="https://github.com/aobol/TIDMAD/assets/25975621/0f99b6e6-2f7c-4566-91e8-cc29985f32c2">
2. Validation Dataset: `abra_validation_00{##}.h5` where ## varies from 00 to 19
   * Each validation `.h5` file has the following format:<br><img width="279" alt="Screenshot 2024-06-02 at 6 51 55 PM" src="https://github.com/aobol/TIDMAD/assets/25975621/fe466977-fe9c-46c2-8186-11986ed7a3c0">
3. Science Dataset: `abra_science_0{###}.h5` where ### varies from 000 to 207
   * Each science `.h5` file has the following format:<br><img width="273" alt="Screenshot 2024-06-02 at 6 52 52 PM" src="https://github.com/aobol/TIDMAD/assets/25975621/17f8ae17-6942-4840-a092-a6d268fc2d83">
   * Science files contain no injected fake signal, so only 1 channel is present.

**Caveat:** Due to a hardware condition, the channel0001 and channel0002 time series in a few training and validation files differ in length. This does not affect the sample-to-sample correspondence between the two channels except in the last few time samples. To get around this, we recommend using only the first 2,000,000,000 samples of both channels in all files (i.e. `ch01_time_series = ch01_time_series[:2000000000]`).

## Model Training and Benchmarking:

TIDMAD users can follow the procedure below to reproduce the results in our paper:

1. Run `python download_data.py` to download all datasets.
2. Set up the required environment using `python setup.py install`.
3. Train the 3 deep learning models by running `python train.py -d [directory] -m [model]`.
   * `[directory]` is where all the training files were downloaded to in step 1.
   * `[model]` is the deep learning model to train; choose from `[fcnet/punet/transformer]`.
   * Note: each deep learning model produces 4 files because of the Frequency Splitting discussed in the paper (e.g. `-m fcnet` produces `FCNet_0_4.pth`, `FCNet_4_10.pth`, `FCNet_10_15.pth`, and `FCNet_15_20.pth`).
   * Alternatively, users can download our trained models from Google Drive: [Link](https://drive.google.com/drive/folders/16ORX1b2zo1_lOYYAcRBgddBuYImj0Bxs?usp=sharing)
4. Run `python inference.py -d [directory] -m [model]` to produce denoised time series files in `.h5` format.
   * `[directory]` is where all the validation files were downloaded to in step 1.
   * For each validation file `abra_validation_00{##}.h5`, a denoised validation file `abra_validation_denoised_[model]_00{##}.h5` will be generated. Please note that the denoised validation file is also saved in `[directory]`.
   * `[model]` is the denoising algorithm to run inference with; choose from `[mavg/savgol/fcnet/punet/transformer]`. If you choose one of `[fcnet/punet/transformer]`, the trained model files in `.pth` format must be present in the current working directory. These `.pth` files can be generated by following step 3 or downloaded directly.
5. Run `python denoising_score.py -d [directory] -m [model]`.
   * `[directory]` is where all the validation files were downloaded to in step 1.
   * `[model]` is the denoising algorithm used in step 4; choose from `[none/mavg/savgol/fcnet/punet/transformer]`. `none` calculates the denoising score for the raw SQUID time series without any denoising. If any model other than `none` is chosen, make sure that the corresponding `abra_validation_denoised_[model]_00{##}.h5` files were successfully produced in step 4.
   * `python denoising_score.py` has additional arguments, including:
     * `-c`, `--coarse`: calculate the coarse denoising score instead of the fine denoising score
     * `-p`, `--parallel`: parallelize the running of the score calculation script
     * `-w`, `--num_workers`: maximum number of workers allowed for the parallel processing
6. Run `python process_science_data.py -d [directory] -m [model]` to generate the denoised time series for the 208 science files provided.
   * `[directory]` is the directory of the input files. The file names should match the downloaded, raw science data files. Do not edit science file names.
   * `[model]` is one of the three deep learning models developed: `punet`, `fcnet`, or `transformer`.
   * **Note**: the corresponding `.pth` files must be in the same directory as the `process_science_data.py` program.
   * The denoised science data will be output with the following file names:
     * `denoised_[PUNet/FCNet/Transformer]_[0/4/10/15]_[4/10/15/20]ph_file_[0000-0207].h5`
7. Run `python brazilband.py [path] [files] [output file name (no extension)] --level coarse --v` to generate the dark matter limit in `[output].csv` and the brazil band plot in `[output].png`.
   * `brazilband.py` has arguments including:
     * `[path]` is the path to all of the input files listed in the `.txt` file.
     * `[files]` is either the `.txt` file containing all of the `.h5` file names or, if PSD averaging has already been done, the `.npy` file containing `[freq, pwr]`.
     * `[output]` is the file name for the output brazil band plot. The plot will be saved as `[output].png` and the data as `[output].csv`. If the input file type is `.txt`, the average PSD will be saved in `[output].npy`.
     * `--level` is either 'coarse' or 'fine', for coarse or full axion mass points. The default is 'coarse'.
     * `--v` enables verbose logger and error messages.
8. Run `AxionPhoton_TIDMAD.ipynb` to produce the global Dark Matter Limit plot. This Jupyter notebook uses the plotting tools from [AxionLimits](https://github.com/cajohare/AxionLimits) along with a plotting function specific to this project, `PlotFuncs_TIDMAD.py`.
   * **Note**: the denoised `.csv` files generated by step 7 must be placed in the `limit_data` folder. The variable `denoised_ABRA_limit_file` in the Jupyter notebook must be changed accordingly.
   * You must have `AxionPhoton_TIDMAD.ipynb`, `PlotFuncs_TIDMAD.py`, and `limit_data` (along with all of its contents) in the same directory for this notebook to run.

## Contact

Please contact J. T. Fry with any questions about the code and data: [jtfry@mit.edu](mailto:jtfry@mit.edu).

## Paper Abstract

Dark matter makes up approximately 85% of total matter in our universe, yet it has never been directly observed in any laboratory on Earth as of today. The origin of dark matter is one of the most important questions in contemporary physics, and a convincing detection of dark matter would be a Nobel-Prize-level breakthrough in fundamental science. The ABRACADABRA experiment was meticulously designed to search for dark matter. Although it has not yet made a discovery, ABRACADABRA has produced several dark matter search results widely endorsed by the physics community. The experiment generates ultra-long time-series data at a rate of 10 million samples per second, where the dark matter signal, if it exists, would manifest itself as a sinusoidal oscillation mode within the ultra-long time series. In this paper, we present a comprehensive data release from the ABRACADABRA experiment including three key components: an ultra-long time series dataset divided into training, validation, and dark matter search subsets; a carefully designed denoising score for direct model benchmarking; and a complete analysis framework which yields a community-standard dark matter search result suitable for publication in a physics journal. Our data release enables core AI algorithms to directly produce physics results, thereby advancing fundamental science.

---
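The file naming and weak-signal indexing described in this README can be made concrete with a short sketch. It is an illustration only: the helpers `expected_files` and `missing_files` are hypothetical and not part of the repository, and the counts and index ranges are the ones stated in this README (20 training, 20 validation, 208 science files; weak-signal files indexed `0020`–`0039`). Like the `--skip_downloaded` option, the check looks only at file existence, not integrity.

```python
from pathlib import Path
from typing import List

# Counts and index ranges as stated in this README.
N_TRAIN = 20
N_VALIDATION = 20
N_SCIENCE = 208
WEAK_OFFSET = 20  # weak-signal training/validation files are indexed 0020-0039


def expected_files(weak: bool = False) -> List[str]:
    """Hypothetical helper: file names one download run should produce."""
    start = WEAK_OFFSET if weak else 0
    names = [f"abra_training_{i:04d}.h5" for i in range(start, start + N_TRAIN)]
    names += [f"abra_validation_{i:04d}.h5" for i in range(start, start + N_VALIDATION)]
    # Science files are unaffected by the weak-signal variant.
    names += [f"abra_science_{i:04d}.h5" for i in range(N_SCIENCE)]
    return names


def missing_files(directory: str, weak: bool = False) -> List[str]:
    """Hypothetical helper: expected files not present in `directory`.

    Checks existence only; a partially downloaded file still counts as present.
    """
    root = Path(directory)
    return [name for name in expected_files(weak) if not (root / name).exists()]


standard = expected_files()
both_variants = set(standard) | set(expected_files(weak=True))
print(len(standard), len(both_variants))  # 248 288
```

The two totals match the 248 and 288 files quoted under Dataset Composition, because the 208 science files are shared between the standard and weak-signal variants.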
Provider:
jessicafry