DeepLearnPhysics/PILArNet-M

Name: DeepLearnPhysics/PILArNet-M
Creator: DeepLearnPhysics
Published: 2025-12-02 07:29:48
License: apache-2.0

Hugging Face · Updated 2025-12-02 · Indexed 2025-12-20

Download link:

https://hf-mirror.com/datasets/DeepLearnPhysics/PILArNet-M

Description:

---
license: apache-2.0
task_categories:
- image-segmentation
- object-detection
tags:
- particle
- physics
- 3D
- simulation
- lartpc
- pointcloud
pretty_name: >-
  Public Dataset for Particle Imaging Liquid Argon Detectors in High Energy
  Physics - Medium
size_categories:
- 1M<n<10M
---

# Public Dataset for Particle Imaging Liquid Argon Detectors in High Energy Physics

We provide the 168 GB **PILArNet-Medium** dataset, a continuation of the [PILArNet](https://arxiv.org/abs/2006.01993) dataset, consisting of ~1.2 million events from liquid argon time projection chambers ([LArTPCs](https://www.symmetrymagazine.org/article/october-2012/time-projection-chambers-a-milestone-in-particle-detector-technology?language_content_entity=und)). Each event contains 3D ionization trajectories of particles as they traverse the detector.

Typical downstream tasks include:

- Semantic segmentation of voxels into particle-like categories
- Particle-level (instance-level) segmentation and identification
- Interaction-level grouping of particles that belong to the same interaction

## Directory structure

The dataset is stored in HDF5 format and organized as:

```plaintext
/path/to/dataset/
    /train/
        /generic_v2_196200_v2.h5
        /generic_v2_153600_v1.h5
        ...
    /val/
        /generic_v2_66800_v2.h5
        ...
    /test/
        /generic_v2_50000_v1.h5
        ...
```

The number in each file name, between the `generic_v2_` prefix and the trailing `_v1` or `_v2` suffix, is the number of events contained in that file.

Dataset split:

* **Train:** 1,082,400 events
* **Validation:** 66,800 events
* **Test:** 50,000 events

## Data format

Each HDF5 file contains three main datasets: `point`, `cluster`, and `cluster_extra`. Entries are stored as variable-length 1D arrays and should be reshaped event by event.

### `point` dataset

Each entry of `point` corresponds to a single event and encodes all spacepoints for that event in a flattened array. After reshaping, each row corresponds to a point:

Shape per event: `(N, 8)`

Columns (per point):

1. `x` coordinate (integer voxel index, 0 to 768)
2. `y` coordinate (integer voxel index, 0 to 768)
3. `z` coordinate (integer voxel index, 0 to 768)
4. Voxel value (in MeV)
5. Energy deposit `dE` (in MeV)
6. Absolute time in nanoseconds
7. Number of electrons
8. `dx` in millimeters

Example:

```python
import h5py

EVENT_IDX = 0

with h5py.File("/path/to/dataset/train/generic_v2_196200_v2.h5", "r") as h5f:
    point_flat = h5f["point"][EVENT_IDX]  # flattened 1D array for one event

points = point_flat.reshape(-1, 8)  # (N, 8)
```

### `cluster` dataset

Each entry of `cluster` corresponds to the set of clusters for a single event. After reshaping, each row corresponds to a cluster:

Shape per event: `(M, 6)`

Columns (per cluster):

1. Number of points in the cluster
2. Fragment ID
3. Group ID
4. Interaction ID
5. Semantic type (class ID, see below)
6. Particle ID (PID, see below)

Example:

```python
# Reuses h5py and EVENT_IDX from the `point` example above
with h5py.File("/path/to/dataset/train/generic_v2_196200_v2.h5", "r") as h5f:
    cluster_flat = h5f["cluster"][EVENT_IDX]

clusters = cluster_flat.reshape(-1, 6)  # (M, 6)
```

### `cluster_extra` dataset

Each entry of `cluster_extra` provides additional per-cluster information for a single event. After reshaping, each row corresponds to a cluster:

Shape per event: `(M, 5)`

Columns (per cluster):

1. Particle mass (from PDG)
2. Particle momentum (magnitude)
3. Particle vertex `x` coordinate
4. Particle vertex `y` coordinate
5. Particle vertex `z` coordinate

Example:

```python
# Reuses h5py and EVENT_IDX from the `point` example above
with h5py.File("/path/to/dataset/train/generic_v2_196200_v2.h5", "r") as h5f:
    cluster_extra_flat = h5f["cluster_extra"][EVENT_IDX]

cluster_extra = cluster_extra_flat.reshape(-1, 5)  # (M, 5)
```

### Cluster and point ordering

Points in the `point` array are ordered by the cluster they belong to.
For a given event:

* Let `clusters[i, 0]` be the number of points in cluster `i`
* Then points for cluster `0` occupy the first `clusters[0, 0]` rows in `points`
* Points for cluster `1` occupy the next `clusters[1, 0]` rows, and so on

This ordering allows you to map cluster-level attributes (`cluster` and `cluster_extra`) back to the underlying points.

### Removing low energy deposits (LED)

By construction, the first cluster in each event (`cluster[0]`) corresponds to amorphous low energy deposits or blips: these are treated as uncountable "stuff" and labeled as LED.

To remove LED points from an event:

```python
import h5py

EVENT_IDX = 0

with h5py.File("/path/to/dataset/train/generic_v2_196200_v2.h5", "r") as h5f:
    point_flat = h5f["point"][EVENT_IDX]
    cluster_flat = h5f["cluster"][EVENT_IDX]

points = point_flat.reshape(-1, 8)      # (N, 8)
clusters = cluster_flat.reshape(-1, 6)  # (M, 6)

# Number of points belonging to LED (cluster 0).
# Cast to int so the value is a valid slice index even if stored as a float.
n_led_points = int(clusters[0, 0])

# Drop LED points
points_no_led = points[n_led_points:]  # points belonging to non-LED clusters
```

LED clusters also have special values in the ID fields, described in the label schema below.

## Label schema

This section summarizes the label conventions used in the dataset for semantic segmentation, particle identification, and instance-level or interaction-level grouping.

### Semantic segmentation classes

Semantic labels are stored in `cluster[:, 4]`. The mapping is:

| Semantic ID | Class name |
| ----------- | ---------- |
| 0           | Shower     |
| 1           | Track      |
| 2           | Michel     |
| 3           | Delta      |
| 4           | LED        |

Here, LED denotes low energy deposits or amorphous "stuff" that is not counted as a particle instance.

To perform semantic segmentation at the point level, use the cluster ordering:

1. Expand cluster semantic labels to per-point labels according to the point counts per cluster.
2. Optionally remove LED points (Semantic ID 4) as shown above.
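The expansion in step 1 can be sketched with `numpy.repeat`. This is a minimal sketch on a small synthetic `clusters` array, not code from the dataset authors: the helper name `expand_to_points` is hypothetical, and the integer cast is an assumption made in case the point counts are stored as floating-point values.

```python
import numpy as np

# Synthetic stand-in for one event's reshaped `cluster` array, shape (M, 6).
# Columns: n_points, fragment ID, group ID, interaction ID, semantic type, PID.
clusters = np.array([
    [3, -1, -1, -1, 4, 6],  # cluster 0: LED
    [2,  0,  0,  0, 1, 2],  # a track (muon)
    [4,  1,  1,  0, 0, 1],  # a shower (electron)
])


def expand_to_points(clusters, column):
    """Repeat one per-cluster column so that there is one value per point."""
    # Assumption: counts may be stored as floats, so cast before repeating.
    counts = clusters[:, 0].astype(np.int64)
    return np.repeat(clusters[:, column], counts)


point_semantic = expand_to_points(clusters, 4)  # per-point semantic ID
point_group = expand_to_points(clusters, 2)     # per-point group ID

# Boolean mask selecting non-LED points (semantic ID 4 is LED).
keep = point_semantic != 4
```

With real data, `point_semantic` has one entry per row of `points`, so `points[keep]` selects the non-LED points. Checking that `len(point_semantic) == points.shape[0]` is a cheap way to confirm that the event was reshaped correctly.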
### Particle identification (PID) labels

Particle identification uses the Particle ID field in `cluster[:, 5]`. The mapping is:

| ID  | Particle type                      |
| --- | ---------------------------------- |
| 0   | Photon                             |
| 1   | Electron                           |
| 2   | Muon                               |
| 3   | Pion                               |
| 4   | Proton                             |
| 5   | Kaon (not present in this dataset) |
| 6   | None (LED)                         |

LED clusters that correspond to low energy deposits use `PID = 6`. These clusters are typically also `Semantic ID = 4` and treated as "stuff".

### Instance and interaction IDs

The `cluster` dataset contains several integer IDs to support different grouping granularities:

* **Fragment ID** (`cluster[:, 1]`): Identifies contiguous fragments of a particle. Multiple fragments may belong to the same particle.
* **Group ID** (`cluster[:, 2]`): Identifies particle-level instances. All clusters with the same group ID correspond to the same physical particle.
  * Use `Group ID` for particle instance segmentation or particle-level identification tasks.
* **Interaction ID** (`cluster[:, 3]`): Identifies interaction-level groups. All particles with the same interaction ID belong to the same interaction (for example a neutrino interaction and its secondaries).
  * Use `Interaction ID` for interaction-level segmentation or classification.

For LED clusters, all three IDs

* Fragment ID
* Group ID
* Interaction ID

are set to `-1`. This differentiates LED clusters from genuine particle or interaction instances.

## Reconstruction tasks

Typical uses of this dataset include:

* **Semantic segmentation**: Predict voxelwise semantic labels (shower, track, Michel, delta, LED) using the `Semantic type` field.
* **Particle-level segmentation and PID**:
  * Use `Group ID` to define particle instances.
  * Use `PID` to assign particle type (photon, electron, muon, pion, proton, None).
* **Interaction-level reconstruction**:
  * Use `Interaction ID` to group particles belonging to the same physics interaction.
  * Use `cluster_extra` for per-particle momentum and vertex information.

## Getting started

A [Colab notebook](https://colab.research.google.com/drive/1x8WatdJa5D7Fxd3sLX5XSJiMkT_sG_im) is provided for a hands-on introduction to loading and inspecting the dataset.

## Citation

```bibtex
@misc{young2025particletrajectoryrepresentationlearning,
  title={Particle Trajectory Representation Learning with Masked Point Modeling},
  author={Sam Young and Yeon-jae Jwa and Kazuhiro Terao},
  year={2025},
  eprint={2502.02558},
  archivePrefix={arXiv},
  primaryClass={hep-ex},
  doi={10.48550/arXiv.2502.02558},
  url={https://arxiv.org/abs/2502.02558},
}
```

Provided by:

DeepLearnPhysics
