LeoChen085/SlipDataset

Name: LeoChen085/SlipDataset
Creator: LeoChen085
Published: 2026-03-12 15:09:55
License: MIT

Hugging Face · Updated 2026-03-12 · Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/LeoChen085/SlipDataset


Overview:

---
dataset_info:
license: mit
task_categories:
- sensor-language alignment
- sensor-based classification
language:
- en
tags:
- time-series
- sensor
- multimodal
- captioning
- contrastive-learning
size_categories:
- 100K<n<1M
---

# SlipDataset

This repository contains the data released alongside [SLIP](https://github.com/yuc0805/SLIP) (Sensor Language-Informed Pretraining), a framework for learning language-aligned sensor representations that generalize across diverse sensor setups. It includes two components: **(1)** the pretraining corpus of 600K+ paired sensor time-series and hierarchical text captions, and **(2)** 11 downstream evaluation datasets spanning four sensor domains.

## Dataset Details

### Dataset Description

**SLIP** aligns multivariate time-series sensor data with natural language via contrastive alignment and captioning objectives, enabling sensor classification, zero-shot retrieval, question answering, and captioning from a single pretrained checkpoint.

This repository provides:

- **Pretraining data** (`data/`): Over 600K sensor–caption pairs covering approximately one billion time points. Captions are generated at three levels of granularity (statistical, structural, and semantic) following the SensorLM recipe, with paraphrases generated by Qwen2-7B-IT to reduce template repetition. The corpus spans health, web, nature, energy, IoT, environment, and transport domains, with sampling rates from seconds to months.
- **Evaluation data** (individual folders): 11 downstream sensor classification datasets used for linear-probing and zero-shot retrieval evaluation, covering activity recognition, clinical diagnosis, stress prediction, and urban sensing.

- **Curated by:** Yuliang Chen, Arvind Pillai, Yu Yvonne Wu, Tess Z. Griffin, Lisa Marsch, Michael V. Heinz, Nicholas C. Jacobson, Andrew Campbell (Dartmouth College)
- **Language(s):** English
- **License:** MIT

### Dataset Sources

- **Code:** [https://github.com/yuc0805/SLIP](https://github.com/yuc0805/SLIP)
- **Paper:** *Learning Transferable Sensor Models via Language-Informed Pretraining*

## Repository Structure

```
SlipDataset/
├── data/                # Pretraining corpus (~600K sensor–caption pairs, parquet shards)
│   └── train-*.parquet
├── wisdm/               # Activity Recognition: Accelerometer (18 classes, 30Hz)
├── uci_har/             # Activity Recognition: Accelerometer + Gyroscope (7 classes, 50Hz)
├── PPG_CVA/             # Clinical Diagnosis: PPG for Stroke detection (2 classes, 65Hz)
├── PPG_DM/              # Clinical Diagnosis: PPG for Diabetes detection (2 classes, 65Hz)
├── PPG_HTN/             # Clinical Diagnosis: PPG for Hypertension (4 classes, 65Hz)
├── sleepEDF/            # Clinical Diagnosis: EEG Sleep staging (5 classes, 100Hz)
├── ptbxl/               # Clinical Diagnosis: 12-lead ECG Heart conditions (5 classes, 100Hz)
├── wesad/               # Stress Prediction: Multimodal chest+wrist sensors (3 classes, 700Hz)
├── studentlife/         # Stress Prediction: Phone + wearable sensors (3 classes, minute-level)
├── AsphaltObstacles/    # Urban Sensing: Acceleration magnitude (4 classes, 100Hz)
├── Beijing_AQI/         # Urban Sensing: Environmental sensors (4 classes, hourly)
├── meta.csv             # Dataset metadata
├── dataset_info.json    # HuggingFace dataset configuration
└── state.json           # Dataset state
```

## Pretraining Data

The pretraining corpus is assembled from community-released time-series corpora (UTSD, NormWear, Capture-24) augmented with synthetic pairs from ChatTS. Multivariate and univariate examples are sampled at a 2:1 ratio.
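The card does not say how the 2:1 multivariate-to-univariate ratio is enforced. One plausible reading is weighted random selection between the two kinds of example; the sketch below illustrates only that reading. The function name `sample_kinds` and the seed are hypothetical and are not taken from the SLIP code base.

```python
import random

def sample_kinds(n, ratio=(2, 1), seed=0):
    """Draw n labels with odds `ratio` for (multivariate, univariate).

    Assumption: the 2:1 ratio is applied as sampling weights. The dataset
    card states the ratio but not the mechanism.
    """
    rng = random.Random(seed)
    return rng.choices(["multivariate", "univariate"], weights=ratio, k=n)

kinds = sample_kinds(30_000)
share = kinds.count("multivariate") / len(kinds)
print(f"multivariate share: {share:.3f}")  # should land near 2/3
```

With weights of 2 and 1, about two thirds of the draws are multivariate, which is what a 2:1 ratio implies.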
**Domain distribution:**

| Domain | Samples | Percentage | Sources |
|--------|---------|------------|---------|
| Health | 237,050 | 52.82% | UTSD, NormWear, Capture-24 |
| Synthetic | 105,085 | 23.41% | ChatTS |
| Web | 67,865 | 15.12% | UTSD |
| Nature | 32,358 | 7.21% | UTSD |
| Energy | 2,743 | 0.61% | UTSD |
| IoT | 2,611 | 0.58% | UTSD |
| Environment | 1,082 | 0.24% | UTSD |
| Transport | 28 | 0.01% | UTSD |

Hierarchical captions were automatically generated at three levels:

1. **Statistical:** Summary statistics (mean, variance, trends).
2. **Structural:** Temporal motifs, periodicity, and shape descriptors.
3. **Semantic:** High-level behavioral or domain-specific interpretations.

Qwen2-7B-IT generated three paraphrases per caption; one is randomly sampled during training.

## Evaluation Data

The 11 downstream datasets span four task domains and are used for linear-probing classification and zero-shot sensor–text retrieval.

| Dataset | Folder | Domain | Sensor | Samples (Train/Test) | Sampling rate | Classes |
|---------|--------|--------|--------|---------------------|-------|---------|
| WISDM | `wisdm/` | Activity Recognition | Accelerometer X, Y, Z | 22,396 / 5,600 | 30Hz | 18 |
| UCI-HAR | `uci_har/` | Activity Recognition | Accelerometer + Gyroscope (6-ch) | 1,847 / 793 | 50Hz | 7 |
| Stroke (PPG-CVA) | `PPG_CVA/` | Clinical Diagnosis | PPG | 525 / 132 | 65Hz | 2 |
| Diabetes (PPG-DM) | `PPG_DM/` | Clinical Diagnosis | PPG | 522 / 135 | 65Hz | 2 |
| Hypertension (PPG-HTN) | `PPG_HTN/` | Clinical Diagnosis | PPG | 525 / 132 | 65Hz | 4 |
| Sleep Stage | `sleepEDF/` | Clinical Diagnosis | EEG (2-ch) | 33,599 / 8,709 | 100Hz | 5 |
| Heart Condition (PTB-XL) | `ptbxl/` | Clinical Diagnosis | 12-lead ECG | 11,320 / 1,650 | 100Hz | 5 |
| WESAD | `wesad/` | Stress Prediction | Chest + Wrist sensors (13-ch) | 882 / 223 | 700Hz | 3 |
| StudentLife | `studentlife/` | Stress Prediction | Phone + wearable (10-ch) | 1,074 / 109 | Minute | 3 |
| AsphaltObstacles | `AsphaltObstacles/` | Urban Sensing | Acceleration magnitude | 390 / 391 | 100Hz | 4 |
| Beijing AQI | `Beijing_AQI/` | Urban Sensing | Environmental sensors (7-ch) | 1,168 / 293 | Hourly | 4 |

## Uses

### Direct Use

- **Pretraining** sensor–language models via contrastive alignment and/or captioning objectives using the `data/` folder.
- **Evaluating** pretrained sensor encoders on linear-probing classification and zero-shot retrieval using the 11 evaluation datasets.
- **Research** on cross-modal alignment between time-series sensor data and natural language.

### Out-of-Scope Use

- This dataset should **not** be used for clinical decision-making. It is intended for research purposes only.
- The dataset is not designed for forecasting tasks; it targets sensor–language alignment for classification, retrieval, question answering, and captioning.
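Zero-shot sensor–text retrieval, named under Direct Use, is commonly done by embedding the sensor window and each candidate class description, then picking the description with the highest cosine similarity. The sketch below shows that generic procedure. It is not taken from the SLIP code base. The 3-dimensional vectors and class names are made up and stand in for encoder outputs, and `zero_shot_label` is a hypothetical helper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def zero_shot_label(sensor_vec, text_vecs):
    """Return the label whose text embedding is most similar to sensor_vec."""
    return max(text_vecs, key=lambda label: cosine(sensor_vec, text_vecs[label]))

# Made-up embeddings; a real pipeline would obtain these from the
# pretrained sensor and text encoders.
text_vecs = {
    "walking": [0.9, 0.1, 0.0],
    "sitting": [0.0, 1.0, 0.1],
    "jogging": [0.1, 0.0, 1.0],
}
sensor_vec = [0.8, 0.3, 0.1]

predicted = zero_shot_label(sensor_vec, text_vecs)
print(predicted)  # "walking" has the highest similarity for these toy vectors
```

Because the score is a cosine, only the direction of each embedding matters, so no class-specific training is needed at evaluation time.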
## Dataset Creation

### Curation Rationale

Pretraining sensor–language models requires large-scale paired time-series and text data, which is far less available than in vision–language settings. This dataset addresses this gap by assembling diverse time-series sources and generating hierarchical captions to provide multi-level supervision. The evaluation datasets are curated to cover heterogeneous sensor configurations (varying channel counts, sampling rates, and sequence lengths) across four distinct application domains.

### Source Data

**Pretraining sources:**

- **UTSD:** Community-released time-series corpus spanning energy, environment, IoT, nature, transport, and web domains.
- **NormWear:** Health-related wearable sensor data.
- **Capture-24:** Accelerometer data from wrist-worn devices.
- **ChatTS:** Synthetic time-series–text pairs for pattern diversity augmentation.

**Evaluation sources:**

- **WISDM:** Wireless Sensor Data Mining dataset for activity recognition.
- **UCI-HAR:** UCI Human Activity Recognition using smartphones dataset.
- **PPG-CVA / PPG-DM / PPG-HTN:** PPG-based clinical datasets from the PPG-BP Chinese dataset.
- **Sleep-EDF:** EEG sleep staging dataset.
- **PTB-XL:** Large-scale 12-lead ECG dataset.
- **WESAD:** Multimodal dataset for wearable stress and affect detection.
- **StudentLife:** Smartphone and wearable sensing dataset for student stress.
- **AsphaltObstacles:** Road surface obstacle detection from vehicle accelerometers.
- **Beijing AQI:** Beijing multi-site air quality dataset.

### Personal and Sensitive Information

Health-domain data (PPG, ECG, EEG) originates from publicly available, de-identified datasets. No personally identifiable information is included.

## Bias, Risks, and Limitations

- **Domain imbalance:** Health data dominates the pretraining corpus (52.82%), while domains like transport (0.01%) and environment (0.24%) are underrepresented.
- **Synthetic data:** 23.41% of the pretraining data is synthetically generated (ChatTS), which may introduce distribution shifts relative to real-world sensor data.
- **Caption quality:** Captions are automatically generated and paraphrased by language models, and may contain errors or hallucinations.
- **Clinical data:** Health-related subsets should not be used for medical diagnosis without proper clinical validation.
- **Language:** All text is in English.

### Recommendations

Users should be aware of the domain imbalance when interpreting downstream results. Careful evaluation is necessary when deploying models pretrained on this data in high-stakes settings to ensure robustness and faithful grounding in sensor evidence.

## Citation

**BibTeX:**

```bibtex
@article{chen2025slip,
  title={Learning Transferable Sensor Models via Language-Informed Pretraining},
  author={Chen, Yuliang and Pillai, Arvind and Wu, Yu Yvonne and Griffin, Tess Z. and Marsch, Lisa and Heinz, Michael V. and Jacobson, Nicholas C. and Campbell, Andrew},
  year={2025}
}
```

## Dataset Card Contact

Yuliang Chen (yuliang.chen.gr@dartmouth.edu)
