EEG Corpus Manifest is a normalized, queryable index of a large-scale EEG pretraining corpus compiled from multiple open-access neuroscience datasets. The manifest contains metadata only and no EEG signal data: each record points to the archival storage URI of its source dataset (e.g., S3, OpenNeuro, PhysioNet AWS Open Data). Access to the underlying EEG signals is subject to the license of each original dataset.
In terms of scale, version v0.4.1 covers 79,418 hours of EEG data across 186,386 recording files, 43,114 subjects, and 419 indexed datasets, which exceeds existing benchmarks such as REVE and DIVER-1 in data volume.
The dataset is organized as a star schema with five Parquet tables:
1) `datasets` table (419 rows): one row per source dataset, including license, DOI, and paradigm categories.
2) `recordings` table (186,386 rows): the fact table, with one row per EEG file, including subject, session, task, duration, sampling rate, channel count, reference electrode, manufacturer, montage, BIDS entities, and S3 URI.
3) `channels` table (11,530,668 rows): one row per channel per recording, including name, type, unit, and 3D coordinates (when the dataset provides electrodes.tsv).
4) `subjects` table (43,114 rows): demographic information for each normalized subject, including age, gender, clinical status, and handedness.
5) `shards` table (0 rows): a placeholder table for future indexing of pretraining-ready data shards.
The data come from a wide range of sources: anchor datasets (HBN-EEG, PEERS Memory EEG, the TUH-EEG family), 397 datasets automatically acquired via OpenNeuro, and Tier 2 benchmark datasets (PhysioNet's Sleep-EDFx, HMC, CAP, CHB-MIT, Siena, MMI, and EEGMAT, as well as the Mumtaz depression dataset and TUH's TUAB, TUAR, and TUEV, among others). Together they cover a variety of EEG paradigms, including sleep staging, epilepsy detection, motor imagery, and emotion recognition.
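To illustrate how the star schema described above is typically queried, here is a minimal sketch that joins the `recordings` fact table to the `datasets` dimension table and totals recording hours per paradigm. It uses Python's built-in `sqlite3` with a few toy rows purely so the example runs without dependencies. The column names (`dataset_id`, `paradigm`, `duration_s`, `uri`, and so on), the dataset ids, and the `s3://example-bucket/...` URIs are all assumptions for illustration and are not taken from the actual manifest schema.

```python
import sqlite3

# In-memory stand-in for two of the manifest tables.
# All column names and rows below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE datasets (
        dataset_id TEXT PRIMARY KEY,
        license    TEXT,
        paradigm   TEXT
    );
    CREATE TABLE recordings (
        recording_id TEXT PRIMARY KEY,
        dataset_id   TEXT REFERENCES datasets(dataset_id),
        subject_id   TEXT,
        duration_s   REAL,
        n_channels   INTEGER,
        uri          TEXT
    );
    """
)

conn.executemany(
    "INSERT INTO datasets VALUES (?, ?, ?)",
    [
        ("ds_sleep", "CC0", "sleep"),
        ("ds_motor", "CC-BY-4.0", "motor_imagery"),
    ],
)
conn.executemany(
    "INSERT INTO recordings VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("r1", "ds_sleep", "s01", 7200.0, 2, "s3://example-bucket/ds_sleep/r1.edf"),
        ("r2", "ds_sleep", "s02", 3600.0, 2, "s3://example-bucket/ds_sleep/r2.edf"),
        ("r3", "ds_motor", "s01", 1800.0, 64, "s3://example-bucket/ds_motor/r3.edf"),
    ],
)


def hours_by_paradigm(connection):
    """Join the fact table to the dataset dimension and sum hours per paradigm."""
    query = """
        SELECT d.paradigm,
               COUNT(*)                   AS n_recordings,
               SUM(r.duration_s) / 3600.0 AS hours
        FROM recordings AS r
        JOIN datasets   AS d ON d.dataset_id = r.dataset_id
        GROUP BY d.paradigm
        ORDER BY hours DESC
    """
    return connection.execute(query).fetchall()


summary = hours_by_paradigm(conn)
for paradigm, n_recordings, hours in summary:
    print(f"{paradigm}: {n_recordings} recordings, {hours:.2f} h")
```

The same join-and-aggregate pattern should carry over to a SQL engine that reads Parquet directly, once the real column names are substituted.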
The dataset is designed for large-scale EEG pretraining and foundation model research, and supports programmatic access and querying through tools such as the Hugging Face `datasets` library, Pandas, DuckDB, and Polars. The metadata itself is licensed under CC-BY-4.0, but users must comply with each original dataset's license when using the underlying EEG signals, and some datasets (e.g., the TUH family) require a signed data use agreement.
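Because the manifest and the signals carry different licenses, it can help to filter recording URIs by the license of their source dataset before fetching anything. The sketch below shows one way to do that in plain Python. The record types, field names (`license`, `requires_dua`), dataset ids, license labels, and URIs are hypothetical and only illustrate the idea; they do not reflect the manifest's actual columns or any real dataset's terms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetInfo:
    dataset_id: str
    license: str        # hypothetical license label
    requires_dua: bool  # hypothetical flag: data use agreement needed


@dataclass(frozen=True)
class Recording:
    recording_id: str
    dataset_id: str
    uri: str


def downloadable_uris(recordings, datasets, allowed_licenses, signed_duas=frozenset()):
    """Return URIs whose source dataset passes the license and DUA checks."""
    by_id = {d.dataset_id: d for d in datasets}
    uris = []
    for rec in recordings:
        info = by_id.get(rec.dataset_id)
        if info is None:
            continue  # unknown source dataset: skip rather than guess
        if info.license not in allowed_licenses:
            continue
        if info.requires_dua and info.dataset_id not in signed_duas:
            continue
        uris.append(rec.uri)
    return uris


# Toy rows; every id, license label, and URI here is made up.
toy_datasets = [
    DatasetInfo("ds_open", "CC0", requires_dua=False),
    DatasetInfo("ds_clinical", "custom-dua", requires_dua=True),
]
toy_recordings = [
    Recording("r1", "ds_open", "s3://example-bucket/ds_open/r1.edf"),
    Recording("r2", "ds_clinical", "s3://example-bucket/ds_clinical/r2.edf"),
]

allowed = {"CC0", "custom-dua"}
without_dua = downloadable_uris(toy_recordings, toy_datasets, allowed)
with_dua = downloadable_uris(
    toy_recordings, toy_datasets, allowed, signed_duas={"ds_clinical"}
)
print(without_dua)
print(with_dua)
```

Skipping recordings whose dataset is unknown or unapproved is a deliberately conservative default: nothing is fetched unless its license has been explicitly allowed.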