tankalapavankalyan/exp01-eeg-to-text-sentences

Name: tankalapavankalyan/exp01-eeg-to-text-sentences
Creator: tankalapavankalyan
Published: 2026-04-24 13:37:34
License: No description provided

Hugging Face · Updated 2026-04-24 · Indexed 2026-05-03

Download link:

https://hf-mirror.com/datasets/tankalapavankalyan/exp01-eeg-to-text-sentences

Resource description:

---
license: other
language:
- en
task_categories:
- text-generation
- text2text-generation
tags:
- eeg
- brain-signals
- neuroscience
- reading
- brain-to-text
- eeg-to-text
- electrophysiology
size_categories:
- 10K<n<100K
pretty_name: "Exp01 — Sentence-level EEG-to-text training data (unified)"
---

# Exp01 — Sentence-level EEG-to-text training data (unified)

This is a **private** working corpus for experiment 1 (fine-tuning EEG / time-series foundation models on EEG-to-English-text). It bundles several public EEG-while-reading datasets into a single, raw-lossless parquet schema where one row = one sentence read by one participant.

> ⚠️ **License:** Per-source licenses are preserved verbatim in each row's `license`
> column and `source_url`. Do not re-distribute publicly without re-checking the upstream
> terms; the `other` SPDX tag here is only because this repo aggregates several distinct
> licenses (CC0-1.0, CC-BY-4.0, MIT, Apache-2.0).

## Source datasets and provenance

| `dataset` value(s) | Source | License | Paradigm | Sampling rate | Channels |
| --- | --- | --- | --- | --- | --- |
| `zuco_v1_sr` | ZuCo 1.0 task1 (Hollenstein et al.), re-hosted by `pyinglie/EEG-To-Text_Datasets` | CC BY 4.0 | Natural reading + sentiment classification | 500 Hz | ~105 (EGI HydroCel GSN) |
| `zuco_v1_nr` | ZuCo 1.0 task2 | CC BY 4.0 | Natural reading | 500 Hz | ~105 |
| `zuco_v1_tsr` | ZuCo 1.0 task3 | CC BY 4.0 | Task-specific reading (relation extraction) | 500 Hz | ~105 |
| `zuco_v2_nr` | ZuCo 2.0 task1 | CC BY 4.0 | Natural reading | 500 Hz | ~105 |
| `zuco_v2_tsr` | ZuCo 2.0 task2 | CC BY 4.0 | Task-specific reading (relation extraction) | 500 Hz | ~105 |
| `derco_preprocessed` | DERCo (Nguyen et al., Sci. Data 2024, OSF rkqbu) | CC BY 4.0 | RSVP word-by-word + RSVP-with-flanker | 1000 Hz | 32 (10-20, ActiCHamp) |
| `emmt` | EMMT reading-only via `sajjad5221/eeg2text-emmt-dataset-unique-sentences` | MIT (HF metadata) / CC BY 4.0 (original EMMT) | Natural reading + eye-tracking | 256 Hz | 4 (Muse: TP9/AF7/AF8/TP10) |
| `eeg_sem_relev` | `Quoron/EEG-semantic-text-relevance` (Gryshchuk et al., SIGIR 2025; OSF xh3g5) | Apache-2.0 (HF metadata) | RSVP reading + topic-relevance judgement | ~2858 Hz (2001 samples / 0.7s window) | 32 |

Skipped (redundant): `pyinglie/EEG-To-Text_Preprocessed` (only adds derived BART pickles), `sajjad5221/eeg2text-emmt-dataset` non-unique variant (the unique-sentences variant covers the same EEG with richer gaze metadata).

## Unified schema

Every row is a single sentence read by a single participant. EEG is stored at the **native** sampling rate, with the **native** channel count for that source: no resampling, no channel selection, no rereferencing.

| Column | Type | Notes |
| --- | --- | --- |
| `dataset` | string (required) | One of the values listed above |
| `source_url` | string (required) | Upstream URL |
| `license` | string (required) | Verbatim license text for this row's source |
| `paradigm` | string (required) | e.g. `natural_reading_sentiment_classification`, `rsvp_word_by_word` |
| `language` | string (required) | Always `en` |
| `participant_id` | string (required) | Anonymised subject id |
| `session_id` | string (nullable) | Source-specific (e.g. DERCo `article_3`) |
| `task_id` | string (required) | Source-specific task name |
| `trial_index` | int32 (nullable) | Order within the source file |
| `sentence_id` | string (required) | Globally unique, prefixed by `dataset` |
| `sentence_text` | string (required) | The sentence as displayed |
| `sentence_text_normalized` | string (nullable) | Lowercased / stripped |
| `words` | `list<string>` (required) | Words in order |
| `num_words` | int32 (required) | Length of `words` |
| `num_channels` | int32 (required) | Channel count for this row's EEG |
| `channel_names` | `list<string>` (required) | Original electrode labels (or `chNN` if not available) |
| `channel_units` | string (nullable) | Free text |
| `sampling_rate_hz` | float64 (required) | Native sampling rate |
| `reference_scheme` | string (nullable) | e.g. `Cz_vertex`, `common_average`, `Muse-headband-internal` |
| `preprocessing` | string (required) | Source-reported preprocessing summary |
| `word_eeg_segments` | `list<list<list<float32>>>` (nullable) | `[words][channels][time]`: fixation- or trial-locked per-word EEG (variable time per word). |
| `sentence_eeg` | `list<list<float32>>` (nullable) | `[channels][time]`: full continuous-sentence EEG (only some sources expose this; e.g. ZuCo `rawData`). |
| `word_onsets_sec` | `list<float32>` (nullable) | Per-word onset relative to sentence start, when known |
| `word_durations_sec` | `list<float32>` (nullable) | Per-word epoch length |
| `word_band_features` | `list<list<float32>>` (nullable) | `[words][feature_block × channels]`: pre-computed band-power features (ZuCo only). Order matches `band_feature_names`. |
| `band_feature_names` | `list<string>` (nullable) | e.g. `["FFD_t1", "FFD_t2", ..., "SFD_g2"]` for ZuCo |
| `fixation_duration_ms` | `list<float32>` (nullable) | Per-word total fixation time (EMMT) |
| `fixation_count` | `list<int32>` (nullable) | Per-word number of fixations |
| `first_fixation_duration_ms` | `list<float32>` (nullable) | ZuCo `FFD` |
| `gaze_duration_ms` | `list<float32>` (nullable) | ZuCo `GD` |
| `go_past_time_ms` | `list<float32>` (nullable) | ZuCo `GPT` |
| `total_reading_time_ms` | `list<float32>` (nullable) | ZuCo `TRT` |
| `saccade_amplitude_px` | `list<float32>` (nullable) | EMMT |
| `regression_count` | `list<int32>` (nullable) | EMMT |
| `blink_count` | `list<int32>` (nullable) | EMMT |
| `topic` | string (nullable) | Document topic (Quoron); article id (DERCo) |
| `selected_topic` | string (nullable) | Quoron only |
| `semantic_relevance` | `list<int32>` (nullable) | Per-word 0/1 relevance judgement (Quoron) |
| `interestingness` | int32 (nullable) | Quoron |
| `pre_knowledge` | int32 (nullable) | Quoron |
| `cloze_probability` | `list<float32>` (nullable) | DERCo (computable from the behavioural-prediction CSVs) |
| `comprehension_label` | string (nullable) | Reserved |
| `extra_json` | string (nullable) | Per-dataset metadata (storage layouts, code mappings, etc.), JSON-encoded |
| `dataset_extra_blob` | binary (nullable) | NPZ-compressed dump of every additional per-row array we did not put in a typed column. **Used by ZuCo only**, where it carries: per-sentence `mean_*` band-power means/diffs/sec; per-word fixation-locked features (`FFD_*`, `GD_*`, `GPT_*`, `TRT_*`, `SFD_*`) for all 8 frequency bands; per-word `rawET` eye-tracking traces, pupil sizes, fixation positions; sentence-level `wordbounds`, `omissionRate`, and answer-period band powers. Nothing in this blob is duplicated from typed columns. |

### EEG storage convention

All `word_eeg_segments` arrays are stored **channels-first** (`[channels][time]`), matching the MNE convention. EMMT's source `[time][channels]` layout is transposed on conversion.
The original layout is recorded in `extra_json.original_eeg_layout`.

### Why some columns are NULL for some sources

Each source dataset only provides a subset of the available metadata. We never invent values. For example:

- ZuCo carries the richest set of per-word band features and eye-tracking metrics.
- EMMT provides per-word EEG + a small set of gaze counts but no band features.
- Quoron provides per-word EEG + topic-relevance labels but no eye-tracking.
- DERCo provides per-word EEG and per-article cloze probabilities (to be filled from the behavioural CSVs in a follow-up).

Use `dataset` to filter rows when training a model that requires a specific feature.

### Dedup decisions

The user's brief was: *"if there is repetition, leave those, otherwise do everything."* The following candidate sources were not added as separate sources because they are strict subsets or derived products of sources we already include:

- `pyinglie/EEG-To-Text_Datasets`: used only as a fast HF mirror of the ZuCo MAT files; the data is the original ZuCo content.
- `pyinglie/EEG-To-Text_Preprocessed`: only contains BART-style pickled features derived from the same MAT files; raw EEG is in `zuco_v*_*` rows.
- `sajjad5221/eeg2text-emmt-dataset` (non-unique): superseded by the `-unique-sentences` variant, which has 23 fewer duplicate samples and adds the per-word gaze feature columns.

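
As a rough illustration of the filtering advice under "Why some columns are NULL for some sources", the sketch below selects rows by `dataset` and by which nullable columns a model needs, then checks the channels-first layout of `word_eeg_segments`. It runs on small synthetic dictionaries rather than the real parquet files; the helper names `usable_rows` and `segment_shapes` and the toy values are hypothetical and are not part of the dataset's tooling.

```python
def usable_rows(rows, datasets, required_columns):
    """Yield rows from the wanted sources whose required columns are all non-null."""
    wanted = set(datasets)
    for row in rows:
        if row["dataset"] not in wanted:
            continue
        # Nullable columns are None when the source does not provide them.
        if any(row.get(col) is None for col in required_columns):
            continue
        yield row


def segment_shapes(row):
    """Return (channels, time) for each word in word_eeg_segments.

    Segments are assumed to be [words][channels][time], so the outer length of
    each segment must equal the row's num_channels.
    """
    shapes = []
    for segment in row["word_eeg_segments"]:
        if len(segment) != row["num_channels"]:
            raise ValueError("segment is not channels-first")
        shapes.append((len(segment), len(segment[0])))
    return shapes


# Synthetic rows with made-up values, only to exercise the helpers.
toy_rows = [
    {
        "dataset": "emmt",
        "num_channels": 4,
        "word_eeg_segments": [[[0.0] * 5] * 4, [[0.0] * 3] * 4],
        "word_band_features": None,  # EMMT has no band features
    },
    {
        "dataset": "zuco_v1_nr",
        "num_channels": 2,
        "word_eeg_segments": [[[0.0] * 6] * 2],
        "word_band_features": [[0.1, 0.2]],
    },
]

selected = list(usable_rows(toy_rows, ["emmt", "zuco_v1_nr"], ["word_band_features"]))
shapes = segment_shapes(toy_rows[0])
```

Here `selected` keeps only the row that actually carries band features, and `shapes` shows that the time length varies per word while the channel count stays fixed. With the real data, the same predicate could be passed to the `filter` method of a loaded `datasets.Dataset`.
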
### Re-using and reproducing

Every row carries `source_url`, `license`, and `dataset_extra_blob` (the blob is populated for ZuCo rows only), so you can always trace back to the upstream and reload the source-specific arrays:

```python
import io

import numpy as np
from datasets import load_dataset

# Load a single participant file rather than the whole repository.
ds = load_dataset(
    "tankalapavankalyan/exp01-eeg-to-text-sentences",
    data_files="data/zuco_v1_sr__sub-ZAB.parquet",
    split="train",
)
row = ds[0]
print(row["sentence_text"])
print("EEG shape (chan, time):", len(row["sentence_eeg"]), len(row["sentence_eeg"][0]))

# The blob is an NPZ archive stored as bytes; wrap it in a file-like object to read it.
extras = np.load(io.BytesIO(row["dataset_extra_blob"]))
print("extras keys:", extras.files[:10])
```

## Conversion pipeline

The full pipeline is a streaming download → convert → push → delete loop with peak local disk usage bounded to ~2 GB at any time. The orchestrator is `experiments/exp01_finetune_foundation_models/scripts/preprocess/stream_pipeline.py`, backed by per-source converters (`convert_emmt.py`, `convert_quoron.py`, `convert_zuco.py`, `convert_derco.py`) and the unified schema definition in `experiments/exp01_finetune_foundation_models/src/data/unified_schema.py`.

Source files are downloaded one at a time from Hugging Face mirrors (or OSF for DERCo), converted in `/tmp`, uploaded to this repo, and deleted before the next file is fetched. Nothing persists in the project tree.

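
The download, convert, upload, delete loop described above might look roughly like the following. This is a minimal sketch of the general pattern, not the contents of `stream_pipeline.py`: the function `stream_convert` and the three callables it takes (`download`, `convert`, `upload`) are hypothetical stand-ins, and the toy callables only write small text files so the loop can be run locally.

```python
import tempfile
from pathlib import Path


def stream_convert(source_files, download, convert, upload):
    """Process one source file at a time so only one file sits on disk at once.

    download(name, dest_dir) -> Path of the fetched raw file
    convert(raw_path, dest_dir) -> Path of the converted output file
    upload(out_path) -> None
    All three callables are hypothetical and supplied by the caller.
    """
    processed = []
    for name in source_files:
        # A fresh scratch directory per file; it is deleted when the `with`
        # block exits, even if one of the steps raises.
        with tempfile.TemporaryDirectory() as scratch:
            scratch_dir = Path(scratch)
            raw_path = download(name, scratch_dir)
            out_path = convert(raw_path, scratch_dir)
            upload(out_path)
            processed.append((name, scratch_dir))
    return processed


# Toy stand-ins that only touch the local scratch directory.
uploaded = []


def fake_download(name, dest_dir):
    path = dest_dir / name
    path.write_text("raw")
    return path


def fake_convert(raw_path, dest_dir):
    out = dest_dir / (raw_path.stem + ".parquet")
    out.write_text(raw_path.read_text().upper())
    return out


def fake_upload(out_path):
    uploaded.append((out_path.name, out_path.read_text()))


processed = stream_convert(["a.mat", "b.mat"], fake_download, fake_convert, fake_upload)
leftover = [d for _, d in processed if d.exists()]
```

After the run, `uploaded` holds one entry per source file and `leftover` is empty, since each scratch directory is removed before the next file is fetched. Peak disk usage is then bounded by the largest single raw file plus its converted output.
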

Provided by:

tankalapavankalyan
