tankalapavankalyan/exp01-eeg-to-text-sentences
Hugging Face · Updated 2026-04-24 · Indexed 2026-05-03
Download link:
https://hf-mirror.com/datasets/tankalapavankalyan/exp01-eeg-to-text-sentences
Description:
---
license: other
language:
- en
task_categories:
- text-generation
- text2text-generation
tags:
- eeg
- brain-signals
- neuroscience
- reading
- brain-to-text
- eeg-to-text
- electrophysiology
size_categories:
- 10K<n<100K
pretty_name: "Exp01 — Sentence-level EEG-to-text training data (unified)"
---
# Exp01 — Sentence-level EEG-to-text training data (unified)
This is a **private** working corpus for experiment 1 (fine-tuning EEG / time-series
foundation models on EEG-to-English-text). It bundles several public EEG-while-reading
datasets into a single, raw-lossless parquet schema where one row = one sentence read by
one participant.
> ⚠️ **License:** Per-source licenses are preserved verbatim in each row's `license`
> column and `source_url`. Do not re-distribute publicly without re-checking the upstream
> terms; the `other` SPDX tag here is only because this repo aggregates several distinct
> licenses (CC0-1.0, CC-BY-4.0, MIT, Apache-2.0).
## Source datasets and provenance
| `dataset` value(s) | Source | License | Paradigm | Sampling rate | Channels |
| --- | --- | --- | --- | --- | --- |
| `zuco_v1_sr` | ZuCo 1.0 task1 (Hollenstein et al.) — re-hosted by `pyinglie/EEG-To-Text_Datasets` | CC BY 4.0 | Natural reading + sentiment classification | 500 Hz | ~105 (EGI HydroCel GSN) |
| `zuco_v1_nr` | ZuCo 1.0 task2 | CC BY 4.0 | Natural reading | 500 Hz | ~105 |
| `zuco_v1_tsr` | ZuCo 1.0 task3 | CC BY 4.0 | Task-specific reading (relation extraction) | 500 Hz | ~105 |
| `zuco_v2_nr` | ZuCo 2.0 task1 | CC BY 4.0 | Natural reading | 500 Hz | ~105 |
| `zuco_v2_tsr` | ZuCo 2.0 task2 | CC BY 4.0 | Task-specific reading (relation extraction) | 500 Hz | ~105 |
| `derco_preprocessed` | DERCo (Nguyen et al., Sci. Data 2024, OSF rkqbu) | CC BY 4.0 | RSVP word-by-word + RSVP-with-flanker | 1000 Hz | 32 (10-20, ActiCHamp) |
| `emmt` | EMMT reading-only via `sajjad5221/eeg2text-emmt-dataset-unique-sentences` | MIT (HF metadata) / CC BY 4.0 (original EMMT) | Natural reading + eye-tracking | 256 Hz | 4 (Muse: TP9/AF7/AF8/TP10) |
| `eeg_sem_relev` | `Quoron/EEG-semantic-text-relevance` (Gryshchuk et al., SIGIR 2025; OSF xh3g5) | Apache-2.0 (HF metadata) | RSVP reading + topic-relevance judgement | ~2858 Hz (2001 samples / 0.7s window) | 32 |
Skipped (redundant): `pyinglie/EEG-To-Text_Preprocessed` (only adds derived BART
pickles), `sajjad5221/eeg2text-emmt-dataset` non-unique variant (the unique-sentences
variant covers the same EEG with richer gaze metadata).
## Unified schema
Every row is a single sentence read by a single participant. EEG is stored at the
**native** sampling rate, with the **native** channel count for that source — no
resampling, no channel selection, no rereferencing.
| Column | Type | Notes |
| --- | --- | --- |
| `dataset` | string (required) | One of the values listed above |
| `source_url` | string (required) | Upstream URL |
| `license` | string (required) | Verbatim license text for this row's source |
| `paradigm` | string (required) | e.g. `natural_reading_sentiment_classification`, `rsvp_word_by_word` |
| `language` | string (required) | Always `en` |
| `participant_id` | string (required) | Anonymised subject id |
| `session_id` | string (nullable) | Source-specific (e.g. DERCo `article_3`) |
| `task_id` | string (required) | Source-specific task name |
| `trial_index` | int32 (nullable) | Order within the source file |
| `sentence_id` | string (required) | Globally unique, prefixed by `dataset` |
| `sentence_text` | string (required) | The sentence as displayed |
| `sentence_text_normalized` | string (nullable) | Lowercased / stripped |
| `words` | list<string> (required) | Words in order |
| `num_words` | int32 (required) | Length of `words` |
| `num_channels` | int32 (required) | Channel count for this row's EEG |
| `channel_names` | list<string> (required) | Original electrode labels (or `chNN` if not available) |
| `channel_units` | string (nullable) | Free text |
| `sampling_rate_hz` | float64 (required) | Native sampling rate |
| `reference_scheme` | string (nullable) | e.g. `Cz_vertex`, `common_average`, `Muse-headband-internal` |
| `preprocessing` | string (required) | Source-reported preprocessing summary |
| `word_eeg_segments` | `list<list<list<float32>>>` (nullable) | `[words][channels][time]` — fixation- or trial-locked per-word EEG (variable time per word). |
| `sentence_eeg` | `list<list<float32>>` (nullable) | `[channels][time]` — full continuous-sentence EEG (only some sources expose this; e.g. ZuCo `rawData`). |
| `word_onsets_sec` | list<float32> (nullable) | Per-word onset relative to sentence start, when known |
| `word_durations_sec` | list<float32> (nullable) | Per-word epoch length |
| `word_band_features` | `list<list<float32>>` (nullable) | `[words][feature_block × channels]` — pre-computed band-power features (ZuCo only). Order matches `band_feature_names`. |
| `band_feature_names` | list<string> (nullable) | e.g. `["FFD_t1", "FFD_t2", ..., "SFD_g2"]` for ZuCo |
| `fixation_duration_ms` | list<float32> (nullable) | Per-word total fixation time (EMMT) |
| `fixation_count` | list<int32> (nullable) | Per-word number of fixations |
| `first_fixation_duration_ms` | list<float32> (nullable) | ZuCo `FFD` |
| `gaze_duration_ms` | list<float32> (nullable) | ZuCo `GD` |
| `go_past_time_ms` | list<float32> (nullable) | ZuCo `GPT` |
| `total_reading_time_ms` | list<float32> (nullable) | ZuCo `TRT` |
| `saccade_amplitude_px` | list<float32> (nullable) | EMMT |
| `regression_count` | list<int32> (nullable) | EMMT |
| `blink_count` | list<int32> (nullable) | EMMT |
| `topic` | string (nullable) | Document topic (Quoron); article id (DERCo) |
| `selected_topic` | string (nullable) | Quoron only |
| `semantic_relevance` | list<int32> (nullable) | Per-word 0/1 relevance judgement (Quoron) |
| `interestingness` | int32 (nullable) | Quoron |
| `pre_knowledge` | int32 (nullable) | Quoron |
| `cloze_probability` | list<float32> (nullable) | DERCo (computable from the behavioural-prediction CSVs) |
| `comprehension_label` | string (nullable) | Reserved |
| `extra_json` | string (nullable) | Per-dataset metadata (storage layouts, code mappings, etc.) — JSON-encoded |
| `dataset_extra_blob` | binary (nullable) | NPZ-compressed dump of every additional per-row array we did not put in a typed column. **Used by ZuCo only**, where it carries: per-sentence `mean_*` band-power means/diffs/sec; per-word fixation-locked features (`FFD_*`, `GD_*`, `GPT_*`, `TRT_*`, `SFD_*`) for all 8 frequency bands; per-word `rawET` eye-tracking traces, pupil sizes, fixation positions; sentence-level `wordbounds`, `omissionRate`, and answer-period band powers. Nothing in this blob is duplicated from typed columns. |
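
Several Notes in the table above imply per-row consistency rules. The sketch below checks them on a plain `dict`. The function name `check_row_invariants` and the example row are hypothetical, and the rules are a reading of the table, not something taken from `unified_schema.py`.

```python
def check_row_invariants(row):
    """Return a list of problems found in one row (empty list if none)."""
    problems = []
    # `num_words` is documented as the length of `words`.
    if row["num_words"] != len(row["words"]):
        problems.append("num_words != len(words)")
    # Assumption: `channel_names` holds exactly one label per EEG channel.
    if row["num_channels"] != len(row["channel_names"]):
        problems.append("num_channels != len(channel_names)")
    # `sentence_id` is documented as prefixed by the `dataset` value.
    if not row["sentence_id"].startswith(row["dataset"]):
        problems.append("sentence_id is not prefixed by dataset")
    # `sentence_eeg` is nullable; when present it is laid out [channels][time].
    sentence_eeg = row.get("sentence_eeg")
    if sentence_eeg is not None and len(sentence_eeg) != row["num_channels"]:
        problems.append("sentence_eeg channel count != num_channels")
    return problems


# Hypothetical row with made-up values; only the columns the checks need.
example_row = {
    "dataset": "zuco_v1_sr",
    "sentence_id": "zuco_v1_sr__hypothetical-0001",
    "words": ["A", "short", "sentence"],
    "num_words": 3,
    "num_channels": 2,
    "channel_names": ["ch00", "ch01"],
    "sentence_eeg": [[0.0, 0.1], [0.2, 0.3]],
}
print(check_row_invariants(example_row))
```

Returning a list of messages instead of raising on the first failure makes it easier to summarise problems across a whole shard.
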
### EEG storage convention
Each per-word segment in `word_eeg_segments` is stored **channels-first**
(`[channels][time]`), matching the MNE convention. EMMT's source `[time][channels]`
layout is transposed on conversion. The original layout is recorded under the
`original_eeg_layout` key of `extra_json`.
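
As an illustration of the channels-first convention, the following sketch turns the nested lists of one row into NumPy arrays and shows the transpose applied to a time-first source. The helper names and the toy values are hypothetical, and the sketch assumes every word has a non-empty, rectangular segment.

```python
import numpy as np


def word_segments_to_arrays(word_eeg_segments, num_channels):
    """Convert [words][channels][time] nested lists to (channels, time) arrays."""
    arrays = []
    for index, segment in enumerate(word_eeg_segments):
        arr = np.asarray(segment, dtype=np.float32)
        # Each word keeps its own time length, so only the channel axis is checked.
        if arr.ndim != 2 or arr.shape[0] != num_channels:
            raise ValueError(f"word {index}: unexpected shape {arr.shape}")
        arrays.append(arr)
    return arrays


def to_channels_first(time_first):
    """Transpose a [time][channels] array into [channels][time]."""
    return np.asarray(time_first, dtype=np.float32).T


# Toy row fragment: 2 words, 2 channels, 3 and 2 time samples respectively.
segments = [
    [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    [[1.0, 1.1], [1.2, 1.3]],
]
sampling_rate_hz = 256.0  # made-up value for this toy example

arrays = word_segments_to_arrays(segments, num_channels=2)
# Epoch length in seconds follows from the sample count and the native rate.
durations = [arr.shape[1] / sampling_rate_hz for arr in arrays]
print([arr.shape for arr in arrays], durations)
```

Because the time axis varies per word, the segments are kept as a list of arrays and are not stacked into one tensor. Any padding strategy is left to the training code.
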
### Why some columns are NULL for some sources
Each source dataset only provides a subset of the available metadata. We never invent
values. For example:
- ZuCo carries the richest set of per-word band features and eye-tracking metrics.
- EMMT provides per-word EEG + a small set of gaze counts but no band features.
- Quoron provides per-word EEG + topic-relevance labels but no eye-tracking.
- DERCo provides per-word EEG and per-article cloze probabilities (the
`cloze_probability` column will be filled from the behavioural CSVs in a follow-up).
Use `dataset` to filter rows when training a model that requires a specific feature.
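
A minimal sketch of such a filter, written as a plain predicate over row dictionaries. The name `has_features` and the toy rows are hypothetical. With the `datasets` library the same predicate can be passed to `Dataset.filter`, for example `ds.filter(lambda r: has_features(r, required_columns=["word_band_features"]))`.

```python
def has_features(row, allowed_datasets=None, required_columns=()):
    """True if the row comes from an allowed source and has every required column set."""
    if allowed_datasets is not None and row["dataset"] not in allowed_datasets:
        return False
    # Nullable columns are None when the source does not provide them.
    return all(row.get(column) is not None for column in required_columns)


# Toy rows with made-up values, reduced to the columns used by the filter.
rows = [
    {"dataset": "zuco_v1_sr", "word_band_features": [[0.1, 0.2]]},
    {"dataset": "emmt", "word_band_features": None},
    {"dataset": "eeg_sem_relev", "word_band_features": None},
]

# Keep only ZuCo rows that actually carry band-power features.
zuco_tasks = {"zuco_v1_sr", "zuco_v1_nr", "zuco_v1_tsr", "zuco_v2_nr", "zuco_v2_tsr"}
with_band_features = [
    row for row in rows
    if has_features(row, allowed_datasets=zuco_tasks,
                    required_columns=["word_band_features"])
]
print([row["dataset"] for row in with_band_features])
```

Checking for non-NULL columns in addition to the `dataset` value guards against rows where a source-level feature happens to be missing.
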
### Dedup decisions
The user's brief was: *"if there is repetition, leave those, otherwise do everything."*
None of the following candidate sources was added as a separate `dataset`, because each
is a strict subset or derived product of a source we already include:
- `pyinglie/EEG-To-Text_Datasets` — used only as a fast HF mirror of the ZuCo MAT
files; the data is the original ZuCo content.
- `pyinglie/EEG-To-Text_Preprocessed` — only contains BART-style pickled features
derived from the same MAT files; raw EEG is in `zuco_v*_*` rows.
- `sajjad5221/eeg2text-emmt-dataset` (non-unique) — superseded by the
`-unique-sentences` variant, which has 23 fewer duplicate samples and also adds the
per-word gaze feature columns.
### Re-using and reproducing
Every row carries `source_url` and `license`, and ZuCo rows additionally carry
`dataset_extra_blob`, so you can always trace back to the upstream and reload the
source-specific arrays:
```python
import io

import numpy as np
from datasets import load_dataset

# Load a single ZuCo shard; ZuCo rows populate both `sentence_eeg` and the blob.
ds = load_dataset(
    "tankalapavankalyan/exp01-eeg-to-text-sentences",
    data_files="data/zuco_v1_sr__sub-ZAB.parquet",
    split="train",
)
row = ds[0]
print(row["sentence_text"])
print("EEG shape (chan, time):",
      len(row["sentence_eeg"]), len(row["sentence_eeg"][0]))

# `dataset_extra_blob` is an NPZ archive stored as bytes (NULL outside ZuCo).
extras = np.load(io.BytesIO(row["dataset_extra_blob"]))
print("extras keys:", extras.files[:10])
```
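
Since `extra_json` and `dataset_extra_blob` are both nullable, a loader that works across sources needs to tolerate missing values. The sketch below is one way to do that. The helper `decode_extras`, the array name `mean_example` and the layout string are hypothetical, and it assumes the blob holds only plain numeric arrays (object arrays would need `allow_pickle=True`).

```python
import io
import json

import numpy as np


def decode_extras(row):
    """Return (metadata dict, {name: array}) for one row, tolerating NULL columns."""
    metadata = json.loads(row["extra_json"]) if row.get("extra_json") else {}
    arrays = {}
    blob = row.get("dataset_extra_blob")
    if blob is not None:
        # The blob is an NPZ archive held in memory as bytes.
        with np.load(io.BytesIO(blob)) as npz:
            arrays = {name: npz[name] for name in npz.files}
    return metadata, arrays


# Build a toy blob in memory so the example runs without downloading anything.
buffer = io.BytesIO()
np.savez_compressed(buffer, mean_example=np.zeros(3, dtype=np.float32))
toy_row = {
    "extra_json": json.dumps({"original_eeg_layout": "[time][channels]"}),
    "dataset_extra_blob": buffer.getvalue(),
}

metadata, arrays = decode_extras(toy_row)
print(metadata["original_eeg_layout"], sorted(arrays))

# A row from a source without extras simply yields empty containers.
print(decode_extras({"extra_json": None, "dataset_extra_blob": None}))
```
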
## Conversion pipeline
The full pipeline is a streaming download → convert → push → delete loop with peak
local disk usage bounded to ~2 GB at any time. The orchestrator is
`experiments/exp01_finetune_foundation_models/scripts/preprocess/stream_pipeline.py`,
backed by per-source converters (`convert_emmt.py`, `convert_quoron.py`,
`convert_zuco.py`, `convert_derco.py`) and the unified schema definition in
`experiments/exp01_finetune_foundation_models/src/data/unified_schema.py`. Source
files are downloaded one at a time from Hugging Face mirrors (or OSF for DERCo),
converted in `/tmp`, uploaded to this repo, and deleted before the next file is
fetched. Nothing persists in the project tree.
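
The loop described above can be sketched as follows. This is not the code in `stream_pipeline.py`: the function `stream_convert` and its `download`, `convert` and `upload` callables are hypothetical stand-ins, and the fake callables below only write tiny placeholder files.

```python
import tempfile
from pathlib import Path


def stream_convert(sources, download, convert, upload):
    """Process sources one at a time so only one file's artefacts are on disk."""
    done = []
    for source in sources:
        # A fresh scratch directory per source, created under the system temp dir.
        with tempfile.TemporaryDirectory() as workdir:
            raw_path = download(source, Path(workdir))
            parquet_path = convert(raw_path, Path(workdir))
            upload(parquet_path)
            done.append(parquet_path.name)
        # The scratch directory is deleted here, before the next download starts.
    return done


# Fake callables standing in for the real download / convert / push steps.
seen_dirs, uploaded = [], []


def fake_download(source, workdir):
    seen_dirs.append(workdir)
    path = workdir / f"{source}.mat"
    path.write_bytes(b"raw")
    return path


def fake_convert(raw_path, workdir):
    out = workdir / f"{raw_path.stem}.parquet"
    out.write_bytes(b"converted")
    return out


def fake_upload(parquet_path):
    uploaded.append(parquet_path.name)


print(stream_convert(["a", "b"], fake_download, fake_convert, fake_upload))
```

Scoping the scratch directory to a single iteration is what keeps peak disk usage tied to the largest individual source file, not to the size of the whole corpus.
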
Provider:
tankalapavankalyan



