Gwilliams 2022 (MEG-MASC)
pnpl.datasets.Gwilliams2022 wraps the MEG-MASC corpus (Gwilliams
et al., 2022; arXiv:2208.11488)
released on OSF (project ag3kj).
- 27 subjects, up to 2 sessions, 4 stories per session
- 208-channel KIT/Yokogawa MEG, raw .con files
- Open access; no auth required
The dataset is split across four sibling OSF components because of OSF’s per-component storage cap; the loader transparently aggregates files across all four.
Quickstart
from pnpl.datasets import Gwilliams2022
from pnpl.tasks.gwilliams2022 import PhonemeClassification

ds = Gwilliams2022(
    data_path="./data/meg_masc",
    task=PhonemeClassification(tmin=-0.2, tmax=0.6),
    include_subjects=["01"],
    include_sessions=["0"],
    include_tasks=["0"],          # story 0 = "lw1"
    preprocessing="notch+bp+ds",  # 50/100 Hz notch, 0.1–125 Hz bp, 250 Hz resample
    download=True,
    standardize=True,
)

x, y = ds[0]
print(x.shape, y.item())
The first construction downloads the raw KIT recording (plus the
markers.mrk and acq-{ELP,HSP}_headshape.pos sidecars KIT requires),
runs the preprocessing pipeline, and caches the result as H5. Each
subject-session is roughly 3–4 GB of raw .con plus a few MB of
sidecars — start small, then scale up.
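To make "start small" concrete, the figures quoted on this page give a rough ceiling for a full raw download. This is back-of-the-envelope arithmetic only. It assumes every subject has both sessions, which the release does not guarantee, so the real total will be lower.

```python
# Figures taken from this page: 27 subjects, up to 2 sessions,
# roughly 3-4 GB of raw .con data per subject-session.
N_SUBJECTS = 27
MAX_SESSIONS = 2
GB_PER_SUBJECT_SESSION = (3, 4)  # (low, high) estimate

# Multiply out the low and high per-unit estimates.
low, high = (N_SUBJECTS * MAX_SESSIONS * gb for gb in GB_PER_SUBJECT_SESSION)
print(f"Upper bound on raw download: {low}-{high} GB")
```

A single subject-session, as in the Quickstart, stays within the 3 to 4 GB range.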
BIDS axes
| Axis | Values |
|---|---|
| subject | |
| session | |
| task | |
| run | always |
Run keys are 4-tuples (subject, session, task, run). The release is
not fully balanced — some subjects only have one session — so the
loader skips run keys whose files aren’t in the OSF manifest and
prints a warning.
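The skip-and-warn behaviour described above can be sketched as follows. This is an illustration, not the loader's actual code. The select_run_keys helper, the manifest represented as a set of tuples, and the PLACEHOLDER_RUN value are all hypothetical.

```python
import itertools
import warnings

# Hypothetical stand-in for the fixed run value; the real value is not assumed here.
PLACEHOLDER_RUN = "x"

def select_run_keys(subjects, sessions, tasks, manifest, run=PLACEHOLDER_RUN):
    """Return the requested (subject, session, task, run) keys present in `manifest`."""
    kept = []
    for subject, session, task in itertools.product(subjects, sessions, tasks):
        key = (subject, session, task, run)
        if key in manifest:
            kept.append(key)
        else:
            # Missing files are skipped with a warning rather than raising.
            warnings.warn(f"Skipping run key {key}: not in manifest", stacklevel=2)
    return kept

# Hypothetical manifest in which subject "02" only has session "0".
manifest = {
    ("01", "0", "0", PLACEHOLDER_RUN),
    ("01", "1", "0", PLACEHOLDER_RUN),
    ("02", "0", "0", PLACEHOLDER_RUN),
}

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    keys = select_run_keys(["01", "02"], ["0", "1"], ["0"], manifest)

print(len(keys), "run keys kept;", len(caught), "skipped")
```

Requesting the full subject by session grid and letting missing combinations drop out keeps calling code simple when the release is unbalanced.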
Selected arguments
- task: any object implementing TaskProtocol. Built-ins live in pnpl.tasks.gwilliams2022 (PhonemeClassification, WordClassification).
- preprocessing: pipeline string used in derivative filenames (e.g. "notch+bp+ds"). Set to None to materialize H5 from the raw KIT recording without filtering.
- preprocessing_config: per-step overrides forwarded to pnpl.preprocessing.Pipeline. See Preprocessing.
- include_subjects, include_sessions, include_tasks, include_run_keys (and their exclude_* counterparts): narrow the set of run keys.
- standardize, clipping_boundary, channel_means, channel_stds: see StandardizationMixin.
- create_h5_if_missing (default True): when there is no cached H5 on OSF, run the pipeline locally and cache it.
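As a rough picture of what the standardization arguments plausibly control, here is a per-channel z-score followed by symmetric clipping. It is a sketch under assumptions and does not reproduce StandardizationMixin. The standardize_channels function is hypothetical, and the real implementation may compute statistics differently (for example over the whole dataset rather than per sample).

```python
import statistics

def standardize_channels(data, clipping_boundary=None,
                         channel_means=None, channel_stds=None):
    """Z-score each channel (row) of `data`, then optionally clip to +/- boundary.

    Precomputed means/stds are used when given; otherwise they are
    estimated from the channel itself.
    """
    out = []
    for i, channel in enumerate(data):
        mean = channel_means[i] if channel_means is not None else statistics.fmean(channel)
        std = channel_stds[i] if channel_stds is not None else statistics.pstdev(channel)
        std = std or 1.0  # avoid dividing by zero on a flat channel
        z = [(v - mean) / std for v in channel]
        if clipping_boundary is not None:
            z = [max(-clipping_boundary, min(clipping_boundary, v)) for v in z]
        out.append(z)
    return out

# Toy data: channel 0 has one large outlier, channel 1 is flat.
data = [[0.0, 1.0, 2.0, 3.0, 100.0], [5.0, 5.0, 5.0, 5.0, 5.0]]
z = standardize_channels(data, clipping_boundary=1.5)
print(z[0][-1])  # the outlier is clipped to the boundary
```

Clipping after standardization limits the influence of artifacts such as sensor jumps, which is the usual motivation for exposing a clipping bound alongside the channel statistics.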
Available tasks
from pnpl.tasks.gwilliams2022 import PhonemeClassification, WordClassification
- PhonemeClassification(tmin, tmax, label_type="phoneme" | "voicing"): sample windowed around each phoneme onset.
- WordClassification(tmin, tmax, require_pronounced=True): sample windowed around each word onset; label is the word string.
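To illustrate what windowing around an onset means in samples, the sketch below maps an onset time plus tmin and tmax to a sample range. The onset_window helper is hypothetical, and the half-open window is an assumption; the tasks may include the end sample or round differently. The 250 Hz rate matches the resample step in the Quickstart's "notch+bp+ds" pipeline.

```python
def onset_window(onset_s, tmin, tmax, sfreq):
    """Map an onset time in seconds to a half-open [start, stop) sample range."""
    start = round((onset_s + tmin) * sfreq)
    stop = start + round((tmax - tmin) * sfreq)
    return start, stop

# An onset at 12 s with the Quickstart window (-0.2 s to 0.6 s) at 250 Hz.
start, stop = onset_window(onset_s=12.0, tmin=-0.2, tmax=0.6, sfreq=250)
print(start, stop, stop - start)
```

Under these assumptions each sample spans 0.8 s of signal, i.e. 200 time points per channel.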