Pallier 2025 (LittlePrince Listen)

pnpl.datasets.Pallier2025 wraps the audiobook-listening MEG dataset released by Pallier et al. (2025) on OpenNeuro (ds007523, LittlePrince_MEG_French_Listen_Pallier2025).

- 58 native French adults × 1 session × 9 runs (one ~10-min audiobook segment per run)
- Elekta Neuromag TRIUX MEG, 306 channels (102 magnetometers + 204 planar gradiometers), raw .fif
- Sampled at 1000 Hz (online HP 0.1 Hz / LP 330 Hz, 50 Hz line frequency)
- Open access, no auth required
- ~478 GB total / ~8 GB per subject

The companion paper is d’Ascoli, Bel, Rapin, King et al., "A zero-shot decoder of words from M/EEG signals", Nat Commun 16:10521 (2025).

Quickstart

from pnpl.datasets import Pallier2025
from pnpl.tasks.pallier2025 import WordClassification

ds = Pallier2025(
    data_path="./data/pallier2025",
    task=WordClassification(tmin=0.0, tmax=3.0),
    include_subjects=["01"],
    include_runs=["01"],  # one ~10 min audiobook segment
    preprocessing="notch+bp+ds",  # 50/100 Hz notch, 0.1–125 Hz bp, 250 Hz resample
    download=True,
    standardize=True,
)

x, y = ds[0]
print(x.shape, y.item())  # (306, 750), some int word id

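The 750 in the shape comment above follows from the window length and the sampling rate after resampling. A minimal sketch of that arithmetic, assuming the sample count is round((tmax - tmin) * sfreq) with the end point excluded; the helper name is hypothetical and not part of pnpl, which may count the end point differently:

```python
def window_samples(tmin: float, tmax: float, sfreq: float) -> int:
    """Number of samples in a [tmin, tmax) window at a given sampling rate.

    Hypothetical helper for illustration only; not a pnpl function.
    """
    if tmax <= tmin:
        raise ValueError("tmax must be greater than tmin")
    return round((tmax - tmin) * sfreq)


# Quickstart settings: 3 s window at the default 250 Hz resample rate.
print(window_samples(0.0, 3.0, 250.0))  # 750

# Same window at the 50 Hz rate used in the paper's preprocessing.
print(window_samples(0.0, 3.0, 50.0))  # 150
```

Lowering the resample rate shrinks the time axis of each sample proportionally, while the 306-channel axis is unaffected.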
The first construction downloads the requested raw FIF (~870 MB per
run) and its events.tsv sidecar from OpenNeuro’s public S3 bucket,
runs the preprocessing pipeline chunk-by-chunk to keep memory bounded,
and caches the result as H5. Subsequent constructions read directly
from the cached H5.
Note
ds007523 was recorded with Elekta active shielding (MaxShield)
enabled, which sets a header tag MNE refuses by default. The loader
acknowledges the tag and passes the raw signal through to the
preprocessing pipeline; expect a one-time RuntimeWarning per run on
first preprocessing. The companion paper deliberately skips MaxFilter
/ SSS, so this is the intended path.
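For readers who open the raw files outside pnpl, the MaxShield tag is what the allow_maxshield argument of MNE-Python's read_raw_fif controls. A minimal sketch, assuming MNE-Python is installed; the function name and file path are placeholders, and the source does not show whether pnpl's loader makes this exact call:

```python
def read_maxshield_raw(fif_path: str):
    """Open a MaxShield-tagged FIF file without loading the data into memory.

    Hypothetical helper for illustration only; not a pnpl function.
    """
    # Imported lazily so the helper can be defined without MNE installed.
    import mne

    # allow_maxshield=True accepts the file and emits a RuntimeWarning;
    # allow_maxshield="yes" accepts it without the warning.
    return mne.io.read_raw_fif(fif_path, allow_maxshield=True, preload=False)
```

Calling read_maxshield_raw("path/to/run.fif") returns an MNE Raw object whose info["sfreq"] should report the 1000 Hz acquisition rate.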
BIDS axes

| Axis | Values |
|---|---|
| subject | 58 subjects, e.g. "01" |
| session | 1 session |
| task | |
| run | 9 runs per subject, e.g. "01" |

Run keys are 4-tuples (subject, session, task, run). The include_* / exclude_* filters narrow the set; pass include_run_keys=[(subject, session, task, run), ...] for fully specified inclusion.

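How include and exclude filters narrow a set of 4-tuple run keys can be sketched in plain Python. This is an illustration under assumptions, not pnpl's implementation: the function name is hypothetical, a key is assumed to survive only if it passes every include filter and no exclude filter, and the session and task labels are placeholders.

```python
from itertools import product


def filter_run_keys(keys, include_subjects=None, exclude_subjects=(),
                    include_runs=None, exclude_runs=()):
    """Keep (subject, session, task, run) keys that pass all filters.

    Hypothetical helper for illustration only; not a pnpl function.
    """
    kept = []
    for key in keys:
        subject, _session, _task, run = key
        if include_subjects is not None and subject not in include_subjects:
            continue
        if subject in exclude_subjects:
            continue
        if include_runs is not None and run not in include_runs:
            continue
        if run in exclude_runs:
            continue
        kept.append(key)
    return kept


# Placeholder axis values; the real session and task labels come from the dataset.
all_keys = list(product(["01", "02"], ["SESSION"], ["TASK"], ["01", "02", "03"]))

print(filter_run_keys(all_keys, include_subjects={"01"}, exclude_runs={"03"}))
# [('01', 'SESSION', 'TASK', '01'), ('01', 'SESSION', 'TASK', '02')]
```

With no filters the full cross product is returned, which mirrors the idea that filters only narrow the set.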
Reproducing the d’Ascoli 2025 preprocessing

The paper applies a 0.1–40 Hz band-pass filter and resamples to 50 Hz, with no notch filter or SSS. Each pipeline step can be configured individually:

ds = Pallier2025(
    data_path="./data/pallier2025",
    task=WordClassification(tmin=0.0, tmax=3.0),
    include_subjects=["01"],
    preprocessing="bp+ds",
    preprocessing_config={
        "bp": {"l_freq": 0.1, "h_freq": 40.0},
        "ds": {"sfreq": 50.0},
    },
)

See Preprocessing for the full step-config mechanism (defaults < JSON < dataset config).
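The precedence order defaults < JSON < dataset config can be pictured as a layered dictionary merge in which later layers win. A minimal sketch, assuming a key-by-key merge within each step; the function name and the JSON layer's contents are hypothetical, and the actual merge rules are documented on the Preprocessing page:

```python
import json


def resolve_step_config(defaults, json_config, dataset_config):
    """Merge per-step settings so later layers win: defaults < JSON < dataset.

    Hypothetical helper for illustration only; not a pnpl function.
    """
    resolved = {}
    for layer in (defaults, json_config, dataset_config):
        for step, params in layer.items():
            # Merge key by key so a layer can override one parameter
            # without restating the rest of that step's settings.
            resolved.setdefault(step, {}).update(params)
    return resolved


# Defaults taken from the Quickstart comment: 0.1–125 Hz bp, 250 Hz resample.
defaults = {"bp": {"l_freq": 0.1, "h_freq": 125.0}, "ds": {"sfreq": 250.0}}
json_config = json.loads('{"bp": {"h_freq": 100.0}}')  # hypothetical JSON layer
dataset_config = {"bp": {"h_freq": 40.0}, "ds": {"sfreq": 50.0}}

print(resolve_step_config(defaults, json_config, dataset_config))
# {'bp': {'l_freq': 0.1, 'h_freq': 40.0}, 'ds': {'sfreq': 50.0}}
```

In this sketch the band-pass low cutoff survives from the defaults because no later layer overrides it.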
Selected arguments

- task: currently pnpl.tasks.pallier2025.WordClassification.
- preprocessing: pipeline string used in derivative filenames. Default "notch+bp+ds". None materializes H5 from the raw FIF unchanged.
- preprocessing_config: per-step overrides forwarded to pnpl.preprocessing.Pipeline.
- include_subjects, include_sessions, include_tasks, include_runs, include_run_keys (and their exclude_* counterparts).
- standardize, clipping_boundary, channel_means, channel_stds.
- create_h5_if_missing (default True).
- preload_h5 (default False).

Available task

WordClassification(tmin, tmax, min_word_length, max_word_length, keep_top_k): samples are windowed around each word onset.

- tmin, tmax: defaults 0.0, 3.0 to match d’Ascoli 2025.
- min_word_length: drop short / single-letter tokens. The audiobook tokenization includes elided particles (e.g. j, l), which you may want to filter out.
- keep_top_k: restrict to the k most-frequent tokens across the requested runs. Useful for the paper’s “top-250” evaluation.

Sample tuple: (subject, session, task, run, onset, word_str). The label is the word’s index in the resolved vocabulary.
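How length filtering and keep_top_k could yield integer labels can be sketched with a frequency count. This is an illustration under assumptions, not pnpl's implementation: the function name is hypothetical, word length is assumed to be counted in characters, and the vocabulary is assumed to be ordered by descending frequency with alphabetical tie-breaking.

```python
from collections import Counter


def build_vocab(words, min_word_length=1, max_word_length=None, keep_top_k=None):
    """Map each retained word to an integer label.

    Hypothetical helper for illustration only; not a pnpl function.
    """
    kept = [
        w for w in words
        if len(w) >= min_word_length
        and (max_word_length is None or len(w) <= max_word_length)
    ]
    counts = Counter(kept)
    # Sort by descending frequency, breaking ties alphabetically for determinism.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    if keep_top_k is not None:
        ranked = ranked[:keep_top_k]
    return {word: index for index, word in enumerate(ranked)}


tokens = ["le", "petit", "prince", "j", "ai", "le", "prince", "l", "le"]
vocab = build_vocab(tokens, min_word_length=2, keep_top_k=2)
print(vocab)  # {'le': 0, 'prince': 1}
```

Setting min_word_length=2 drops the elided particles j and l before counting, so they neither receive a label nor take up a top-k slot.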