Pallier 2025 (LittlePrince Listen)

pnpl.datasets.Pallier2025 wraps the audiobook-listening MEG dataset released by Pallier et al. (2025) on OpenNeuro (ds007523, LittlePrince_MEG_French_Listen_Pallier2025).

  • 58 native French adults × 1 session × 9 runs (one ~10-min audiobook segment per run)

  • Elekta Neuromag TRIUX MEG, 306 channels (102 magnetometers + 204 planar gradiometers), raw .fif

  • Sampled at 1000 Hz (online HP 0.1 Hz / LP 330 Hz, 50 Hz line frequency)

  • Open access — no auth required

  • ~478 GB total / ~8 GB per subject
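
For planning disk space before a partial download, a back-of-the-envelope estimate can be derived from the figures above. The sketch below assumes the ~8 GB per subject is spread evenly over the 9 runs; estimate_download_gb is a hypothetical helper, not part of pnpl, and the result is only approximate.

```python
# Rough raw-download size for a subset of the dataset.
# Assumption: the ~8 GB per subject is split evenly across the 9 runs.
GB_PER_SUBJECT = 8.0
RUNS_PER_SUBJECT = 9


def estimate_download_gb(n_subjects: int, n_runs: int = RUNS_PER_SUBJECT) -> float:
    """Return an approximate raw FIF download size in GB (hypothetical helper)."""
    if not 0 <= n_runs <= RUNS_PER_SUBJECT:
        raise ValueError("n_runs must be between 0 and 9")
    return n_subjects * n_runs * GB_PER_SUBJECT / RUNS_PER_SUBJECT


print(f"{estimate_download_gb(1, 1):.2f} GB")  # one subject, one run
print(f"{estimate_download_gb(5):.1f} GB")     # five subjects, all nine runs
```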

The companion paper is d’Ascoli, Bel, Rapin, King et al., Nat Commun 16:10521 (2025) — A zero-shot decoder of words from M/EEG signals.

Quickstart

from pnpl.datasets import Pallier2025
from pnpl.tasks.pallier2025 import WordClassification

ds = Pallier2025(
    data_path="./data/pallier2025",
    task=WordClassification(tmin=0.0, tmax=3.0),
    include_subjects=["01"],
    include_runs=["01"],            # one ~10 min audiobook segment
    preprocessing="notch+bp+ds",    # 50/100 Hz notch, 0.1–125 Hz bp, 250 Hz resample
    download=True,
    standardize=True,
)

x, y = ds[0]
print(x.shape, y.item())   # (306, 750), some int word id

The first construction downloads the requested raw FIF (~870 MB per run) and its events.tsv sidecar from OpenNeuro’s public S3 bucket, runs the preprocessing pipeline chunk-by-chunk to keep memory bounded, and caches the result as H5. Subsequent constructions read directly from the cached H5.
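
The (306, 750) shape printed in the Quickstart follows from the window length and the post-resampling rate: 3.0 s at 250 Hz gives 750 samples. A minimal sketch of that arithmetic, assuming half-open windows and sample indices rounded to the nearest integer (the loader's exact rounding is not documented here; window_bounds is a hypothetical helper and the onset value is made up):

```python
def window_bounds(onset_s, tmin, tmax, sfreq):
    """Return (start, stop) sample indices for a window around a word onset."""
    n_samples = int(round((tmax - tmin) * sfreq))  # window length in samples
    start = int(round((onset_s + tmin) * sfreq))   # first sample of the window
    return start, start + n_samples                # stop index is exclusive


# Default pipeline: 250 Hz after resampling, 3 s window.
start, stop = window_bounds(onset_s=12.4, tmin=0.0, tmax=3.0, sfreq=250.0)
print(stop - start)  # 750 samples per window
```

With the paper's 50 Hz resampling, the same 3 s window would span 150 samples under these assumptions.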

Note

ds007523 was recorded with Elekta active shielding (MaxShield) enabled, which sets a header tag MNE refuses by default. The loader acknowledges the tag and passes the raw signal through to the preprocessing pipeline; expect a one-time RuntimeWarning per run on first preprocessing. The companion paper deliberately skips MaxFilter / SSS, so this is the intended path.
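
For reference, when reading one of these files directly with MNE-Python outside the loader, the same acknowledgement is made through the allow_maxshield argument of mne.io.read_raw_fif. The sketch below is not the loader's code; read_maxshield_raw is a hypothetical wrapper.

```python
def read_maxshield_raw(fif_path, silence_warning=False):
    """Open a MaxShield-recorded FIF without loading the data into memory."""
    import mne  # imported lazily so the helper can be defined without MNE installed

    # allow_maxshield=True loads the file and emits a warning;
    # allow_maxshield="yes" loads it without the warning.
    allow = "yes" if silence_warning else True
    return mne.io.read_raw_fif(fif_path, allow_maxshield=allow, preload=False)
```

Calling read_maxshield_raw on a downloaded run returns a lazily loaded Raw object, which can then be filtered and resampled without SSS, in line with the companion paper.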

BIDS axes

Axis       Values
subject    "01" to "58"
session    "01" (the only one)
task       "listen" (the only one)
run        "01" to "09"

Run keys are 4-tuples (subject, session, task, run). The include_* / exclude_* filters narrow the set; pass include_run_keys=[(subject, session, task, run), ...] for fully specified inclusion.
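
A short sketch of how a fully specified include_run_keys list could be built from the axes above. It uses only the standard library; the particular selection (first three runs of two subjects) is an arbitrary example.

```python
from itertools import product

subjects = [f"{i:02d}" for i in range(1, 59)]  # "01" .. "58"
sessions = ["01"]
tasks = ["listen"]
runs = [f"{i:02d}" for i in range(1, 10)]      # "01" .. "09"

# Every (subject, session, task, run) combination in the dataset.
all_run_keys = list(product(subjects, sessions, tasks, runs))

# Example selection: the first three runs of subjects 01 and 02.
include_run_keys = [
    key for key in all_run_keys
    if key[0] in {"01", "02"} and key[3] in {"01", "02", "03"}
]
print(len(all_run_keys), len(include_run_keys))  # 522 6
```

The resulting list can be passed as include_run_keys=include_run_keys when constructing the dataset.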

Reproducing the d’Ascoli 2025 preprocessing

The paper uses a 0.1–40 Hz bandpass and resamples to 50 Hz, with no notch filter or SSS. Each pipeline step can be configured individually:

from pnpl.datasets import Pallier2025
from pnpl.tasks.pallier2025 import WordClassification

ds = Pallier2025(
    data_path="./data/pallier2025",
    task=WordClassification(tmin=0.0, tmax=3.0),
    include_subjects=["01"],
    preprocessing="bp+ds",
    preprocessing_config={
        "bp": {"l_freq": 0.1, "h_freq": 40.0},
        "ds": {"sfreq": 50.0},
    },
)

See Preprocessing for the full step-config mechanism (defaults < JSON < dataset config).
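
The precedence order (defaults < JSON < dataset config) can be pictured as a layered merge in which later layers win. The sketch below is an illustration only: merge_step_configs is a hypothetical helper, the JSON layer is invented, and a key-by-key merge within each step is an assumption (the real mechanism may instead replace a step's config wholesale).

```python
import copy


def merge_step_configs(*layers):
    """Merge per-step config dicts; later layers override earlier ones key by key."""
    merged = {}
    for layer in layers:
        for step, params in layer.items():
            merged.setdefault(step, {}).update(copy.deepcopy(params))
    return merged


# Illustrative values: the defaults mirror the Quickstart comment.
defaults = {"bp": {"l_freq": 0.1, "h_freq": 125.0}, "ds": {"sfreq": 250.0}}
json_layer = {"bp": {"h_freq": 100.0}}  # hypothetical JSON override
dataset_config = {"bp": {"l_freq": 0.1, "h_freq": 40.0}, "ds": {"sfreq": 50.0}}

resolved = merge_step_configs(defaults, json_layer, dataset_config)
print(resolved)  # {'bp': {'l_freq': 0.1, 'h_freq': 40.0}, 'ds': {'sfreq': 50.0}}
```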

Selected arguments

  • task — currently pnpl.tasks.pallier2025.WordClassification.

  • preprocessing — pipeline string used in derivative filenames. Default "notch+bp+ds". None materializes H5 from the raw FIF unchanged.

  • preprocessing_config — per-step overrides forwarded to pnpl.preprocessing.Pipeline.

  • include_subjects, include_sessions, include_tasks, include_runs, include_run_keys (and their exclude_* counterparts).

  • standardize, clipping_boundary, channel_means, channel_stds.

  • create_h5_if_missing (default True).

  • preload_h5 (default False).
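
As a rough illustration of what the standardize, clipping_boundary, channel_means and channel_stds arguments are likely to control, the sketch below z-scores each channel and then clips the result. This is an assumed reading of the argument names, not the library's code: standardize_and_clip is a hypothetical helper, and it falls back to per-array statistics where the dataset presumably uses precomputed ones.

```python
import numpy as np


def standardize_and_clip(x, channel_means=None, channel_stds=None, clipping_boundary=None):
    """Z-score each channel of a (channels, times) array, then optionally clip."""
    x = np.asarray(x, dtype=np.float64)
    # Fall back to statistics of the array itself when none are supplied.
    means = x.mean(axis=1) if channel_means is None else np.asarray(channel_means)
    stds = x.std(axis=1) if channel_stds is None else np.asarray(channel_stds)
    z = (x - means[:, None]) / stds[:, None]
    if clipping_boundary is not None:
        # Limit outliers to [-clipping_boundary, +clipping_boundary].
        z = np.clip(z, -clipping_boundary, clipping_boundary)
    return z


rng = np.random.default_rng(0)
window = rng.normal(loc=5.0, scale=2.0, size=(306, 750))  # synthetic stand-in for one sample
z = standardize_and_clip(window, clipping_boundary=10.0)
print(z.shape)  # (306, 750)
```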

Available task

WordClassification(tmin, tmax, min_word_length, max_word_length, keep_top_k) — windowed around each word onset.

  • tmin, tmax — defaults 0.0, 3.0 to match d’Ascoli 2025.

  • min_word_length — drop short / single-letter tokens. The audiobook tokenization includes elided particles (e.g. j, l) which you may want to filter out.

  • keep_top_k — restrict to the k most-frequent tokens across the requested runs. Useful for the paper’s “top-250” evaluation.

Sample tuple: (subject, session, task, run, onset, word_str). The label is the word’s index in the resolved vocabulary.
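
To make the filtering and label assignment concrete, the following sketch resolves a vocabulary from a token list. It is a guess at the behaviour rather than the library's implementation: resolve_vocabulary is a hypothetical helper, max_word_length is assumed to be an upper bound on token length, and labels are assumed to follow frequency rank.

```python
from collections import Counter


def resolve_vocabulary(words, min_word_length=1, max_word_length=None, keep_top_k=None):
    """Return {word: label} after length filtering and optional top-k selection."""
    kept = [
        w for w in words
        if len(w) >= min_word_length
        and (max_word_length is None or len(w) <= max_word_length)
    ]
    counts = Counter(kept)
    # Rank by descending frequency, breaking ties alphabetically for determinism.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    if keep_top_k is not None:
        ranked = ranked[:keep_top_k]
    return {word: label for label, word in enumerate(ranked)}


tokens = ["le", "petit", "prince", "j", "ai", "le", "prince", "l", "le"]
vocab = resolve_vocabulary(tokens, min_word_length=2, keep_top_k=2)
print(vocab)  # {'le': 0, 'prince': 1}
```

Here the single-letter tokens j and l are dropped by min_word_length=2, and only the two most frequent remaining words receive labels.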