pnpl.datasets.libribrain2025.phoneme_dataset.LibriBrainPhoneme

class pnpl.datasets.libribrain2025.phoneme_dataset.LibriBrainPhoneme(data_path, partition=None, label_type='phoneme', preprocessing_str='bads+headpos+sss+notch+bp+ds', tmin=0.0, tmax=0.5, include_run_keys=[], exclude_run_keys=[], exclude_tasks=[], standardize=True, clipping_boundary=10, channel_means=None, channel_stds=None, include_info=False, preload_files=True, download=True)[source]
Parameters:
  • data_path (str)

  • partition (str | None)

  • label_type (str)

  • preprocessing_str (str | None)

  • tmin (float)

  • tmax (float)

  • include_run_keys (list[str])

  • exclude_run_keys (list[str])

  • exclude_tasks (list[str])

  • standardize (bool)

  • clipping_boundary (float | None)

  • channel_means (ndarray | None)

  • channel_stds (ndarray | None)

  • include_info (bool)

  • preload_files (bool)

  • download (bool)

__init__(data_path, partition=None, label_type='phoneme', preprocessing_str='bads+headpos+sss+notch+bp+ds', tmin=0.0, tmax=0.5, include_run_keys=[], exclude_run_keys=[], exclude_tasks=[], standardize=True, clipping_boundary=10, channel_means=None, channel_stds=None, include_info=False, preload_files=True, download=True)[source]

LibriBrain phoneme classification dataset.

This dataset provides MEG data aligned to phoneme onsets for phoneme classification tasks. Each sample contains MEG data from tmin to tmax seconds relative to a phoneme onset.

Parameters:
  • data_path (str) – Path where you wish to store the dataset. The local dataset structure will follow the same BIDS-like structure as the HuggingFace repo:

        data_path/
        ├── {task}/                    # e.g., "Sherlock1"
        │   └── derivatives/
        │       ├── serialised/        # MEG data files
        │       │   └── sub-{subject}_ses-{session}_task-{task}_run-{run}_proc-{preprocessing_str}_meg.h5
        │       └── events/            # Event timing files
        │           └── sub-{subject}_ses-{session}_task-{task}_run-{run}_events.tsv

  • partition (str | None) – Convenient shortcut to specify the train/validation/test split. Instead of specifying run keys manually, you can use:
      - partition="train": all runs except the validation and test runs
      - partition="validation": ('0', '11', 'Sherlock1', '2')
      - partition="test": ('0', '12', 'Sherlock1', '2')

  • label_type (str) – Type of labels to return. Options:
      - "phoneme": return phoneme labels (e.g., 'aa', 'ae', 'ah', etc.)
      - "voicing": return voicing labels derived from the phonemes, indicating whether each phoneme is voiced or unvoiced

  • preprocessing_str (str | None) – By default, files are expected with the preprocessing string "bads+headpos+sss+notch+bp+ds", which indicates the preprocessing steps applied: bad channel removal, head position adjustment, signal-space separation, notch filtering, bandpass filtering, and downsampling.

  • tmin (float) – Start time of the sample in seconds relative to phoneme onset. For a phoneme at time T, the sample spans MEG data from T + tmin up to T + tmax.

  • tmax (float) – End time of the sample in seconds relative to phoneme onset. The number of timepoints per sample = int((tmax - tmin) * sfreq) where sfreq=250Hz.

  • include_run_keys (list[str]) – List of specific sessions to include. Format per session: ('0', '1', 'Sherlock1', '1') = Subject 0, Session 1, Task Sherlock1, Run 1. You can see all valid run keys by importing RUN_KEYS from pnpl.datasets.libribrain2025.constants.

  • exclude_run_keys (list[str]) – List of sessions to exclude (same format as include_run_keys).

  • exclude_tasks (list[str]) – List of task names to exclude (e.g., ['Sherlock1']).

  • standardize (bool) – Whether to z-score normalize each channel's MEG data using the mean and standard deviation computed across all included runs. Formula: normalized_data[channel] = (raw_data[channel] - channel_means[channel]) / channel_stds[channel]

  • clipping_boundary (float | None) – If specified, clips all values to [-clipping_boundary, clipping_boundary]. This can help with outliers. Set to None for no clipping.

  • channel_means (ndarray | None) – Pre-computed channel means for standardization. If provided along with channel_stds, these will be used instead of computing from the dataset.

  • channel_stds (ndarray | None) – Pre-computed channel standard deviations for standardization.

  • include_info (bool) – Whether to include additional info dict in each sample containing dataset name, subject, session, task, run, onset time, and full phoneme label (including word position indicators).

  • preload_files (bool) – Whether to “eagerly” download all dataset files from HuggingFace when the dataset object is created (True) or “lazily” download files on demand (False). We recommend leaving this as True unless you have a specific reason not to.

  • download (bool) – Whether to download files from HuggingFace if not found locally (True) or throw an error if files are missing locally (False).
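The standardize and clipping_boundary steps above can be sketched with plain NumPy. This is a minimal illustration of the per-channel z-scoring and clipping described in the parameter docs, not the dataset's actual implementation; the array values are fabricated.

```python
import numpy as np

# Fake (channels, time) MEG sample: 306 channels, 125 timepoints.
rng = np.random.default_rng(0)
raw = rng.normal(loc=5.0, scale=2.0, size=(306, 125))

# Per-channel statistics. The dataset computes these across all included
# runs; you can also pass them in as channel_means / channel_stds to reuse
# training-set statistics on a validation or test split.
channel_means = raw.mean(axis=1, keepdims=True)
channel_stds = raw.std(axis=1, keepdims=True)

# z-score each channel, then clip to [-clipping_boundary, clipping_boundary].
standardized = (raw - channel_means) / channel_stds
clipping_boundary = 10
clipped = np.clip(standardized, -clipping_boundary, clipping_boundary)

print(clipped.shape)  # (306, 125)
```

Reusing the training-set channel_means and channel_stds on held-out splits avoids leaking validation/test statistics into the normalization.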

Returns:

Data samples with shape (channels, time) where channels=306 MEG channels. Labels are integers corresponding to phoneme or voicing classes.
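The sample shape follows directly from tmin, tmax, and the 250 Hz sampling rate, per the formula given for tmax above. A quick sanity check using the default values:

```python
# Expected sample geometry for the defaults (tmin=0.0, tmax=0.5):
# timepoints = int((tmax - tmin) * sfreq), with sfreq = 250 Hz and
# 306 MEG channels per sample.
tmin, tmax, sfreq, n_channels = 0.0, 0.5, 250, 306

n_timepoints = int((tmax - tmin) * sfreq)
sample_shape = (n_channels, n_timepoints)
print(sample_shape)  # (306, 125)
```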

Methods

__init__(data_path[, partition, label_type, ...])

    LibriBrain phoneme classification dataset.

ensure_file_download(fpath, data_path)

    Class method to download a file using the LibriBrain download system without requiring dataset instantiation.

load_phonemes_from_tsv(subject, session, ...)

prefetch_files([get_event_files])

    Preload all required files in parallel.