pnpl.datasets.libribrain2025.speech_dataset.LibriBrainSpeech#
- class pnpl.datasets.libribrain2025.speech_dataset.LibriBrainSpeech(data_path, partition=None, preprocessing_str='bads+headpos+sss+notch+bp+ds', tmin=0.0, tmax=0.5, include_run_keys=[], exclude_run_keys=[], exclude_tasks=[], standardize=True, clipping_boundary=10, channel_means=None, channel_stds=None, include_info=False, oversample_silence_jitter=0, preload_files=True, stride=None, download=True)[source]#
- Parameters:
data_path (str)
partition (str | None)
preprocessing_str (str | None)
tmin (float)
tmax (float)
include_run_keys (list[str])
exclude_run_keys (list[str])
exclude_tasks (list[str])
standardize (bool)
clipping_boundary (float | None)
channel_means (ndarray | None)
channel_stds (ndarray | None)
include_info (bool)
oversample_silence_jitter (int)
preload_files (bool)
stride (int | None)
download (bool)
- __init__(data_path, partition=None, preprocessing_str='bads+headpos+sss+notch+bp+ds', tmin=0.0, tmax=0.5, include_run_keys=[], exclude_run_keys=[], exclude_tasks=[], standardize=True, clipping_boundary=10, channel_means=None, channel_stds=None, include_info=False, oversample_silence_jitter=0, preload_files=True, stride=None, download=True)[source]#
LibriBrain speech vs silence classification dataset.
This dataset provides MEG data segmented into time windows for binary classification of speech vs silence. The dataset slides a time window across the continuous MEG data and labels each window based on whether it contains predominantly speech or silence.
- Parameters:
data_path (str) – Path where you wish to store the dataset. The local dataset structure will follow the same BIDS-like structure as the HuggingFace repo:
```
data_path/
├── {task}/                          # e.g., "Sherlock1"
│   └── derivatives/
│       ├── serialised/              # MEG data files
│       │   └── sub-{subject}_ses-{session}_task-{task}_run-{run}_proc-{preprocessing_str}_meg.h5
│       └── events/                  # Event files
│           └── sub-{subject}_ses-{session}_task-{task}_run-{run}_events.tsv
```
partition (str | None) – Convenient shortcut to specify the train/validation/test split. Use "train", "validation", or "test". Instead of specifying run keys manually, you can use:
- partition="train": all runs except the validation and test runs
- partition="validation": ('0', '11', 'Sherlock1', '2')
- partition="test": ('0', '12', 'Sherlock1', '2')
preprocessing_str (str | None) – Preprocessing string expected in the serialised file names. The default, "bads+headpos+sss+notch+bp+ds", indicates that the data has undergone bad-channel removal, head-position adjustment, signal-space separation (SSS), notch filtering, bandpass filtering, and downsampling.
tmin (float) – Start time of the sample in seconds relative to the sliding window start. Together with tmax, defines the time window size for each sample.
tmax (float) – End time of the sample in seconds relative to the sliding window start. The number of timepoints per sample = int((tmax - tmin) * sfreq) where sfreq=250Hz. E.g., tmin=0, tmax=0.8 yields 200 timepoints per sample.
include_run_keys (list[str]) – List of specific run keys to include. Each run key has the format ('0', '1', 'Sherlock1', '1') = Subject 0, Session 1, Task Sherlock1, Run 1. You can see all valid run keys by importing RUN_KEYS from pnpl.datasets.libribrain2025.constants.
exclude_run_keys (list[str]) – List of sessions to exclude (same format as include_run_keys).
exclude_tasks (list[str]) – List of task names to exclude (e.g., [‘Sherlock1’]).
standardize (bool) – Whether to z-score normalize each channel’s MEG data using mean and std computed across all included runs. Formula: normalized_data[channel] = (raw_data[channel] - channel_mean[channel]) / channel_std[channel]
clipping_boundary (float | None) – If specified, clips all values to [-clipping_boundary, clipping_boundary]. This can help with outliers. Set to None for no clipping.
channel_means (ndarray | None) – Pre-computed channel means for standardization. If provided along with channel_stds, these will be used instead of computing from the dataset.
channel_stds (ndarray | None) – Pre-computed channel standard deviations for standardization.
include_info (bool) – Whether to include additional info dict in each sample containing dataset name, subject, session, task, run, and onset time of the sample.
oversample_silence_jitter (int) – Since the dataset is quite unbalanced (more speech than silence), you may wish to oversample the silent portions during training. This parameter allows you to specify a different stride for silent portions only, effectively oversampling them. Set to 0 for no oversampling.
preload_files (bool) – Whether to “eagerly” download all dataset files from HuggingFace when the dataset object is created (True) or “lazily” download files on demand (False). We recommend leaving this as True unless you have a specific reason not to.
stride (int | None) – Controls how far (in time) the sliding window moves between consecutive samples. By default the window advances by a full window length (time_window_samples, i.e. tmax - tmin), producing non-overlapping windows; specify a smaller stride to obtain overlapping windows. If None, defaults to time_window_samples (no overlap).
download (bool) – Whether to download files from HuggingFace if not found locally (True) or throw an error if files are missing locally (False).
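To make the windowing parameters concrete, here is a small self-contained sketch (plain Python, no pnpl required) of how tmin, tmax, and stride determine the number of timepoints per sample and the number of windows over a recording. The 250 Hz sampling rate comes from the tmax description above; the exact window-count formula and the function name are illustrative assumptions, not read from the library source.

```python
def window_arithmetic(tmin, tmax, sfreq=250, n_total_samples=10_000, stride=None):
    """Return (timepoints per window, number of windows) for a recording
    of n_total_samples timepoints. Illustrative sketch, not pnpl code."""
    # Timepoints per sample, as documented for tmax:
    time_window_samples = int((tmax - tmin) * sfreq)
    if stride is None:                  # default: non-overlapping windows
        stride = time_window_samples
    n_windows = (n_total_samples - time_window_samples) // stride + 1
    return time_window_samples, n_windows

# tmin=0, tmax=0.8 -> 200 timepoints per sample, matching the docs above.
print(window_arithmetic(0.0, 0.8))                # (200, 50)
# A smaller stride yields overlapping windows and therefore more samples.
print(window_arithmetic(0.0, 0.8, stride=100))    # (200, 99)
```

A smaller stride trades more training samples (with correlated, overlapping windows) against longer epochs.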
- Returns:
Data samples with shape (channels, time) where channels=306 MEG channels. Labels are arrays indicating speech (1) vs silence (0) for each timepoint in the sample.
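Since the returned label is an array with one speech/silence value per timepoint, a common first step is to collapse it to a single window-level label. A minimal numpy sketch of a majority-vote rule matching the "predominantly speech or silence" criterion above (the function name and the 0.5 threshold are assumptions, not taken from the library):

```python
import numpy as np

def window_label(timepoint_labels, threshold=0.5):
    """Collapse per-timepoint speech (1) / silence (0) labels into one
    window-level label by majority vote. The threshold is an assumed cutoff."""
    return int(timepoint_labels.mean() > threshold)

# A 125-timepoint window (tmin=0.0, tmax=0.5 at 250 Hz), mostly speech:
labels = np.concatenate([np.ones(100), np.zeros(25)])
print(window_label(labels))  # 1 (predominantly speech)
```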
Methods
__init__(data_path[, partition, ...]) – LibriBrain speech vs silence classification dataset.
ensure_file_download(fpath, data_path) – Class method to download a file using the sophisticated LibriBrain download system without requiring dataset instantiation.
get_speech_silence_labels_for_session(...)
prefetch_files([get_event_files]) – Preload all required files in parallel.
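The standardize, channel_means, channel_stds, and clipping_boundary parameters together describe a per-channel z-scoring followed by optional clipping. A self-contained numpy sketch of that transform on a (channels, time) sample as described in the Returns section (the function name is illustrative, not part of the pnpl API):

```python
import numpy as np

def standardize_sample(sample, channel_means, channel_stds, clipping_boundary=10.0):
    """Apply the per-channel z-scoring described in the docs:
    normalized[channel] = (raw[channel] - mean[channel]) / std[channel],
    then optionally clip values to [-clipping_boundary, clipping_boundary]."""
    normalized = (sample - channel_means[:, None]) / channel_stds[:, None]
    if clipping_boundary is not None:
        normalized = np.clip(normalized, -clipping_boundary, clipping_boundary)
    return normalized

# Fake (306 channels, 125 timepoints) sample, matching the Returns description.
rng = np.random.default_rng(0)
sample = rng.normal(loc=5.0, scale=2.0, size=(306, 125))
means = sample.mean(axis=1)   # stand-ins for channel_means / channel_stds
stds = sample.std(axis=1)
out = standardize_sample(sample, means, stds)
print(out.shape)  # (306, 125); each channel now has mean ~0, std ~1
```

Passing channel_means and channel_stds computed on the training partition to the validation and test datasets avoids leaking statistics across splits.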