pnpl.datasets.libribrain100.dataset.LibriBrain100

class pnpl.datasets.libribrain100.dataset.LibriBrain100(data_path, task, partition=None, subjects='all', corpus='all', preprocessing_str='bads+headpos+sss+notch+bp+ds', preprocessing_config=None, include_run_keys=None, exclude_run_keys=None, exclude_tasks=None, standardize=True, clipping_boundary=10.0, channel_means=None, channel_stds=None, include_info=False, preload_files=False, download=True, preload_h5=False)

Task-driven dataset for the full LibriBrain100 release.

Parameters:

data_path (str) – Local data directory. Files are arranged in a per-corpus BIDS-like layout that mirrors the Hugging Face tree (created lazily as files are downloaded).
task – Object implementing pnpl.tasks.base.TaskProtocol (e.g. pnpl.tasks.SpeechDetection, pnpl.tasks.PhonemeClassification, pnpl.tasks.WordClassification).
partition (PartitionArg) – "train", "validation", or "test". The aliases "val" and "valid" are accepted. None applies no partition filter, so only the explicit selectors are used.
subjects (SubjectsArg) – Subject selector. Accepts "all" (default), "deep" (sub-0, the deep single-subject component), "broad" (sub-1..32, the broad multi-subject component), an int, a string id ("0" or "sub-0"), or a list / range of ids.
corpus (CorpusArg) – Corpus selector. Accepts "all" (default), "sherlock", "timit", "mocha", "podcasts" (aliases like "mocha-timit", "the_moth" accepted), or a list of those.
preprocessing_str (Optional[str]) – Preprocessing token used in derivative filenames; defaults to "bads+headpos+sss+notch+bp+ds".
preprocessing_config (Optional[Dict[str, Dict[str, Any]]])
include_run_keys, exclude_run_keys (Optional[Sequence[Sequence[str]]]) – 4-tuples (subject, session, task, run) for explicit inclusion or exclusion of runs. Cannot be combined with partition.
exclude_tasks (Optional[Sequence[str]]) – Task tokens (e.g. ["Sherlock1"]) to drop.
standardize (bool)
clipping_boundary (Optional[float])
channel_means, channel_stds (ndarray | None) – See pnpl.datasets.mixins.StandardizationMixin.
include_info (bool) – If True, __getitem__ returns (x, y, info).
preload_files (bool) – Eagerly download every selected file at construction time. Defaults to False for LibriBrain100 because the dataset is large enough that lazy fetching is the usual choice.
download (bool) – Enable downloading from Hugging Face.
preload_h5 (bool) – Read each H5 fully into RAM on first access.
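
The run-key and task selectors can be pictured as plain tuple filtering. The sketch below is not the library's implementation: the helper name select_runs is hypothetical, the key values are illustrative, and it assumes that exclusions are applied after inclusions.

```python
from typing import Iterable, List, Optional, Sequence, Tuple

RunKey = Tuple[str, str, str, str]  # (subject, session, task, run)


def select_runs(
    available: Iterable[Sequence[str]],
    include_run_keys: Optional[Sequence[Sequence[str]]] = None,
    exclude_run_keys: Optional[Sequence[Sequence[str]]] = None,
    exclude_tasks: Optional[Sequence[str]] = None,
) -> List[RunKey]:
    """Hypothetical helper: filter run keys as the selectors are described."""
    include = {tuple(k) for k in include_run_keys} if include_run_keys else None
    exclude = {tuple(k) for k in (exclude_run_keys or ())}
    dropped_tasks = set(exclude_tasks or ())

    selected: List[RunKey] = []
    for raw_key in available:
        key = tuple(raw_key)
        if len(key) != 4:
            raise ValueError(f"run key must be a 4-tuple, got {key!r}")
        if include is not None and key not in include:
            continue  # an explicit include list acts as a whitelist
        if key in exclude or key[2] in dropped_tasks:
            continue  # key[2] is the task token, e.g. "Sherlock1"
        selected.append(key)
    return selected


# Illustrative keys only; real session, task, and run tokens may differ.
runs = [
    ("0", "1", "Sherlock1", "1"),
    ("0", "2", "Sherlock1", "1"),
    ("0", "1", "Sherlock2", "1"),
]
kept = select_runs(
    runs,
    exclude_run_keys=[("0", "2", "Sherlock1", "1")],
    exclude_tasks=["Sherlock2"],
)
print(kept)
```

Passing the same lists as include_run_keys, exclude_run_keys, and exclude_tasks to the constructor is the documented way to apply such filters; the helper only makes the intended semantics concrete.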

Notes

The multi-subject (broad) data has no train partition by design; subjects="broad" + partition="train" raises ValueError. For supervised fine-tuning (SFT) workflows on broad subjects, use partition="validation" as the fine-tuning training set and partition="test" for evaluation.
Multi-subject data was only collected with the Sherlock stimuli; subjects="broad" + corpus="timit" (or any non-Sherlock corpus) raises ValueError.
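
The two constraints above can be checked before constructing a dataset. The following is a minimal sketch, not the library's own validation: the helper names are hypothetical, a subjects value of None is not handled, and it assumes that any selection made up only of broad subjects (not just the literal "broad") triggers the errors.

```python
from typing import Iterable, Optional, Set, Union

DEEP = {0}                  # sub-0, the deep single-subject component
BROAD = set(range(1, 33))   # sub-1..32, the broad multi-subject component
PARTITION_ALIASES = {"val": "validation", "valid": "validation"}


def normalize_subjects(subjects: Union[str, int, Iterable]) -> Set[int]:
    """Map a subject selector to a set of integer ids (hypothetical helper)."""
    if isinstance(subjects, str):
        if subjects == "all":
            return DEEP | BROAD
        if subjects == "deep":
            return set(DEEP)
        if subjects == "broad":
            return set(BROAD)
        # Accept both "0" and "sub-0".
        digits = subjects[4:] if subjects.startswith("sub-") else subjects
        return {int(digits)}
    if isinstance(subjects, int):
        return {subjects}
    ids: Set[int] = set()
    for item in subjects:  # list, tuple, or range of ids
        ids |= normalize_subjects(item)
    return ids


def check_selection(subjects, partition: Optional[str], corpus="all") -> None:
    """Raise ValueError for the combinations the Notes describe as invalid."""
    ids = normalize_subjects(subjects)
    partition = PARTITION_ALIASES.get(partition, partition)
    corpora = [corpus] if isinstance(corpus, str) else list(corpus)
    if ids and ids <= BROAD:
        if partition == "train":
            raise ValueError("broad subjects have no train partition")
        if any(c not in ("all", "sherlock") for c in corpora):
            raise ValueError("broad subjects were recorded with Sherlock only")


check_selection("broad", "val")       # fine: alias for "validation"
check_selection("deep", "train")      # fine: the deep subject has a train split
```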

Example

>>> from pnpl.datasets import LibriBrain100
>>> from pnpl.tasks import SpeechDetection
>>> ds = LibriBrain100(
...     data_path="./data/LibriBrain100",
...     task=SpeechDetection(tmin=0.0, tmax=0.5),
...     partition="train",
... )
>>> x, y = ds[0]
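
For the broad subjects, the Notes recommend fine-tuning on the validation partition and evaluating on the test partition. A sketch of that setup follows; broad_finetune_splits is a hypothetical convenience function, and the dataset_cls argument exists only so the sketch can be exercised without pnpl installed.

```python
def broad_finetune_splits(data_path, task, dataset_cls=None):
    """Build (finetune, evaluation) datasets for the broad subjects."""
    if dataset_cls is None:
        # Imported lazily so this sketch can be read and tested without pnpl.
        from pnpl.datasets import LibriBrain100 as dataset_cls
    common = dict(
        data_path=data_path,
        task=task,
        subjects="broad",
        corpus="sherlock",  # broad data was only collected with Sherlock stimuli
    )
    finetune = dataset_cls(partition="validation", **common)
    evaluation = dataset_cls(partition="test", **common)
    return finetune, evaluation


# Intended usage (requires pnpl and downloads data on first access):
#   from pnpl.tasks import SpeechDetection
#   finetune, evaluation = broad_finetune_splits(
#       "./data/LibriBrain100", SpeechDetection(tmin=0.0, tmax=0.5)
#   )
```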

__init__(data_path, task, partition=None, subjects='all', corpus='all', preprocessing_str='bads+headpos+sss+notch+bp+ds', preprocessing_config=None, include_run_keys=None, exclude_run_keys=None, exclude_tasks=None, standardize=True, clipping_boundary=10.0, channel_means=None, channel_stds=None, include_info=False, preload_files=False, download=True, preload_h5=False)

Parameters:

data_path (str)
partition (str | None)
subjects (str | int | Sequence[str | int] | range | None)
corpus (str | Sequence[str] | None)
preprocessing_str (str | None)
preprocessing_config (Dict[str, Dict[str, Any]] | None)
include_run_keys (Sequence[Sequence[str]] | None)
exclude_run_keys (Sequence[Sequence[str]] | None)
exclude_tasks (Sequence[str] | None)
standardize (bool)
clipping_boundary (float | None)
channel_means (ndarray | None)
channel_stds (ndarray | None)
include_info (bool)
preload_files (bool)
download (bool)
preload_h5 (bool)

Methods

`__init__`(data_path, task[, partition, ...])
`calculate_standardization_params`(h5_data_loader)	Calculate channel means and stds across all runs.
`clip_sample`(sample, boundary)	Clip sample values to [-boundary, boundary].
`close_h5_files`()	Close all open H5 file handles and drop preloaded arrays.
`ensure_file`(fpath)	Ensure a file exists locally, downloading if needed.
`ensure_file_download`(fpath, data_path[, repo_id])	Class method to download a file without requiring dataset instantiation.
`get_bids_raw_path`(subject, session, task, run)	Construct path to raw BIDS MEG file.
`get_calibration_files`()	Get paths to Maxwell filter calibration files.
`get_derivatives_path`(subject, session[, ...])	Construct path to derivatives directory.
`get_events_path`(subject, session, task, run)	Construct path to events TSV file.
`get_h5_dataset`(run_key)	Get (cached) H5 dataset for a run.
`get_h5_path`(subject, session, task, run[, ...])	Construct path to H5 file.
`get_headpos_path`(subject, session, task, run)	Construct path to cached head position file.
`get_preprocessed_path`(subject, session, ...)	Construct path to preprocessed file in derivatives.
`get_sfreq_from_h5`(h5_path)	Get sampling frequency from H5 file.
`init_continuous_h5`([preload_h5])	Initialize the H5 data cache.
`load_continuous_window`(subject, session, ...)	Load a time window from continuous H5 data.
`load_continuous_window_from_sample`(sample)	Load time window from a sample tuple.
`load_head_positions`(subject, session, task, run)	Load cached head positions from CSV file.
`load_preprocessed_bids`(subject, session, ...)	Load a preprocessed FIF file from the derivatives directory.
`load_raw_bids`(subject, session, task, run[, ...])	Load raw MEG data from BIDS structure.
`prefetch_files`(file_paths)	Prefetch multiple files in parallel.
`raw_bids_exists`(subject, session, task, run)	Check if raw BIDS data exists for given identifiers.
`setup_standardization`([standardize, ...])	Set up standardization parameters.
`standardize`(data)	Apply z-score normalization and optional clipping to data.
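
Because close_h5_files() releases open handles and preloaded arrays, it can be convenient to guarantee the call with a context manager. The wrapper below is a hypothetical helper, not part of pnpl; it relies only on the close_h5_files() method listed above.

```python
from contextlib import contextmanager


@contextmanager
def h5_closed_after(dataset):
    """Yield the dataset and always call close_h5_files() on exit."""
    try:
        yield dataset
    finally:
        dataset.close_h5_files()


# Intended usage with a constructed dataset `ds`:
#   with h5_closed_after(ds) as open_ds:
#       x, y = open_ds[0]
```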

Attributes

`HUGGINGFACE_FALLBACK_REPOS`
`HUGGINGFACE_REPO`
`broadcasted_means`
`broadcasted_stds`
`channel_means`
`channel_stds`
`label_info`
`n_channels`	Number of MEG channels (306 for MEGIN TRIUX Neo).
`n_times`
`records`	The manifest records actually loaded by this dataset.
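
The standardization attributes can be related to the standardize and clip_sample methods with a small NumPy sketch. It is an illustration under assumptions, not the mixin's code: it assumes samples shaped (n_channels, n_times), per-channel statistics broadcast along the time axis, and clipping applied after z-scoring. The function name standardize_window is hypothetical.

```python
import numpy as np


def standardize_window(x, channel_means, channel_stds, clipping_boundary=10.0):
    """Z-score each channel, then optionally clip to [-boundary, boundary]."""
    # Add a trailing axis so the per-channel statistics broadcast over time.
    means = np.asarray(channel_means, dtype=float)[:, None]
    stds = np.asarray(channel_stds, dtype=float)[:, None]
    z = (np.asarray(x, dtype=float) - means) / stds
    if clipping_boundary is not None:
        z = np.clip(z, -clipping_boundary, clipping_boundary)
    return z


# Two channels, three time points; the last value is an outlier.
x = [[0.0, 2.0, 4.0], [10.0, 10.0, 1000.0]]
z = standardize_window(x, channel_means=[2.0, 10.0], channel_stds=[2.0, 1.0])
print(z)
```

With the default boundary of 10.0, the outlier in the second channel (a z-score of 990) is clipped to 10.0, while the first channel becomes -1, 0, 1.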
