pnpl.datasets.libribrain100.dataset.LibriBrain100
- class pnpl.datasets.libribrain100.dataset.LibriBrain100(data_path, task, partition=None, subjects='all', corpus='all', preprocessing_str='bads+headpos+sss+notch+bp+ds', preprocessing_config=None, include_run_keys=None, exclude_run_keys=None, exclude_tasks=None, standardize=True, clipping_boundary=10.0, channel_means=None, channel_stds=None, include_info=False, preload_files=False, download=True, preload_h5=False)
Task-driven dataset for the full LibriBrain100 release.
- Parameters:
data_path (str) – Local data directory. Files are arranged in a per-corpus BIDS-like layout that mirrors the Hugging Face tree (created lazily as files are downloaded).
task – Object implementing pnpl.tasks.base.TaskProtocol (e.g. pnpl.tasks.SpeechDetection, pnpl.tasks.PhonemeClassification, pnpl.tasks.WordClassification).
partition (PartitionArg) – "train", "validation", or "test". The aliases "val" and "valid" are accepted. None means no partition filter: only the explicit selectors are applied.
subjects (SubjectsArg) – Subject selector. Accepts "all" (default), "deep" (sub-0, the deep single-subject component), "broad" (sub-1..32, the broad multi-subject component), an int, a string id ("0" or "sub-0"), or a list / range of ids.
corpus (CorpusArg) – Corpus selector. Accepts "all" (default), "sherlock", "timit", "mocha", "podcasts" (aliases like "mocha-timit", "the_moth" accepted), or a list of those.
preprocessing_str (Optional[str]) – Preprocessing token used in derivative filenames; defaults to "bads+headpos+sss+notch+bp+ds".
preprocessing_config (Optional[Dict[str, Dict[str, Any]]])
include_run_keys, exclude_run_keys (Optional[Sequence[Sequence[str]]]) – 4-tuples (subject, session, task, run) for explicit inclusion/exclusion. Cannot be combined with partition.
exclude_tasks (Optional[Sequence[str]]) – Task tokens (e.g. ["Sherlock1"]) to drop.
standardize (bool)
clipping_boundary (Optional[float])
channel_means (ndarray | None)
channel_stds (ndarray | None) – See pnpl.datasets.mixins.StandardizationMixin.
include_info (bool) – If True, __getitem__ returns (x, y, info).
preload_files (bool) – Eagerly download every selected file at construction time (default False for LibriBrain100; the dataset is large enough that lazy fetching is the usual choice).
download (bool) – Enable downloading from Hugging Face.
preload_h5 (bool) – Read each H5 fully into RAM on first access.
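The selector values described above can be illustrated with a small, self-contained sketch. This is not the library's implementation: resolve_subjects and resolve_partition are hypothetical helpers, written only to show how the documented values ("all", "deep", "broad", ints, "0" / "sub-0", lists, ranges, and the "val" / "valid" aliases) could map to concrete subject ids and partition names. Rejecting ids outside 0..32 is an assumption of the sketch.

```python
from typing import List, Optional, Sequence, Union

# Documented layout: sub-0 is the deep component, sub-1..32 the broad one.
DEEP_SUBJECTS = [0]
BROAD_SUBJECTS = list(range(1, 33))

# Documented partition aliases.
PARTITION_ALIASES = {"val": "validation", "valid": "validation"}
PARTITIONS = ("train", "validation", "test")


def _parse_subject_id(value: Union[str, int]) -> int:
    """Hypothetical: accept 3, "3", or "sub-3" and return the integer id."""
    if isinstance(value, str):
        value = value[4:] if value.startswith("sub-") else value
    subject = int(value)
    if subject not in DEEP_SUBJECTS + BROAD_SUBJECTS:
        # Assumption of this sketch: ids outside 0..32 are rejected.
        raise ValueError(f"unknown subject id: {subject}")
    return subject


def resolve_subjects(
    subjects: Union[str, int, Sequence[Union[str, int]], range] = "all",
) -> List[int]:
    """Hypothetical: turn a subject selector into a sorted list of ids."""
    if isinstance(subjects, str):
        if subjects == "all":
            return DEEP_SUBJECTS + BROAD_SUBJECTS
        if subjects == "deep":
            return list(DEEP_SUBJECTS)
        if subjects == "broad":
            return list(BROAD_SUBJECTS)
        return [_parse_subject_id(subjects)]
    if isinstance(subjects, int):
        return [_parse_subject_id(subjects)]
    # Lists and ranges: resolve each element and drop duplicates.
    return sorted({_parse_subject_id(s) for s in subjects})


def resolve_partition(partition: Optional[str]) -> Optional[str]:
    """Hypothetical: map aliases to canonical names; None means no filter."""
    if partition is None:
        return None
    canonical = PARTITION_ALIASES.get(partition, partition)
    if canonical not in PARTITIONS:
        raise ValueError(f"unknown partition: {partition!r}")
    return canonical
```

With these helpers, resolve_subjects([3, "sub-1", "2"]) yields the ids 1, 2 and 3, and resolve_partition("val") yields "validation".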
Notes
The multi-subject (broad) data has no train partition by design; subjects="broad" + partition="train" raises ValueError. For SFT workflows on broad subjects, use partition="validation" as your fine-tuning training set and partition="test" for evaluation.
Multi-subject data was only collected with the Sherlock stimuli; subjects="broad" + corpus="timit" (or any non-Sherlock corpus) raises ValueError.
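The two constraints in these notes can be written as a pre-flight check. The sketch below is illustrative only: check_broad_selection is a hypothetical function, not part of pnpl, and it covers only the literal subjects="broad" case named in the notes. How the library treats corpus="all" or a list that mixes Sherlock with other corpora is not stated here, so the handling of those cases is an assumption.

```python
from typing import Optional, Sequence, Union


def check_broad_selection(
    subjects: str,
    partition: Optional[str] = None,
    corpus: Union[str, Sequence[str]] = "all",
) -> None:
    """Hypothetical check mirroring the two documented ValueError cases."""
    if subjects != "broad":
        return  # The notes only constrain the broad component.

    if partition == "train":
        raise ValueError(
            "broad subjects have no train partition; use "
            'partition="validation" for fine-tuning and "test" for evaluation'
        )

    corpora = [corpus] if isinstance(corpus, str) else list(corpus)
    # Assumption: "all" is allowed, and any explicitly named
    # non-Sherlock corpus is rejected, even inside a mixed list.
    non_sherlock = [c for c in corpora if c not in ("all", "sherlock")]
    if non_sherlock:
        raise ValueError(
            f"broad subjects were only recorded with Sherlock stimuli, got {non_sherlock}"
        )


# Fine-tuning setup suggested by the notes: validation for training, test for evaluation.
check_broad_selection("broad", partition="validation", corpus="sherlock")
check_broad_selection("broad", partition="test", corpus="sherlock")
```

Running such a check before constructing the dataset surfaces an invalid combination without touching the network.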
Example
>>> from pnpl.datasets import LibriBrain100
>>> from pnpl.tasks import SpeechDetection
>>> ds = LibriBrain100(
...     data_path="./data/LibriBrain100",
...     task=SpeechDetection(tmin=0.0, tmax=0.5),
...     partition="train",
... )
>>> x, y = ds[0]
- __init__(data_path, task, partition=None, subjects='all', corpus='all', preprocessing_str='bads+headpos+sss+notch+bp+ds', preprocessing_config=None, include_run_keys=None, exclude_run_keys=None, exclude_tasks=None, standardize=True, clipping_boundary=10.0, channel_means=None, channel_stds=None, include_info=False, preload_files=False, download=True, preload_h5=False)
- Parameters:
data_path (str)
partition (str | None)
subjects (str | int | Sequence[str | int] | range | None)
corpus (str | Sequence[str] | None)
preprocessing_str (str | None)
preprocessing_config (Dict[str, Dict[str, Any]] | None)
include_run_keys (Sequence[Sequence[str]] | None)
exclude_run_keys (Sequence[Sequence[str]] | None)
exclude_tasks (Sequence[str] | None)
standardize (bool)
clipping_boundary (float | None)
channel_means (ndarray | None)
channel_stds (ndarray | None)
include_info (bool)
preload_files (bool)
download (bool)
preload_h5 (bool)
Methods
__init__(data_path, task[, partition, ...])
calculate_standardization_params(h5_data_loader) – Calculate channel means and stds across all runs.
clip_sample(sample, boundary) – Clip sample values to [-boundary, boundary].
close_h5_files() – Close all open H5 file handles and drop preloaded arrays.
ensure_file(fpath) – Ensure a file exists locally, downloading if needed.
ensure_file_download(fpath, data_path[, repo_id]) – Class method to download a file without requiring dataset instantiation.
get_bids_raw_path(subject, session, task, run) – Construct path to raw BIDS MEG file.
get_calibration_files() – Get paths to Maxwell filter calibration files.
get_derivatives_path(subject, session[, ...]) – Construct path to derivatives directory.
get_events_path(subject, session, task, run) – Construct path to events TSV file.
get_h5_dataset(run_key) – Get (cached) H5 dataset for a run.
get_h5_path(subject, session, task, run[, ...]) – Construct path to H5 file.
get_headpos_path(subject, session, task, run) – Construct path to cached head position file.
get_preprocessed_path(subject, session, ...) – Construct path to preprocessed file in derivatives.
get_sfreq_from_h5(h5_path) – Get sampling frequency from H5 file.
init_continuous_h5([preload_h5]) – Initialize the H5 data cache.
load_continuous_window(subject, session, ...) – Load a time window from continuous H5 data.
load_continuous_window_from_sample(sample) – Load time window from a sample tuple.
load_head_positions(subject, session, task, run) – Load cached head positions from CSV file.
load_preprocessed_bids(subject, session, ...) – Load a preprocessed FIF file from the derivatives directory.
load_raw_bids(subject, session, task, run[, ...]) – Load raw MEG data from BIDS structure.
prefetch_files(file_paths) – Prefetch multiple files in parallel.
raw_bids_exists(subject, session, task, run) – Check if raw BIDS data exists for given identifiers.
setup_standardization([standardize, ...]) – Set up standardization parameters.
standardize(data) – Apply z-score normalization and optional clipping to data.
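The standardize and clip_sample entries describe per-channel z-scoring followed by optional clipping to [-boundary, boundary]. A minimal NumPy sketch of that computation is given below. It is a standalone illustration, not the methods' source: zscore_and_clip is a hypothetical function, and the (channels, time) array layout is an assumption.

```python
from typing import Optional

import numpy as np


def zscore_and_clip(
    data: np.ndarray,
    channel_means: np.ndarray,
    channel_stds: np.ndarray,
    clipping_boundary: Optional[float] = 10.0,
) -> np.ndarray:
    """Hypothetical sketch: z-score each channel, then optionally clip.

    Assumes `data` has shape (channels, time) and the statistics have
    shape (channels,).
    """
    # Add a trailing axis so the per-channel statistics broadcast over time.
    means = np.asarray(channel_means, dtype=float)[:, None]
    stds = np.asarray(channel_stds, dtype=float)[:, None]

    z = (np.asarray(data, dtype=float) - means) / stds

    # clipping_boundary=None disables clipping, matching the optional type.
    if clipping_boundary is not None:
        z = np.clip(z, -clipping_boundary, clipping_boundary)
    return z


# Two channels, two time points; the 100.0 outlier is clipped to 10.0.
example = zscore_and_clip(
    np.array([[1.0, 3.0], [100.0, 0.0]]),
    channel_means=np.array([2.0, 0.0]),
    channel_stds=np.array([1.0, 1.0]),
)
```

Clipping after standardization bounds the influence of artifact-heavy samples in units of standard deviations, which is why the default boundary of 10.0 is expressed on the z-score scale.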
Attributes
HUGGINGFACE_FALLBACK_REPOS
HUGGINGFACE_REPO
broadcasted_means
broadcasted_stds
channel_means
channel_stds
label_info
n_channels – Number of MEG channels (306 for MEGIN TRIUX Neo).
n_times
records – The manifest records actually loaded by this dataset.