LibriBrain100#
pnpl.datasets.LibriBrain100 is the unified loader for the full LibriBrain release. Conceptually it is the union of the original pnpl/LibriBrain Hugging Face dataset and the extension repository pnpl/LibriBrain2. There is no separate LibriBrain100 repo, and the library hides the split from normal users: pass a record’s identifier and the loader fetches it from whichever underlying repo owns it.
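The routing described above can be pictured as a lookup from a record identifier to the repository that holds it. The sketch below is only an illustration and not the library's implementation: `resolve_repo`, the `OWNERSHIP` table, and the assignment of the two example keys to repositories are hypothetical. Only the two repository names and the four-string key shape (as used by `include_run_keys` on this page) come from the text.

```python
from typing import Dict, Tuple

# Four-string record identifier, in the shape used by include_run_keys.
RunKey = Tuple[str, str, str, str]

# Hypothetical ownership table. Which repo really owns which record is
# decided by the manifest inside pnpl; these two rows are made up.
OWNERSHIP: Dict[RunKey, str] = {
    ("0", "1", "Sherlock1", "1"): "pnpl/LibriBrain",
    ("0", "1", "TheMoth", "1"): "pnpl/LibriBrain2",
}


def resolve_repo(run_key: RunKey, ownership: Dict[RunKey, str] = OWNERSHIP) -> str:
    """Return the repository id that holds `run_key`, or raise KeyError."""
    try:
        return ownership[run_key]
    except KeyError:
        raise KeyError(f"unknown record: {run_key!r}") from None


print(resolve_repo(("0", "1", "TheMoth", "1")))
```

The caller never names a repository; it only supplies the record key, which is the behaviour the paragraph above describes.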
The release is structured around a deep / broad distinction:
- Subject 0 (deep, ~80 h): the entire Sherlock Holmes canon (Sherlock1..Sherlock9) plus TIMIT, MOCHA-TIMIT, and 30 ‘The Moth’ podcast stories.
- Subjects 1–32 (broad, ~44 min each): Sherlock1 ses-11 and ses-12 only, the same chapters that have always been the validation / test split for the deep subject.
All recordings come from a MEGIN TRIUX™ Neo system (306 sensors, 102 magnetometers + 204 gradiometers); the serialised H5 files are downsampled to 250 Hz.
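As a quick sanity check on the figures above, the arithmetic can be worked through directly. The inputs are the approximate numbers quoted in the text, so the totals are approximate too.

```python
# Figures quoted in the text above (durations are approximate).
MAGNETOMETERS = 102
GRADIOMETERS = 204
DEEP_HOURS = 80            # subject 0
BROAD_SUBJECTS = 32        # subjects 1..32
BROAD_MINUTES_EACH = 44

total_sensors = MAGNETOMETERS + GRADIOMETERS
broad_hours = BROAD_SUBJECTS * BROAD_MINUTES_EACH / 60
total_hours = DEEP_HOURS + broad_hours

print(total_sensors)           # 306
print(round(broad_hours, 1))   # 23.5
print(round(total_hours, 1))   # 103.5
```

The broad component therefore adds roughly 23.5 hours across 32 subjects, which puts the full release on the order of 100 hours.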
Quickstart#
from pnpl.datasets import LibriBrain100
from pnpl.tasks import SpeechDetection
ds = LibriBrain100(
    data_path="./data/LibriBrain100",
    task=SpeechDetection(tmin=0.0, tmax=0.5),
    partition="train",
)
x, y = ds[0]  # x: (channels, time), y: per-time-step speech label
The first construction downloads any missing files from Hugging Face; subsequent calls read the local cache.
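To get a feel for the size of `x` in the quickstart, the window length can be estimated from the 250 Hz sample rate. This is a back-of-the-envelope sketch, not a statement of what the loader returns: it assumes a half-open window `[tmin, tmax)` and that all 306 sensors are kept, and `window_samples` is a hypothetical helper. The loader may handle the endpoint or channel selection differently.

```python
SAMPLE_RATE_HZ = 250   # serialised H5 files are downsampled to 250 Hz
N_SENSORS = 306        # 102 magnetometers + 204 gradiometers


def window_samples(tmin: float, tmax: float, sfreq: int = SAMPLE_RATE_HZ) -> int:
    """Number of samples in [tmin, tmax), assuming a half-open window."""
    return int(round((tmax - tmin) * sfreq))


n_time = window_samples(0.0, 0.5)
expected_shape = (N_SENSORS, n_time)
print(expected_shape)  # (306, 125)
```

Comparing `x.shape` from a real sample against this estimate is a cheap way to confirm which convention the loader actually uses.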
Selectors#
Two new selectors compose with the existing partition /
include_run_keys / exclude_* arguments:
LibriBrain100(
    data_path=...,
    task=...,
    partition="train"|"validation"|"test"|None,
    subjects="all"|"deep"|"broad"|0|"sub-0"|[1,2,3]|range(1, 33),
    corpus="all"|"sherlock"|"timit"|"mocha"|"podcasts"|[...],
)
subjects= accepts:
| Form | Meaning |
|---|---|
| "all" | every subject (sub-0 + sub-1..32) |
| "deep" | sub-0 only |
| "broad" | sub-1..32 only |
| 0 or "sub-0" | a single subject |
| [1, 2, 3] or range(1, 33) | a list of subjects |
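One way to picture how these forms collapse to a concrete subject list is the normaliser below. It is a minimal sketch under stated assumptions, not the library's code: `normalize_subjects` and `_one` are hypothetical names, and the error types are a guess.

```python
from typing import Iterable, List, Union

DEEP_SUBJECTS = [0]
BROAD_SUBJECTS = list(range(1, 33))  # sub-1..sub-32


def _one(item: Union[int, str]) -> int:
    """Turn a single subject such as 3 or "sub-3" into an int index."""
    if isinstance(item, bool):
        raise TypeError("booleans are not valid subjects")
    if isinstance(item, int):
        idx = item
    elif isinstance(item, str) and item.startswith("sub-") and item[4:].isdigit():
        idx = int(item[4:])
    else:
        raise ValueError(f"unrecognised subject: {item!r}")
    if not 0 <= idx <= 32:
        raise ValueError(f"subject out of range: {idx}")
    return idx


def normalize_subjects(spec: Union[int, str, Iterable[Union[int, str]]]) -> List[int]:
    """Hypothetical normaliser: map any accepted form to sorted subject indices."""
    if isinstance(spec, str):
        if spec == "all":
            return DEEP_SUBJECTS + BROAD_SUBJECTS
        if spec == "deep":
            return list(DEEP_SUBJECTS)
        if spec == "broad":
            return list(BROAD_SUBJECTS)
        return [_one(spec)]
    if isinstance(spec, int):
        return [_one(spec)]
    # Lists, ranges and other iterables: de-duplicate and sort.
    return sorted({_one(item) for item in spec})


print(normalize_subjects("deep"))     # [0]
print(normalize_subjects([3, "sub-1", 2]))  # [1, 2, 3]
```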
corpus= accepts "all" (default), "sherlock", "timit",
"mocha", "podcasts", or a list of those. Aliases like
"mocha-timit", "mochatimit", "the_moth", "moth" are accepted
for convenience.
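Alias handling of this kind is usually a small lookup table applied before validation. The sketch below assumes that the MOCHA spellings map to "mocha" and the Moth spellings map to "podcasts"; that mapping is inferred from the corpus descriptions on this page, and `normalize_corpus` is a hypothetical name rather than a pnpl function.

```python
from typing import Iterable, List, Union

CANONICAL = ("sherlock", "timit", "mocha", "podcasts")

# Assumed alias targets; the library's real table may contain more entries.
ALIASES = {
    "mocha-timit": "mocha",
    "mochatimit": "mocha",
    "the_moth": "podcasts",
    "moth": "podcasts",
}


def normalize_corpus(spec: Union[str, Iterable[str]] = "all") -> List[str]:
    """Hypothetical normaliser: resolve aliases and expand "all"."""
    names = [spec] if isinstance(spec, str) else list(spec)
    if "all" in names:
        return list(CANONICAL)
    resolved: List[str] = []
    for name in names:
        canonical = ALIASES.get(name, name)
        if canonical not in CANONICAL:
            raise ValueError(f"unknown corpus: {name!r}")
        if canonical not in resolved:  # keep order, drop duplicates
            resolved.append(canonical)
    return resolved


print(normalize_corpus(["timit", "mocha-timit"]))  # ['timit', 'mocha']
```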
Validation rules raised early#
The loader rejects two combinations up front rather than silently returning an empty dataset:
- subjects="broad" + corpus="timit" (or any non-Sherlock corpus): multi-subject data was only collected with the Sherlock stimuli.
- subjects="broad" + partition="train": no train partition is assigned to the broad component by default. For SFT workflows on the broad subjects, use partition="validation" as your fine-tuning training set and partition="test" for evaluation.
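The two rules above amount to a small pre-flight check. The following is a sketch of that logic only: `check_selection` is a hypothetical function, the use of `ValueError` is an assumption about the exception type, and treating corpus="all" as acceptable for the broad subjects is also an assumption, since the text only names non-Sherlock corpora as rejected.

```python
from typing import Iterable, Optional, Union


def check_selection(
    subjects: object,
    corpus: Union[str, Iterable[str]] = "all",
    partition: Optional[str] = None,
) -> None:
    """Raise early for the two combinations the text says are rejected."""
    # Only the literal "broad" selector is modelled here.
    if not (isinstance(subjects, str) and subjects == "broad"):
        return
    corpora = [corpus] if isinstance(corpus, str) else list(corpus)
    non_sherlock = [c for c in corpora if c not in ("all", "sherlock")]
    if non_sherlock:
        raise ValueError(
            f"broad subjects only have Sherlock data; got corpus={non_sherlock}"
        )
    if partition == "train":
        raise ValueError(
            "broad subjects have no train partition; "
            "use partition='validation' or partition='test'"
        )


check_selection("broad", "sherlock", "validation")  # passes silently
```

Failing at construction time surfaces the mistake immediately, whereas an empty dataset would typically only be noticed later in a training loop.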
Common usage#
# Subject 0 only: the deep component (~80 hours)
ds = LibriBrain100(data_path=..., task=..., partition="train",
                   subjects="deep")
# Only TIMIT, only on subject 0:
ds = LibriBrain100(data_path=..., task=...,
                   subjects="deep", corpus="timit")
# Multiple corpora at once, still only on subject 0:
ds = LibriBrain100(data_path=..., task=...,
                   subjects="deep", corpus=["timit", "mocha"])
Per-task wrappers#
If you do not want to instantiate a task object directly:
from pnpl.datasets import (
    LibriBrain100Speech,
    LibriBrain100Phoneme,
    LibriBrain100Word,
)
speech_ds = LibriBrain100Speech(data_path="./data/LibriBrain100",
                                partition="train", subjects="deep",
                                corpus="sherlock")
phoneme_ds = LibriBrain100Phoneme(data_path="./data/LibriBrain100",
                                  partition="validation",
                                  subjects="deep", corpus="sherlock",
                                  tmin=-0.2, tmax=0.6)
word_ds = LibriBrain100Word(data_path="./data/LibriBrain100",
                            include_run_keys=[("0", "1", "TheMoth", "1")])
The wrappers take the same subjects / corpus arguments as
LibriBrain100.
Notes on partial uploads#
LibriBrain2 is uploaded incrementally. While files are still landing, records that aren’t yet downloadable are skipped with a single grouped warning at construction time, so the loader stays usable as the upload progresses. If the warning shows up where you don’t expect it, run
from pnpl.datasets.libribrain100 import RUN_RECORDS
print(len(RUN_RECORDS), "records in manifest")
to confirm what the manifest expects.
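The "single grouped warning" behaviour described above can be sketched as follows. This is an illustration of the pattern only, not pnpl's code: `drop_unavailable` and the `is_available` callback are hypothetical, and the wording of the warning is invented.

```python
import warnings
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def drop_unavailable(records: Iterable[T], is_available: Callable[[T], bool]) -> List[T]:
    """Keep available records and emit one warning summarising the rest."""
    kept: List[T] = []
    skipped: List[T] = []
    for record in records:
        (kept if is_available(record) else skipped).append(record)
    if skipped:
        preview = ", ".join(str(r) for r in skipped[:3])
        more = "" if len(skipped) <= 3 else f" (+{len(skipped) - 3} more)"
        # One warning for the whole batch instead of one per record.
        warnings.warn(
            f"Skipping {len(skipped)} record(s) not yet downloadable: {preview}{more}",
            UserWarning,
            stacklevel=2,
        )
    return kept


uploaded = {"rec-a", "rec-b"}  # toy stand-in for what is already on the Hub
usable = drop_unavailable(["rec-a", "rec-b", "rec-c"], lambda r: r in uploaded)
print(usable)  # ['rec-a', 'rec-b']
```

Grouping the message keeps construction output readable even when many records are still missing.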