Skip to content

Data Loading

The data loading module reads the training arrays written by the simulation I/O workflow and prepares them for PyTorch training. It is the bridge between the processed HDF5 dataset and the model/training utilities.

What This Module Does

  • Loads X, Y, and ell arrays from the condensed HDF5 /training group
  • Validates that loaded arrays have compatible shapes and finite values
  • Creates reproducible, fraction-based train/validation/test splits
  • Builds PyTorch DataLoader objects for each split
  • Optionally fits and applies feature-wise normalization using training split statistics only

This module specifically handles this step in the workflow: Load Training Data -> Create DataLoaders.

When To Use It

Use this module after the simulation I/O workflow has produced a condensed HDF5 file with a /training group. If your file only contains /sims and /cl, run build_and_write_training(...) first.

The expected HDF5 layout is:

training:
 ['X', 'Y', 'ell', 'param_names', 'sim_ids']

The required arrays are:

  • training/X: Input parameter matrix with shape (n_samples, n_parameters).
  • training/Y: Target spectrum matrix with shape (n_samples, n_ell_bins).
  • training/ell: Reference multipole bin centers with shape (n_ell_bins,).

Typical Workflow

from pathlib import Path

from reionemu import DataLoaderConfig, load_training_arrays, make_dataloaders

condensed_h5 = Path("path/to/condensed.h5")

X, Y, ell = load_training_arrays(condensed_h5)

config = DataLoaderConfig(
    batch_size=32,
    seed=42,
    shuffle_train=True,
    normalize_X=True,
    normalize_Y=False,
)

loaders, normalizers, ell = make_dataloaders(
    condensed_h5,
    split={"train": 0.8, "val": 0.2},
    config=config,
)

train_loader = loaders["train"]
val_loader = loaders["val"]
X_normalizer = normalizers["X"]
Y_normalizer = normalizers["Y"]

Load Training Arrays

load_training_arrays is the lowest-level public helper in this module. It reads the arrays from /training, casts them to float32, validates them, and returns NumPy arrays.

Purpose

Use this function when you want direct access to the training arrays without constructing PyTorch loaders.

Main Entry Point

def load_training_arrays(
    h5_path: Path,
) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
Parameter Type Default Description
h5_path Path Required Path to the condensed HDF5 file containing /training

Returns

This function returns a tuple:

Return Type Shape Description
X np.ndarray (n_samples, n_parameters) Input simulation parameters
Y np.ndarray (n_samples, n_ell_bins) Target spectra
ell np.ndarray (n_ell_bins,) Multipole bin centers for Y

All three arrays are returned as float32.

Validation

Before returning, the function verifies that:

  • X and Y are two-dimensional.
  • ell is one-dimensional.
  • X and Y contain the same number of samples.
  • The second dimension of Y matches the length of ell.
  • X, Y, and ell contain only finite values.

If any check fails, a ValueError is raised with a description of the failed condition.

Typical Usage

from pathlib import Path

from reionemu import load_training_arrays

condensed_h5 = Path("path/to/condensed.h5")

X, Y, ell = load_training_arrays(condensed_h5)

print(X.shape)
print(Y.shape)
print(ell.shape)

Make DataLoaders

make_dataloaders constructs PyTorch DataLoader objects from the HDF5 training arrays. It supports any split names provided in the split dictionary, as long as one of them is named "train".

Purpose

Use this function when training or evaluating models with PyTorch. It handles array loading, validation, reproducible sample splitting, optional normalization, tensor conversion, and loader construction.

Configuration

@dataclass
class DataLoaderConfig:
    batch_size: int = 32
    seed: int = 42
    shuffle_train: bool = True
    normalize_X: bool = True
    normalize_Y: bool = False
  • batch_size: Number of samples per batch in each DataLoader.
  • seed: Random seed used when assigning samples to splits.
  • shuffle_train: Whether the training loader shuffles batches each epoch.
  • normalize_X: Whether to standardize input parameters.
  • normalize_Y: Whether to standardize target spectra.

Main Entry Point

def make_dataloaders(
    h5_path: Path,
    *,
    split: Dict[str, float] = {"train": 0.8, "val": 0.2},
    config: DataLoaderConfig = DataLoaderConfig(),
) -> Tuple[Dict[str, DataLoader], Dict[str, Optional[Normalizer]], np.ndarray]:
Parameter Type Default Description
h5_path Path Required Path to the condensed HDF5 file containing /training
split Dict[str, float] {"train": 0.8, "val": 0.2} Fraction-based split definition
config DataLoaderConfig Defaults DataLoader and normalization configuration

Split Rules

The split dictionary must:

  • Include a "train" key.
  • Sum to 1.0.
  • Contain no negative fractions.

The split order follows the insertion order of the dictionary. For each split except the last, the number of samples is computed with round(fraction * n_samples). The final split receives the remaining samples so that every sample is assigned exactly once.

For example:

split = {"train": 0.7, "val": 0.15, "test": 0.15}

returns loaders with keys "train", "val", and "test".

Returns

This function returns a tuple:

Return Type Description
loaders Dict[str, DataLoader] PyTorch loaders keyed by split name
normalizers Dict[str, Optional[Normalizer]] Fitted normalizers for "X" and "Y", or None
ell np.ndarray Multipole bin centers loaded from the HDF5 file

Each loader returns batches of (X_batch, Y_batch) tensors.

Typical Usage

from pathlib import Path

from reionemu import DataLoaderConfig, make_dataloaders

condensed_h5 = Path("path/to/condensed.h5")

loaders, normalizers, ell = make_dataloaders(
    condensed_h5,
    split={"train": 0.7, "val": 0.15, "test": 0.15},
    config=DataLoaderConfig(
        batch_size=64,
        seed=123,
        shuffle_train=True,
        normalize_X=True,
        normalize_Y=True,
    ),
)

for X_batch, Y_batch in loaders["train"]:
    print(X_batch.shape, Y_batch.shape)
    break

Normalization

The Normalizer dataclass stores feature-wise mean and standard deviation arrays.

@dataclass
class Normalizer:
    mean: np.ndarray
    std: np.ndarray

When normalize_X=True, make_dataloaders fits a normalizer on the training rows of X and applies it to the full dataset before constructing split loaders. When normalize_Y=True, the same process is applied to Y.

Normalization is computed feature-wise along axis=0:

X_standardized = (X - normalizer.mean) / normalizer.std

If a feature has zero standard deviation in the training split, its stored standard deviation is replaced with 1.0 to avoid division by zero.

Why Training-Only Statistics Matter

Validation and test samples should not influence preprocessing statistics. Fitting the normalizers on the training split only keeps validation and test metrics from seeing information outside the training data.

Using Normalizers After Prediction

If normalize_Y=True, model outputs are in standardized target space. Convert predictions back to the original target scale before plotting or interpreting spectra:

from reionemu.data.normalization import inverse_transform_standardizer

Y_pred_original = inverse_transform_standardizer(
    Y_pred_standardized,
    normalizers["Y"],
)

Common Issues

  • KeyError: 'training': The HDF5 file does not contain a /training group. Run build_and_write_training(...) first.
  • Split fractions do not sum to 1.0: Adjust the split dictionary, for example {"train": 0.8, "val": 0.2}.
  • X and Y have mismatched sample counts: Rebuild the /training group from a consistent set of simulations.
  • Non-finite values found: Inspect the upstream /cl products and the target transform used by BuildXYConfig.