Simulation I/O¶
The simulation I/O module processes the raw simulation output and provides a uniform, reduced file structure. These methods are required for utilizing the rest of the package, since all later methods expect this specific data format and layout.
What This Module Does¶
- Condenses raw per-simulation HDF5 outputs into a single structured HDF5 file
- Computes flat-sky kSZ angular power spectra and writes them into the condensed file
- Builds ML-ready training arrays used by the rest of the package
This module specifically handles these steps in the workflow: Condense Simulation Output → Compute Power Spectra → Build Training Data.
When To Use It¶
Use this module when you are starting from raw or partially processed simulation output. If you already have a condensed HDF5 file with /cl and /training written, you can usually move on to the data loading and training modules instead.
The raw kSZ 2LPT simulation produces more data than is needed to emulate the kSZ angular power spectrum. The raw data is also split across two HDF5 files per simulation (obs_grids and pk_arrays), which can be merged into one. Removing the unneeded data and condensing to a single HDF5 file reduces the overall file size by 75%, making it easier to store, compress, and transfer.
The kSZ angular power spectrum is also not computed by the simulation code. Therefore, the next step in the process is computing the kSZ angular power spectrum and writing it into a group inside the condensed HDF5 file.
The last step in preparing the raw simulation is building the training arrays to allow for quick execution of training scripts. These training arrays store the input parameters (zmean_zre, alpha_zre, kb_zre, b0_zre), targets (binned kSZ angular power spectrum), and reference ell bins. There are important decisions that must be made when preparing the training data, such as whether the target should be the angular power spectrum or rescaled angular power spectrum, and whether the target should be transformed. Completing this step and saving it into the file allows for multiple different datasets for training that can be easily identified.
There are configuration classes for each of these steps, allowing the settings to be saved for reproducibility of experiments. Below are in-depth explanations of the main methods and configurations.
Typical Workflow¶
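from pathlib import Path

from reionemu import (
    BuildXYConfig,
    ClConfig,
    CondenseConfig,
    add_cl_to_condensed_h5,
    build_and_write_training,
    condense_sim_root,
)

raw_sim_root = Path("path/to/raw/simulations")
condensed_h5 = Path("path/to/condensed.h5")

# Step 1: condense the raw per-simulation files into a single HDF5 file.
condense_sim_root(
    sim_root=raw_sim_root,
    out_path=condensed_h5,
    config=CondenseConfig(),
)

# Step 2: compute the kSZ angular power spectra and write them to each sim's /cl group.
add_cl_to_condensed_h5(
    condensed_h5,
    config=ClConfig(),
)

# Step 3: build the training arrays and write them to the /training group.
build_and_write_training(
    condensed_h5,
    config=BuildXYConfig(),
)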
Condense H5¶
This is the first step in the workflow: it extracts the data, condenses it into a single HDF5 file, and verifies the payloads. Both per-simulation HDF5 files are read. From the pk_arrays file, pk_tt, xmval_list, zval_list, alpha_zre, b0_zre, kb_zre, zmean_zre, and tau are extracted. From the obs_grids file, ksz_map, Tcmb0, and theta_max_ksz are extracted.
Purpose¶
Use this step to convert raw per-simulation output into the condensed HDF5 layout expected by the rest of the pipeline.
Configuration¶
This dataclass provides options for overwrite control and an extra validation step.
@dataclass(frozen=True)
class CondenseConfig:
    overwrite: bool = True
    require_obs_and_pk: bool = True
- overwrite: When True, an existing condensed HDF5 file at out_path will be replaced. When False, condense_sim_root raises a FileExistsError if out_path already exists. This option does not append to an existing file or skip already-written simulations.
- require_obs_and_pk: When True, simulations missing either an obs_grids file or a pk_arrays file are counted as missing-file skips before reading. When False, the function attempts to read whichever files are present, but the payload is still validated before writing, so simulations missing required fields from either file will be skipped as validation errors. For most cases this should remain True.
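The overwrite behaviour described above can be sketched with a small guard. This is only an illustration of the documented semantics, not the package's code: prepare_output_path is a hypothetical helper, and the temporary file stands in for a real condensed HDF5 file.

```python
import tempfile
from pathlib import Path


def prepare_output_path(out_path: Path, overwrite: bool) -> Path:
    """Hypothetical guard mirroring the documented overwrite semantics."""
    if out_path.exists():
        if not overwrite:
            raise FileExistsError(f"{out_path} already exists and overwrite=False")
        out_path.unlink()  # the old file is replaced, never appended to
    return out_path


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "condensed.h5"

    # overwrite=True: the existing file is removed so a fresh one can be written.
    target.write_bytes(b"old contents")
    prepare_output_path(target, overwrite=True)
    replaced = not target.exists()

    # overwrite=False: an existing file triggers FileExistsError.
    target.write_bytes(b"old contents")
    try:
        prepare_output_path(target, overwrite=False)
        refused = False
    except FileExistsError:
        refused = True
```

Because neither mode appends, rerunning the condense step after adding new simulations means regenerating the whole file.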
Main Entry Point¶
This is the method that orchestrates the process of extracting the necessary data and condensing it into a single HDF5 file.
def condense_sim_root(
    sim_root: Path,
    out_path: Path,
    *,
    config: CondenseConfig = CondenseConfig(),
    sim_prefix: str = "sim",
    file_description: str = "Condensed simulation outputs for kSZ 2LPT emulator.",
    version: int = 1,
    progress_callback: Optional[Callable[[int, int], None]] = None,
) -> CondenseStats:
| Parameter | Type | Default | Description |
|---|---|---|---|
| sim_root | Path | Required | Path to the raw kSZ 2LPT simulation output |
| out_path | Path | Required | Path to the condensed output .h5 file |
| config | CondenseConfig | CondenseConfig() | Configuration dataclass |
| sim_prefix | str | "sim" | Only subfolders whose names start with this prefix are included |
| file_description | str | "Condensed simulation outputs for kSZ 2LPT emulator." | Description of the output file |
| version | int | 1 | Versioning option |
| progress_callback | Optional[Callable[[int, int], None]] | None | Optional callable (completed, total) called after each sim |
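A progress callback only needs to accept two integers. The sketch below shows one possible callback that both prints and records progress. make_recording_callback and fake_driver are hypothetical names, and fake_driver is a stand-in loop, not the real condense_sim_root. Whether the real function reports a 1-based completed count is an assumption here.

```python
from typing import Callable, List, Optional, Tuple


def make_recording_callback(log: List[Tuple[int, int]]) -> Callable[[int, int], None]:
    """Return a callback that records and prints each (completed, total) pair."""
    def _callback(completed: int, total: int) -> None:
        log.append((completed, total))
        print(f"processed {completed}/{total} simulations")
    return _callback


def fake_driver(
    n_sims: int,
    progress_callback: Optional[Callable[[int, int], None]] = None,
) -> None:
    """Stand-in for a function that reports progress after each simulation."""
    for i in range(n_sims):
        # ... per-simulation work would happen here ...
        if progress_callback is not None:
            progress_callback(i + 1, n_sims)  # assumed 1-based completed count


progress_log: List[Tuple[int, int]] = []
fake_driver(3, make_recording_callback(progress_log))
```

The same callable shape could be passed as progress_callback to hook up a progress bar or a logger.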
Returns¶
The method returns a CondenseStats object containing the number of sims written, sims skipped due to missing obs_grids or pk_arrays, sims skipped due to a read error, and a catch-all for sims skipped due to a validation error. There is also a property that returns the total sims skipped. The class is outlined below.
@dataclass(frozen=True)
class CondenseStats:
    written: int
    skipped_missing_obs_pk: int
    skipped_read_error: int
    skipped_validation_error: int

    @property
    def skipped_total(self) -> int:
        return (
            self.skipped_missing_obs_pk
            + self.skipped_read_error
            + self.skipped_validation_error
        )
After condense_sim_root, the file will have a structure like:
Top-Level:
['sims']
sims:
['sim0', 'sim1', ... , 'sim<n>']
sim<n>:
['output', 'params']
params:
['alpha_zre', 'b0_zre', 'kb_zre', 'zmean_zre']
output:
['Tcmb0', 'ksz_map', 'pk_tt', 'tau', 'theta_max_ksz', 'xmval_list', 'zval_list']
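To check that a file follows this layout, the groups and datasets can be listed with h5py. The sketch below builds a tiny in-memory stand-in with placeholder values instead of opening a real condensed file. list_layout is a hypothetical helper, and the file name is never written to disk.

```python
import h5py
import numpy as np


def list_layout(h5: h5py.File) -> list:
    """Collect every group and dataset path in the file, sorted."""
    paths = []
    h5.visit(paths.append)  # visit() passes each member name to the callable
    return sorted(paths)


# In-memory stand-in mimicking part of the layout above (placeholder values only).
with h5py.File("demo_condensed.h5", "w", driver="core", backing_store=False) as h5:
    sim = h5.create_group("sims/sim0")
    for name in ("alpha_zre", "b0_zre", "kb_zre", "zmean_zre"):
        sim.create_dataset(f"params/{name}", data=0.0)
    sim.create_dataset("output/ksz_map", data=np.zeros((4, 4)))
    sim.create_dataset("output/Tcmb0", data=0.0)
    layout = list_layout(h5)

print("\n".join(layout))
```

For a real file, opening it with h5py.File(path, "r") and calling list_layout would give the same kind of listing.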
Written Products¶
- params/alpha_zre, params/b0_zre, params/kb_zre, params/zmean_zre: The scalar reionization parameters used later as the input features for training.
- output/ksz_map: The kSZ map used later to compute the angular power spectrum.
- output/Tcmb0: The CMB temperature metadata used in the map-to-microkelvin conversion.
- output/theta_max_ksz: The angular size metadata used when computing the flat-sky multipole grid.
- output/pk_tt, output/xmval_list, output/zval_list, output/tau: Additional simulation products and metadata preserved in the condensed file for traceability and later use.
Typical Usage¶
from pathlib import Path

from reionemu import CondenseConfig, condense_sim_root

stats = condense_sim_root(
    sim_root=Path("path/to/raw/simulations"),
    out_path=Path("path/to/condensed.h5"),
    config=CondenseConfig(overwrite=True, require_obs_and_pk=True),
)

# Report how many simulations were written and how many were skipped.
print(stats.written, stats.skipped_total)
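The returned counts make it easy to flag a run where many simulations were dropped. The sketch below uses a local stand-in copy of the CondenseStats dataclass shown above so that it runs on its own. skipped_fraction and the example counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CondenseStats:
    """Local stand-in mirroring the documented dataclass."""
    written: int
    skipped_missing_obs_pk: int
    skipped_read_error: int
    skipped_validation_error: int

    @property
    def skipped_total(self) -> int:
        return (
            self.skipped_missing_obs_pk
            + self.skipped_read_error
            + self.skipped_validation_error
        )


def skipped_fraction(stats: CondenseStats) -> float:
    """Fraction of discovered simulations that were skipped for any reason."""
    total = stats.written + stats.skipped_total
    return stats.skipped_total / total if total else 0.0


# Made-up counts for illustration only.
example = CondenseStats(
    written=8,
    skipped_missing_obs_pk=1,
    skipped_read_error=0,
    skipped_validation_error=1,
)
fraction = skipped_fraction(example)
print(f"{fraction:.0%} of simulations were skipped")
```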
Compute kSZ Angular Power Spectrum¶
The kSZ 2LPT simulation does not compute the kSZ angular power spectrum, but it does provide the kSZ map (ksz_map) and the metadata needed to compute it. A flat-sky angular power spectrum is computed following a standard method.
Purpose¶
Use this step after condensation when you want to attach binned power-spectrum products to each simulation in the condensed file.
Configuration¶
A dataclass is used to adjust how the computation is done.
@dataclass(frozen=True)
class ClConfig:
    nbins: int = 5
    ell_cut: float = 1000.0
    overwrite: bool = True
    sims_group: str = "sims"
This dataclass provides four options, two of which are especially important in the calculation of the angular power spectrum.
- nbins: Controls the number of bins used when binning ell.
- ell_cut: Controls the minimum ell retained. Full-resolution bins with centers < ell_cut are discarded.
- overwrite: If True, any existing /cl group will be overwritten.
- sims_group: Name of the top-level sims group in the HDF5 file.
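To make the roles of ell_cut and nbins concrete, here is a rough sketch of one way to cut and then coarsen a binned spectrum. The grouping rule (contiguous groups from np.array_split) and the mode-count-weighted averages are assumptions for illustration, and rebin_by_mode_counts is a hypothetical helper. The package's actual rebinning may differ.

```python
import numpy as np


def rebin_by_mode_counts(ell, cl, counts, nbins):
    """Merge contiguous fine bins into at most `nbins` coarse bins."""
    ell, cl, counts = (np.asarray(a, dtype=float) for a in (ell, cl, counts))
    ell_c, cl_c, n_c = [], [], []
    for idx in np.array_split(np.arange(ell.size), nbins):
        if idx.size == 0:
            continue
        weights = counts[idx]  # assumes every fine bin has at least one mode
        ell_c.append(np.average(ell[idx], weights=weights))
        cl_c.append(np.average(cl[idx], weights=weights))
        n_c.append(weights.sum())
    ell_c, cl_c, n_c = np.array(ell_c), np.array(cl_c), np.array(n_c)
    dcl_c = cl_c / np.sqrt(n_c)  # same cl / sqrt(counts) idea, with total counts
    return ell_c, cl_c, dcl_c, n_c


# Made-up fine-resolution spectrum.
fine_ell = np.array([500.0, 1000.0, 2000.0, 3000.0, 4000.0])
fine_cl = np.array([9.0, 4.0, 2.0, 1.0, 1.0])
fine_counts = np.array([1, 1, 3, 2, 2])

keep = fine_ell >= 1000.0  # ell_cut: discard bins with centers below the cut
coarse_ell, coarse_cl, coarse_dcl, coarse_counts = rebin_by_mode_counts(
    fine_ell[keep], fine_cl[keep], fine_counts[keep], nbins=2
)
```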
Main Entry Point¶
This is the main method used to compute the angular power spectrum from the condensed HDF5 file and save the results back into the same file.
def add_cl_to_condensed_h5(
    h5_path: Path,
    *,
    config: ClConfig = ClConfig(),
    progress_callback: Optional[Callable[[int, int], None]] = None,
) -> int:
| Parameter | Type | Default | Description |
|---|---|---|---|
| h5_path | Path | Required | Path to the condensed .h5 file |
| config | ClConfig | ClConfig() | Configuration dataclass for power-spectrum computation |
| progress_callback | Optional[Callable[[int, int], None]] | None | Optional callable (completed, total) called after each sim |
Returns¶
This method writes the computed angular power spectrum products into each simulation's /cl group in the condensed HDF5 file. It returns the number of simulations that were updated.
Top-Level:
['sims']
sims:
['sim0', 'sim1', ... , 'sim<n>']
sim<n>:
['cl', 'output', 'params']
params:
['alpha_zre', 'b0_zre', 'kb_zre', 'zmean_zre']
output:
['Tcmb0', 'ksz_map', 'pk_tt', 'tau', 'theta_max_ksz', 'xmval_list', 'zval_list']
cl:
['cl_ksz', 'dcl', 'dl_ksz', 'ell']
Written Products¶
- cl/ell: The final binned multipole centers kept after the ell_cut filtering and any coarse rebinning.
- cl/cl_ksz: The binned C_ell values computed from the flat-sky power spectrum of the kSZ map.
- cl/dl_ksz: The corresponding D_ell values, computed from C_ell using D_ell = ell * (ell + 1) * C_ell / (2 * pi).
- cl/dcl: An uncertainty estimate per bin based on the number of Fourier modes contributing to that bin.
The code first builds a full-resolution spectrum, removes bins below ell_cut, and then rebins the remaining high-ell part into nbins coarse bins if needed. For the full-resolution bins, dcl is computed as cl / sqrt(counts), where counts is the number of Fourier modes in the bin. After coarse rebinning, the same idea is applied using the total mode counts in each coarse bin. The final cl and dcl values are then corrected by the Hann-window normalization factor.
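The steps above can be sketched with NumPy. This is a minimal illustration under stated assumptions, not the package's implementation: the map is taken to be square with side length theta_rad in radians, the bins are linear in ell, and the Hann correction divides by the mean squared window. flat_sky_cl and n_full_bins are hypothetical names. The coarse rebinning and the map-to-microkelvin conversion are left out.

```python
import numpy as np


def flat_sky_cl(sky_map, theta_rad, n_full_bins=20, ell_cut=1000.0):
    """Binned flat-sky C_ell, D_ell, and dcl for a square map (illustrative)."""
    n = sky_map.shape[0]
    assert sky_map.shape == (n, n), "this sketch assumes a square map"
    pix = theta_rad / n  # pixel size in radians

    # Apodize with a 2D Hann window and remember its power normalization.
    window = np.outer(np.hanning(n), np.hanning(n))
    w2 = np.mean(window**2)

    # Continuous-convention Fourier transform and per-mode power.
    fmap = np.fft.fft2(sky_map * window) * pix**2
    power = np.abs(fmap) ** 2 / theta_rad**2 / w2

    # Multipole of every Fourier mode: ell = 2 * pi * spatial frequency.
    freq = 2.0 * np.pi * np.fft.fftfreq(n, d=pix)
    ell_grid = np.hypot(*np.meshgrid(freq, freq, indexing="ij"))

    # Full-resolution binning: mean power and mode count per bin.
    edges = np.linspace(0.0, ell_grid.max(), n_full_bins + 1)
    counts, _ = np.histogram(ell_grid, bins=edges)
    sums, _ = np.histogram(ell_grid, bins=edges, weights=power)
    centers = 0.5 * (edges[:-1] + edges[1:])

    # Drop empty bins and bins whose centers fall below ell_cut.
    keep = (counts > 0) & (centers >= ell_cut)
    ell, counts = centers[keep], counts[keep]
    cl = sums[keep] / counts
    dcl = cl / np.sqrt(counts)
    dl = ell * (ell + 1.0) * cl / (2.0 * np.pi)
    return ell, cl, dl, dcl, counts


# White-noise demo map: its expected C_ell is flat and equal to pix**2.
rng = np.random.default_rng(0)
theta = np.deg2rad(5.0)
white = rng.normal(0.0, 1.0, size=(128, 128))
ell, cl, dl, dcl, counts = flat_sky_cl(white, theta)
```

A white-noise map is a convenient sanity check because unit-variance pixels should give a flat spectrum close to the squared pixel size.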
Typical Usage¶
from pathlib import Path

from reionemu import ClConfig, add_cl_to_condensed_h5

updated = add_cl_to_condensed_h5(
    Path("path/to/condensed.h5"),
    config=ClConfig(nbins=5, ell_cut=1000.0, overwrite=True),
)

# Number of simulations whose /cl group was written.
print(updated)
Build Training Data¶
This is the last main step in the simulation I/O workflow. After the condensed file has the /cl group for each simulation, the next step is building the training arrays used by the emulator training code. This step reads the simulation parameters and the selected power-spectrum product, checks that the ell bins are consistent across simulations, applies an optional target transform, and writes the result into a /training group.
This step is important because it converts the per-simulation structure into the ML-ready layout expected by the rest of the package. Instead of reading values from each sim one by one during training, the package can load a single set of arrays for inputs, targets, reference ell bins, parameter names, and simulation ids.
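The stacking and consistency checks described above can be sketched as follows. Plain dictionaries stand in for the per-simulation HDF5 groups. stack_training, make_sim, and all values are hypothetical, and the order of the checks is an assumption rather than the package's documented behaviour.

```python
import numpy as np

PARAM_NAMES = ("zmean_zre", "alpha_zre", "kb_zre", "b0_zre")


def stack_training(sims, param_names=PARAM_NAMES, y_source="dl_ksz"):
    """Stack per-simulation parameters and spectra into X and Y arrays."""
    rows_x, rows_y, ids = [], [], []
    skipped = {"missing_params": 0, "missing_cl": 0, "inconsistent_ell": 0, "non_finite": 0}
    ell_ref = None
    for sim_id, sim in sims.items():
        params, cl = sim.get("params", {}), sim.get("cl", {})
        if any(name not in params for name in param_names):
            skipped["missing_params"] += 1
            continue
        if "ell" not in cl or y_source not in cl:
            skipped["missing_cl"] += 1
            continue
        ell = np.asarray(cl["ell"], dtype=float)
        if ell_ref is None:
            ell_ref = ell  # the first usable simulation defines the reference bins
        elif ell.shape != ell_ref.shape or not np.allclose(ell, ell_ref):
            skipped["inconsistent_ell"] += 1
            continue
        x = np.array([params[name] for name in param_names], dtype=float)
        y = np.asarray(cl[y_source], dtype=float)
        if not (np.all(np.isfinite(x)) and np.all(np.isfinite(y))):
            skipped["non_finite"] += 1
            continue
        rows_x.append(x)
        rows_y.append(y)
        ids.append(sim_id)
    if not rows_x:
        raise ValueError("no usable simulations found")
    return np.vstack(rows_x), np.vstack(rows_y), ell_ref, ids, skipped


def make_sim(ell, dl, params=None):
    """Build a stand-in simulation record with placeholder parameter values."""
    if params is None:
        params = {"zmean_zre": 8.0, "alpha_zre": 0.2, "kb_zre": 1.0, "b0_zre": 0.6}
    return {"params": params, "cl": {"ell": ell, "dl_ksz": dl}}


demo_sims = {
    "sim0": make_sim([1500.0, 3000.0], [2.0, 3.0]),
    "sim1": make_sim([1500.0, 3000.0], [2.5, 3.5]),
    "sim2": make_sim([1500.0, 3500.0], [2.0, 3.0]),            # different ell bins
    "sim3": make_sim([1500.0, 3000.0], [float("nan"), 1.0]),   # non-finite target
    "sim4": make_sim([1500.0, 3000.0], [2.0, 3.0], params={"zmean_zre": 8.0}),
}
X, Y, ell_bins, sim_ids, skip_counts = stack_training(demo_sims)
```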
Purpose¶
Use this step when you want to convert the per-simulation condensed layout into the array-based format used by the training and evaluation code.
Configuration¶
This dataclass controls which groups are read, which target is used, and whether the target should be transformed before training.
@dataclass(frozen=True)
class BuildXYConfig:
    sims_group: str = "sims"
    params_group: str = "params"
    cl_group: str = "cl"
    param_names: Tuple[str, ...] = ("zmean_zre", "alpha_zre", "kb_zre", "b0_zre")
    y_source: str = "dl_ksz"
    y_transform: str = "ln"
    eps: float = 1e-30
- sims_group: Name of the top-level sims group in the HDF5 file.
- params_group: Name of the parameter subgroup under each simulation.
- cl_group: Name of the power-spectrum subgroup under each simulation.
- param_names: Ordered parameter names used to construct the input matrix X.
- y_source: Which power-spectrum product should be used as the training target. This is typically dl_ksz or cl_ksz.
- y_transform: Optional transform applied to the target values. The current options are none, log10, and ln.
- eps: Small constant added before a logarithm is applied, mainly to avoid issues around zero.
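As a rough illustration of the y_transform and eps options, the sketch below applies each transform and its inverse. It assumes the transform has the form log(y + eps), which is one reading of "added before a logarithm is applied". apply_y_transform and invert_y_transform are hypothetical helpers, not functions from the package.

```python
import numpy as np


def apply_y_transform(y, transform="ln", eps=1e-30):
    """Apply one of the documented transform options (assumed form: log(y + eps))."""
    y = np.asarray(y, dtype=float)
    if transform == "none":
        return y
    if transform == "ln":
        return np.log(y + eps)
    if transform == "log10":
        return np.log10(y + eps)
    raise ValueError(f"unknown y_transform: {transform!r}")


def invert_y_transform(z, transform="ln", eps=1e-30):
    """Map transformed targets (e.g. emulator predictions) back to the original scale."""
    z = np.asarray(z, dtype=float)
    if transform == "none":
        return z
    if transform == "ln":
        return np.exp(z) - eps
    if transform == "log10":
        return 10.0**z - eps
    raise ValueError(f"unknown y_transform: {transform!r}")


targets = np.array([0.0, 1.0, 10.0])  # a zero entry shows why eps is useful
z_ln = apply_y_transform(targets, "ln")
roundtrip = invert_y_transform(z_ln, "ln")
```

Whichever transform is used for training has to be inverted when comparing predictions against the original spectra.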
Main Entry Point¶
This is the main method used to construct the in-memory training arrays from the condensed HDF5 file.
def build_training_arrays(
    h5_path: Path,
    *,
    config: BuildXYConfig = BuildXYConfig(),
) -> Tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray, np.ndarray, BuildStats]:
| Parameter | Type | Default | Description |
|---|---|---|---|
| h5_path | Path | Required | Path to the condensed .h5 file |
| config | BuildXYConfig | BuildXYConfig() | Configuration dataclass for building the training data |
Returns¶
The method returns the input matrix X, target matrix Y, reference ell bin centers, simulation ids, parameter names, and a BuildStats object. This allows the arrays to be inspected or modified before they are written to disk.
@dataclass(frozen=True)
class BuildStats:
    total_sims: int
    processed: int
    skipped_missing_params: int
    skipped_missing_cl: int
    skipped_inconsistent_ell: int
    skipped_non_finite: int
This stats object keeps track of how many simulations were processed and how many were skipped for different reasons. This is useful because it makes it easier to detect if a large part of the dataset was excluded due to missing parameters, missing power-spectrum products, inconsistent ell bins, or non-finite values.
Write Entry Point¶
To build the arrays and write them into the condensed HDF5 file in a single call, use the convenience method below.
def build_and_write_training(
    h5_path: Path,
    *,
    config: BuildXYConfig = BuildXYConfig(),
    overwrite: bool = True,
) -> int:
| Parameter | Type | Default | Description |
|---|---|---|---|
| h5_path | Path | Required | Path to the condensed .h5 file |
| config | BuildXYConfig | BuildXYConfig() | Configuration dataclass for building the training data |
| overwrite | bool | True | If True, an existing /training group will be overwritten |
Returns¶
This method returns the number of simulations included in the written training dataset. Unlike build_training_arrays, this method writes the results directly into the /training group of the condensed HDF5 file.
After build_and_write_training, the training group will have a structure like:
training:
['X', 'Y', 'ell', 'param_names', 'sim_ids']
X:
[array([zmean_zre, alpha_zre, kb_zre, b0_zre]), ...]
Y:
[array([...]), ...]
ell:
[np.float64(...), np.float64(...), ...]
param_names:
[b'zmean_zre', b'alpha_zre', b'kb_zre', b'b0_zre']
sim_ids:
[b'sim0', b'sim1', b'sim2', ... , b'sim<n>']
Written Products¶
- training/X: The input feature matrix built from BuildXYConfig.param_names.
- training/Y: The target matrix built from BuildXYConfig.y_source, with the optional transform from BuildXYConfig.y_transform applied.
- training/ell: The reference ell bin centers matching the columns of Y.
- training/param_names: The ordered parameter names used to build the columns of X.
- training/sim_ids: The simulation ids included in the final dataset.
The /training group also stores metadata attributes describing the build configuration, including y_source, y_transform, eps, the number of samples, the number of parameters, the ell bin count, and the skip statistics when they are available.
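Reading the /training group back with h5py is a quick way to confirm what was written. The sketch below builds a small in-memory stand-in with placeholder arrays rather than opening a real file. load_training is a hypothetical helper, and the attribute names are assumed to match the configuration fields listed above.

```python
import h5py
import numpy as np


def load_training(h5: h5py.File, group: str = "training") -> dict:
    """Read the training arrays, decode byte strings, and collect attributes."""
    grp = h5[group]
    return {
        "X": grp["X"][...],
        "Y": grp["Y"][...],
        "ell": grp["ell"][...],
        "param_names": [n.decode("utf-8") for n in grp["param_names"][...]],
        "sim_ids": [s.decode("utf-8") for s in grp["sim_ids"][...]],
        "attrs": dict(grp.attrs),
    }


# In-memory stand-in with placeholder values; nothing is written to disk.
with h5py.File("demo_training.h5", "w", driver="core", backing_store=False) as h5:
    grp = h5.create_group("training")
    grp.create_dataset("X", data=np.zeros((2, 4)))
    grp.create_dataset("Y", data=np.zeros((2, 3)))
    grp.create_dataset("ell", data=np.array([1500.0, 3000.0, 4500.0]))
    grp.create_dataset(
        "param_names",
        data=np.array([b"zmean_zre", b"alpha_zre", b"kb_zre", b"b0_zre"]),
    )
    grp.create_dataset("sim_ids", data=np.array([b"sim0", b"sim1"]))
    grp.attrs["y_source"] = "dl_ksz"      # assumed attribute names
    grp.attrs["y_transform"] = "ln"
    training = load_training(h5)

print(training["param_names"], training["attrs"]["y_transform"])
```

Checking that X has one column per parameter name and Y has one column per ell bin catches most layout mistakes early.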
Typical Usage¶
from pathlib import Path

from reionemu import BuildXYConfig, build_and_write_training

count = build_and_write_training(
    Path("path/to/condensed.h5"),
    config=BuildXYConfig(y_source="dl_ksz", y_transform="ln"),
    overwrite=True,
)

# Number of simulations included in the written training dataset.
print(count)
Notes¶
In practice, the full simulation I/O workflow is usually run in this order:
1. condense_sim_root
2. add_cl_to_condensed_h5
3. build_and_write_training
At the end of these steps, the condensed HDF5 file contains the raw outputs needed for traceability, the computed power-spectrum products, and the final training arrays used by the rest of the package.
Another useful point to keep in mind is that this stage is where many data-quality issues will first show up clearly. Missing files, missing groups, inconsistent ell bins, and non-finite values will all surface here, so this part of the workflow is also an important validation step before training begins.