Artifacts¶
The artifact module saves lightweight experiment records for reproducibility. It keeps the condensed HDF5 dataset as a data product and writes experiment metadata to a separate run directory.
What This Module Does¶
- Creates an artifact directory for one experiment run
- Writes JSON files for run identity, configs, and results
- Records which HDF5 dataset was used without modifying the HDF5 file
- Saves `Normalizer` objects as a NumPy `.npz` sidecar
- Saves PyTorch checkpoints as `.pt` files
- Provides helpers for reading JSON and loading saved normalizers
This module specifically handles the final step in the workflow: Save Artifact.
When To Use It¶
Use this module after preparing a dataset, training a model, evaluating a model, or running a tuning/cross-validation workflow. It is intended for simple local experiment tracking, not as a replacement for a full experiment database.
The recommended path is:
- Prepare or load a condensed HDF5 training dataset.
- Train or evaluate a model.
- Call `save_artifact(...)` with the configs, results, normalizers, and checkpoint you want to keep.
Output Layout¶
A typical artifact directory looks like this:
```
artifacts/
    baseline_four_param/
        info.json
        configs.json
        results.json
        normalizers.npz
        model.pt
```
The JSON files are human-readable. The `.npz` and `.pt` files are sidecars for data that should not be forced into JSON.
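Because the layout is flat, a run can be inspected directly. A minimal sketch, assuming the example run above exists under `artifacts/`:

```python
from pathlib import Path

artifact_dir = Path("artifacts") / "baseline_four_param"

# The JSON files are plain text; the .npz and .pt sidecars are binary.
for path in sorted(artifact_dir.iterdir()):
    print(f"{path.name}: {path.stat().st_size} bytes")
```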
Typical Workflow¶
```python
from pathlib import Path
import torch
from reionemu import (
DataLoaderConfig,
FitConfig,
FourParamEmulator,
fit,
make_dataloaders,
save_artifact,
)
h5_path = Path("path/to/condensed.h5")
dataloader_config = DataLoaderConfig(batch_size=32, seed=42)
fit_config = FitConfig(epochs=100, device="cpu")
loaders, normalizers, ell = make_dataloaders(
h5_path,
split={"train": 0.8, "val": 0.2},
config=dataloader_config,
)
model = FourParamEmulator()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
history = fit(
model=model,
train_loader=loaders["train"],
val_loader=loaders["val"],
optimizer=optimizer,
loss_fn=torch.nn.MSELoss(),
config=fit_config,
)
artifact_dir = save_artifact(
"baseline_four_param",
Path("artifacts"),
dataset_path=h5_path,
dataloader_config=dataloader_config,
fit_config=fit_config,
model_config={
"class_name": "FourParamEmulator",
"input_dim": 4,
"output_dim": 5,
},
optimizer_config={
"name": "AdamW",
"lr": 1e-3,
},
history=history,
normalizers=normalizers,
checkpoint=model.state_dict(),
)
print(artifact_dir)
```
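Loading the run back uses the helpers documented below. A hedged sketch that continues the example above, assuming `read_json` and `load_normalizers` are exported from `reionemu` like the other names, and that the checkpoint is the plain `state_dict` saved above:

```python
from reionemu import load_normalizers, read_json

# JSON metadata round-trips directly.
info = read_json(artifact_dir / "info.json")
print(info["experiment_name"], info["created_at"])

# The .npz sidecar restores the Normalizer objects.
normalizers = load_normalizers(artifact_dir / "normalizers.npz")

# The checkpoint uses standard PyTorch serialization.
restored = FourParamEmulator()
restored.load_state_dict(torch.load(artifact_dir / "model.pt"))
```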
save_artifact¶
`save_artifact` is the high-level entry point: it creates the artifact directory and writes a file for each artifact you provide.
```python
def save_artifact(
name: str,
root_dir: Path,
*,
dataset_path: Path | None = None,
condense_config: Any = None,
cl_config: Any = None,
build_config: Any = None,
dataloader_config: Any = None,
fit_config: Any = None,
kfold_config: Any = None,
model_config: Mapping[str, Any] | None = None,
optimizer_config: Mapping[str, Any] | None = None,
tuning_config: Mapping[str, Any] | None = None,
results_summary: Mapping[str, Any] | None = None,
metrics: Mapping[str, Any] | None = None,
history: Mapping[str, Any] | None = None,
dataset_prep_stats: Mapping[str, Any] | None = None,
normalizers: Mapping[str, Normalizer | None] | None = None,
checkpoint: Any = None,
description: str | None = None,
) -> Path:
```
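Everything after `root_dir` is keyword-only and optional, so a minimal call can record just the pieces a run produced. A sketch with a hypothetical run name:

```python
from pathlib import Path
from reionemu import save_artifact

artifact_dir = save_artifact(
    "quick_smoke_test",  # hypothetical run name; becomes the directory name
    Path("artifacts"),
    metrics={"val_loss": 0.3},
    description="Metrics-only smoke test",
)
```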
Main Arguments¶
| Parameter | Description |
|---|---|
| `name` | Run name and artifact directory name |
| `root_dir` | Parent directory where artifacts are stored |
| `dataset_path` | Optional path to the HDF5 dataset used by the experiment |
| `condense_config` | Optional `CondenseConfig` used to make the dataset |
| `cl_config` | Optional `ClConfig` used to compute spectra |
| `build_config` | Optional `BuildXYConfig` used to build training arrays |
| `dataloader_config` | Optional `DataLoaderConfig` used for dataloaders |
| `fit_config` | Optional `FitConfig` used for model training |
| `kfold_config` | Optional `KFoldConfig` used for cross-validation |
| `model_config` | Optional model architecture dictionary |
| `optimizer_config` | Optional optimizer dictionary |
| `tuning_config` | Optional tuning/search dictionary |
| `results_summary` | Optional high-level result summary |
| `metrics` | Optional scalar metrics dictionary |
| `history` | Optional training history dictionary |
| `dataset_prep_stats` | Optional preprocessing statistics |
| `normalizers` | Optional mapping such as `{"X": x_norm, "Y": y_norm}` |
| `checkpoint` | Optional PyTorch checkpoint or model `state_dict` |
| `description` | Optional human-readable run description |
Returns¶
`save_artifact` returns the created artifact directory as a `Path`.
JSON Files¶
info.json¶
`info.json` is the manifest for the run. It records the run name, creation time, dataset summary, and the artifact files that were written.
```json
{
"artifacts": {
"configs": "configs.json",
"model_checkpoint": "model.pt",
"normalizers": "normalizers.npz",
"results": "results.json"
},
"created_at": "2026-04-27T18:30:12.123456+00:00",
"dataset": {
"fingerprint": {
"file_size_bytes": 123456789,
"modified_at": "2026-04-27T17:10:05.000000+00:00",
"path": "/path/to/condensed.h5"
},
"n_parameters": 4,
"n_samples": 100,
"n_targets": 5,
"param_names": ["zmean_zre", "alpha_zre", "kb_zre", "b0_zre"],
"path": "/path/to/condensed.h5"
},
"description": "Baseline four-parameter emulator",
"experiment_name": "baseline_four_param",
"run_id": "baseline_four_param",
"schema_version": 1
}
```
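The stored fingerprint can be compared against the file on disk to detect a dataset that changed after the run. A hedged sketch, assuming `file_fingerprint` (see the helper table below) returns the same keys that appear under `dataset.fingerprint`:

```python
from pathlib import Path
from reionemu import file_fingerprint, read_json

info = read_json(Path("artifacts/baseline_four_param/info.json"))
stored = info["dataset"]["fingerprint"]

# Recompute and compare; a mismatch means the HDF5 file was touched.
current = file_fingerprint(Path(stored["path"]))
if current != stored:
    print("dataset file changed since this artifact was saved")
```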
configs.json¶
`configs.json` stores the configuration choices made before the run.
```json
{
"data_loading": {
"dataloader": {
"batch_size": 32,
"normalize_X": true,
"normalize_Y": false,
"seed": 42,
"shuffle_train": true
}
},
"dataset_prep": {
"build_xy": null,
"cl": null,
"condense": null
},
"kfold": null,
"model": {
"class_name": "FourParamEmulator",
"input_dim": 4,
"output_dim": 5
},
"optimizer": {
"lr": 0.001,
"name": "AdamW"
},
"schema_version": 1,
"training": {
"device": "cpu",
"early_stopping_patience": null,
"epochs": 100,
"gradient_clipping": null
},
"tuning": null
}
```
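Because the `dataloader` block mirrors `DataLoaderConfig`, a saved config can seed a reproduction run. A sketch, assuming `DataLoaderConfig` accepts these recorded fields as keyword arguments:

```python
from pathlib import Path
from reionemu import DataLoaderConfig, read_json

configs = read_json(Path("artifacts/baseline_four_param/configs.json"))

# Rebuild the dataloader settings exactly as they were recorded.
dataloader_config = DataLoaderConfig(**configs["data_loading"]["dataloader"])
```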
results.json¶
`results.json` stores the outputs produced by the run.
```json
{
"created_at": "2026-04-27T18:30:12.234567+00:00",
"dataset_prep_stats": {},
"history": {
"train_loss": [0.6, 0.4, 0.25],
"val_loss": [0.7, 0.5, 0.3]
},
"metrics": {
"val_loss": 0.3,
"val_rmse": 0.12
},
"schema_version": 1,
"status": "completed",
"summary": {
"best_epoch": 47,
"best_val_loss": 0.3
}
}
```
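A consistent `results.json` shape makes run comparison a short script. A sketch, assuming each subdirectory of `artifacts/` is a completed run that recorded a `val_loss` metric:

```python
from pathlib import Path
from reionemu import read_json

scores = {}
for run_dir in Path("artifacts").iterdir():
    results_path = run_dir / "results.json"
    if results_path.exists():
        scores[run_dir.name] = read_json(results_path)["metrics"]["val_loss"]

# Best run first.
for name, val_loss in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name}: val_loss={val_loss:.4f}")
```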
Lower-Level Helpers¶
Use these helpers when you want to control individual artifact files yourself; a composed example follows the table.
| Helper | Purpose |
|---|---|
| `create_artifact_dir(name, root_dir)` | Create the run directory |
| `save_configs(artifact_dir, ...)` | Write `configs.json` |
| `save_results(artifact_dir, ...)` | Write `results.json` |
| `save_info(artifact_dir, ...)` | Write `info.json` |
| `save_normalizers(artifact_dir, normalizers)` | Write `normalizers.npz` |
| `load_normalizers(path)` | Read `normalizers.npz` |
| `save_model_checkpoint(artifact_dir, checkpoint)` | Write `model.pt` |
| `dataset_summary(h5_path)` | Summarize the HDF5 training dataset |
| `file_fingerprint(path)` | Record path, file size, and modified time |
| `read_json(path)` | Read an artifact JSON file |
Notes¶
- The artifact system does not write into the condensed HDF5 file.
- JSON files are intended for readable metadata, configs, metrics, and histories.
- Normalizers are saved as `.npz` because they contain NumPy arrays.
- Model checkpoints are saved as `.pt` because they use PyTorch serialization.
- `dataset_summary(...)` reads the HDF5 file in read-only mode and records basic `/training` metadata when available.