Integrations for third-party trainers¶
methodic is training-framework-agnostic: you bring your own loop — a five-step
numpy fit, a HuggingFace Trainer, a hand-rolled PyTorch/JAX loop — and Chronicle
records the run. The run lifecycle covers start →
succeed/fail and heartbeats. This page covers the two integration points a
serious trainer wants on top of that:
- Environment snapshotting — capture what ran (W&B link, library/hardware versions, your own config state) so a run is reproducible and its metrics are discoverable.
- Checkpointing & resume — upload durable checkpoints during training, and resume from the latest one after a crash or a continuation run.
All of it is layered on the same Run handle (chronicle.run(exp_id, variation,
run_idx)); none of it prescribes how you train.
Environment snapshotting¶
Record three things at the start of a run, before the training loop:
1. Link the W&B run¶
If you log metrics to Weights & Biases, link the W&B run so Chronicle (and
downstream distillation) can pull the real metric history. Call wandb.init()
first so the run id exists, then pass its id, entity, and project (and optionally the dashboard URL) to run.start():
import wandb
run = chronicle.run(exp_id, variation, run_idx)
wb = wandb.init(project="ripple-pde", config={"lr": 3e-4, "seed": 1234})
run.start(
    wandb_run_id=wb.id,
    wandb_entity=wb.entity,
    wandb_project=wb.project,
    wandb_dashboard_url=wb.get_url(),
)
run.start() records the (experiment, variation, run) → W&B run pointer. The
project falls back server-side to the experiment's configured wandb_project, so
at minimum pass wandb_run_id + wandb_entity. No W&B? Just call run.start()
with no arguments.
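When W&B is optional in a trainer, it can help to build the run.start() arguments in one place. The helper below is a hypothetical sketch (wandb_link_kwargs is not part of the SDK). It assumes only the keyword names shown above and the id, entity, project, and get_url() members of a W&B run object, and it returns an empty dict when there is no W&B run, so run.start(**kwargs) degrades to the no-argument call.

```python
from typing import Any, Optional


def wandb_link_kwargs(wb: Optional[Any]) -> dict:
    """Build keyword arguments for run.start() from a W&B run object.

    Hypothetical helper, not part of the SDK. Returns {} when `wb` is None,
    so `run.start(**wandb_link_kwargs(wb))` also covers the no-W&B case.
    """
    if wb is None:
        return {}
    # Run id and entity are the documented minimum.
    kwargs = {"wandb_run_id": wb.id, "wandb_entity": wb.entity}
    # The project is optional: the server falls back to the experiment's
    # configured wandb_project when it is omitted.
    if getattr(wb, "project", None):
        kwargs["wandb_project"] = wb.project
    # The dashboard URL is included only if the object can provide one.
    get_url = getattr(wb, "get_url", None)
    if callable(get_url):
        kwargs["wandb_dashboard_url"] = get_url()
    return kwargs
```

Typical use would be run.start(**wandb_link_kwargs(wb)), with wb set to None when W&B logging is disabled.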
2. Snapshot the environment¶
run.upload_environment(dict) stores any JSON-serializable mapping as an
environment asset linked to the run as an output. Capture whatever makes the
run reproducible — library versions, hardware, the git SHA of your training code:
import platform, subprocess, torch
run.upload_environment({
    "python": platform.python_version(),
    "torch": torch.__version__,
    "cuda": torch.version.cuda,
    "gpus": [torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())],
    "git_sha": subprocess.check_output(["git", "rev-parse", "HEAD"]).decode().strip(),
    "packages": subprocess.check_output(["pip", "freeze"]).decode().splitlines(),
})
3. Custom configuration state + run-specific info¶
The environment dict is free-form — namespace your own keys and put trainer config, resolved hyperparameters, dataset fingerprints, or anything else run-specific alongside the standard fields:
run.upload_environment({
    # ... standard fields above ...
    "trainer": {"framework": "my-trainer", "version": "2.3.1", "mixed_precision": "bf16"},
    "hyperparameters": resolved_config, # your fully-resolved config
    "dataset": {"name": "ripple-v3", "sha256": dataset_hash, "n_examples": 50_000},
    "seed": 1234,
})
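The dataset_hash value above is left to the caller. One way to produce it, assuming the dataset is a single file on disk, is a streamed SHA-256 digest. The helper name file_sha256 is hypothetical; a multi-file dataset would need a convention of its own, such as hashing a sorted manifest.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel stops at EOF, when read() returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Reading in chunks keeps memory use flat regardless of dataset size.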
If you'd rather keep a payload as its own typed asset (e.g. a large resolved
config you want to find on its own), upload it directly and attach structured
metadata via asset_config:
run.upload_asset(
    asset_type="trainer_config",
    content=resolved_config, # inline JSON, auto-finalized
    asset_config={"schema_version": 3},
)
See the assets guide for inline vs. presigned upload paths.
Checkpointing & resume¶
Saving checkpoints¶
Write your checkpoint to a directory, then hand the directory to the SDK: it
presigns, uploads each file directly to cloud storage, and finalizes on a
background thread, recording upload state in a local SQLite
UploadTracker so that an upload interrupted by a crash can resume:
from pathlib import Path
from methodic import UploadTracker
tracker = UploadTracker(db_path=Path("./uploads.sqlite"))
def checkpoint(step: int) -> None:
    ckpt = Path(f"./out/checkpoint-{step}")
    ckpt.mkdir(parents=True, exist_ok=True)
    save_state(ckpt) # ← your framework writes the files
    run.upload_directory_async(ckpt, asset_type="checkpoint", upload_tracker=tracker)
# ... call checkpoint(step) periodically; uploads drain in the background ...
run.succeed() # blocks until pending uploads finish
save_state(ckpt) is the only framework-specific line.
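As one illustration, here is a minimal save_state for a hand-rolled trainer whose state is JSON-serializable. It is a sketch under assumptions: the state.json file name and the extra state argument are choices made for this example, not SDK conventions, and a PyTorch or HuggingFace trainer would call its own save routine here instead.

```python
import json
import os
from pathlib import Path


def save_state(ckpt_dir: Path, state: dict) -> Path:
    """Write `state` as JSON into `ckpt_dir` and return the file path.

    The file is written under a temporary name and then renamed, so a
    crash mid-write never leaves a truncated state.json behind.
    """
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    target = ckpt_dir / "state.json"  # file name is an assumption
    tmp = ckpt_dir / "state.json.tmp"
    tmp.write_text(json.dumps(state))
    os.replace(tmp, target)  # atomic when both paths share a filesystem
    return target
```

Finishing the rename before the directory is handed to the uploader means the upload only ever sees complete files.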
Convention: use asset_type="checkpoint" for resumable mid-training state
and asset_type="snapshot" for the final saved model (e.g. HF
save_pretrained()). The managed menlo-park runner uses
exactly these two types.
Resuming¶
A continuation run resumes from the latest ready checkpoint produced by an
earlier run of the same variation. run.latest_output() finds it — by
default it searches across all runs of the variation (the right scope, since the
checkpoint came from a prior run), filters to finalized (state == "ready")
assets, and returns the newest:
from pathlib import Path
run = chronicle.run(exp_id, variation, run_idx)
run.start()
resume_dir = Path("./resume")
ckpt = run.latest_output("checkpoint") # newest ready checkpoint, or None
if ckpt:
    run.download_asset(ckpt["id"], resume_dir) # streams every component to disk
Then load it with your framework.
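For a trainer that stores its state as a single JSON file, loading might look like the sketch below. The names load_state and starting_step, the state.json file name, and the "step" key are assumptions made for this example. It also assumes the downloaded files land directly under the resume directory, which is worth verifying against your own checkpoints.

```python
import json
from pathlib import Path
from typing import Optional


def load_state(resume_dir: Path) -> Optional[dict]:
    """Return the saved state from `resume_dir`, or None if there is none."""
    state_file = resume_dir / "state.json"  # file name is an assumption
    if not state_file.is_file():
        return None
    return json.loads(state_file.read_text())


def starting_step(resume_dir: Path) -> int:
    """Return the step to resume from: 0 on a fresh start, else saved step + 1."""
    state = load_state(resume_dir)
    return 0 if state is None else state["step"] + 1
```

Returning None when nothing was downloaded lets the same code path serve both a fresh run and a continuation.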
Re-downloading and re-uploading are safe: checkpoint uploads finalize atomically, and finalize is idempotent (see resumable uploads).
Listing outputs directly¶
latest_output() is sugar over the output-listing endpoints. To inspect or pick
a checkpoint yourself:
# Across all runs of the variation (resume scope), newest-first:
outputs = run.list_outputs(across_runs=True)
# Just this run's outputs:
outputs = run.list_outputs()
# From a Variation handle (researcher side):
outputs = chronicle.variation(exp_id, variation).list_outputs()
Each entry is an asset dict with id, asset_type, name, state,
components, and created_at. Agents driving Chronicle through the bundled
plugin MCP server can list the same way with
chronicle.list_outputs(experiment_id, variation=…, run=…).
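To pick a checkpoint by hand, the selection that latest_output() is described as performing can be approximated over the listed entries. The sketch below is not the SDK's implementation: pick_checkpoint is a hypothetical helper that relies on the asset_type and state fields named above and on the newest-first ordering of the listing.

```python
from typing import Iterable, Optional


def pick_checkpoint(outputs: Iterable[dict], asset_type: str = "checkpoint") -> Optional[dict]:
    """Return the first finalized asset of `asset_type`, or None.

    Assumes `outputs` is ordered newest-first, so the first match is the
    most recent ready checkpoint.
    """
    for asset in outputs:
        if asset.get("asset_type") == asset_type and asset.get("state") == "ready":
            return asset
    return None
```

With a listing in hand, this would be called as pick_checkpoint(run.list_outputs(across_runs=True)). Extra conditions, such as skipping assets from a particular run, can be added to the same loop.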
Resuming from an input-linked checkpoint¶
Alternatively, a researcher or agent can explicitly link a checkpoint as a variation input — e.g. to branch training from another variation's checkpoint. Read it from the variation's inputs instead of its outputs: