Integrations for third-party trainers

methodic is training-framework-agnostic: you bring your own loop (a five-step numpy fit, a HuggingFace Trainer, a hand-rolled PyTorch/JAX loop) and Chronicle records the run. The run lifecycle covers start, succeed/fail, and heartbeats. This page covers the two integration points a serious trainer wants on top of that:

  1. Environment snapshotting — capture what ran (W&B link, library/hardware versions, your own config state) so a run is reproducible and its metrics are discoverable.
  2. Checkpointing & resume — upload durable checkpoints during training, and resume from the latest one after a crash or a continuation run.

All of it is layered on the same Run handle (chronicle.run(exp_id, variation, run)); none of it prescribes how you train.

Environment snapshotting

Record three things at the start of a run, before the training loop:

1. Link the W&B run

If you log metrics to Weights & Biases, link the W&B run so Chronicle (and downstream distillation) can pull the real metric history. Call wandb.init() first so the run id exists, then pass the run's id, entity, project, and dashboard URL to run.start():

import wandb

run = chronicle.run(exp_id, variation, run_idx)

wb = wandb.init(project="ripple-pde", config={"lr": 3e-4, "seed": 1234})
run.start(
    wandb_run_id=wb.id,
    wandb_entity=wb.entity,
    wandb_project=wb.project,
    wandb_dashboard_url=wb.get_url(),
)

run.start() records the (experiment, variation, run) → W&B run pointer. The project falls back server-side to the experiment's configured wandb_project, so at minimum pass wandb_run_id + wandb_entity. No W&B? Just call run.start() with no arguments.

2. Snapshot the environment

run.upload_environment(dict) stores any JSON-serializable mapping as an environment asset linked to the run as an output. Capture whatever makes the run reproducible — library versions, hardware, the git SHA of your training code:

import platform
import subprocess
import sys

import torch

run.upload_environment({
    "python": platform.python_version(),
    "torch": torch.__version__,
    "cuda": torch.version.cuda,                   # None on CPU-only builds
    "gpus": [torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())],
    "git_sha": subprocess.check_output(["git", "rev-parse", "HEAD"]).decode().strip(),
    # Use this interpreter's pip so the list matches the environment that trains.
    "packages": subprocess.check_output([sys.executable, "-m", "pip", "freeze"]).decode().splitlines(),
})
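
The snippet above assumes torch is installed, the code runs inside a git checkout, and pip is available. A minimal stdlib-only sketch that degrades gracefully when any of those is missing might look like the following. The helper names collect_environment and _git_sha are hypothetical, not part of the SDK; only the resulting dict is what you would hand to run.upload_environment().

```python
import platform
import subprocess
import sys
from importlib import metadata
from typing import Optional


def _git_sha() -> Optional[str]:
    """Return the current commit SHA, or None outside a git checkout."""
    try:
        out = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], stderr=subprocess.DEVNULL
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return out.decode().strip()


def collect_environment() -> dict:
    """Build a JSON-serializable environment mapping (hypothetical helper)."""
    packages = {
        f"{d.metadata.get('Name')}=={d.metadata.get('Version')}"
        for d in metadata.distributions()  # distributions visible to this interpreter
    }
    env = {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "executable": sys.executable,
        "git_sha": _git_sha(),
        "packages": sorted(packages),
    }
    try:
        import torch  # optional: only recorded when installed
    except ImportError:
        return env
    env["torch"] = torch.__version__
    env["cuda"] = torch.version.cuda  # None on CPU-only builds
    env["gpus"] = [
        torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())
    ]
    return env
```

With this in place the call site reduces to run.upload_environment(collect_environment()), and a missing git binary or a CPU-only machine produces a null field rather than an exception before training starts.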

3. Custom configuration state + run-specific info

The environment dict is free-form — namespace your own keys and put trainer config, resolved hyperparameters, dataset fingerprints, or anything else run-specific alongside the standard fields:

run.upload_environment({
    # ... standard fields above ...
    "trainer": {"framework": "my-trainer", "version": "2.3.1", "mixed_precision": "bf16"},
    "hyperparameters": resolved_config,          # your fully-resolved config
    "dataset": {"name": "ripple-v3", "sha256": dataset_hash, "n_examples": 50_000},
    "seed": 1234,
})
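
The dataset_hash value above is left to you. One way to produce it, sketched here with the standard library only, is to hash the dataset file in chunks, or to hash a directory by combining each file's relative path and content hash in sorted order. Both function names are hypothetical, and the directory scheme is one reasonable convention rather than anything Chronicle requires.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hex SHA-256 of one file, read in chunks so large files stay out of memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def sha256_of_tree(root: Path) -> str:
    """Fingerprint a directory from each file's relative path and content hash."""
    digest = hashlib.sha256()
    # Sorting makes the result independent of filesystem listing order.
    for file in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(file.relative_to(root).as_posix().encode())
        digest.update(b"\0")  # separator between path and content hash
        digest.update(sha256_of_file(file).encode())
    return digest.hexdigest()
```

Because the tree hash covers relative paths as well as contents, renaming a shard changes the fingerprint even when the bytes are identical.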

If you'd rather keep a payload as its own typed asset (e.g. a large resolved config you want to find on its own), upload it directly and attach structured metadata via asset_config:

run.upload_asset(
    asset_type="trainer_config",
    content=resolved_config,                      # inline JSON, auto-finalized
    asset_config={"schema_version": 3},
)

See the assets guide for inline vs. presigned upload paths.

Checkpointing & resume

Saving checkpoints

Write your checkpoint to a directory, then hand the directory to the SDK — it presigns, uploads each file directly to cloud storage, and finalizes on a background thread, recording upload state in a local SQLite UploadTracker so a crash mid-upload can resume:

from pathlib import Path
from methodic import UploadTracker

tracker = UploadTracker(db_path=Path("./uploads.sqlite"))

def checkpoint(step: int) -> None:
    ckpt = Path(f"./out/checkpoint-{step}")
    ckpt.mkdir(parents=True, exist_ok=True)
    save_state(ckpt)                              # ← your framework writes the files
    run.upload_directory_async(ckpt, asset_type="checkpoint", upload_tracker=tracker)

# ... call checkpoint(step) periodically; uploads drain in the background ...
run.succeed()                                    # blocks until pending uploads finish

save_state(ckpt) is the only framework-specific line. Pick the variant that matches your stack:

# Generic
def save_state(ckpt: Path) -> None:
    # Write whatever your loop needs to resume: weights, optimizer,
    # scheduler, RNG, step counter, as any files under `ckpt`.
    ...

# HuggingFace Accelerate
def save_state(ckpt: Path) -> None:
    accelerator.save_state(str(ckpt))        # model, optimizer, scheduler, scaler, RNG

# Plain PyTorch
def save_state(ckpt: Path) -> None:
    torch.save(
        {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "step": step,                    # current step, read from the enclosing scope
        },
        ckpt / "state.pt",
    )

Convention: use asset_type="checkpoint" for resumable mid-training state and asset_type="snapshot" for the final saved model (e.g. HF save_pretrained()). The managed menlo-park runner uses exactly these two types.

Resuming

A continuation run resumes from the latest ready checkpoint produced by an earlier run of the same variation. run.latest_output() finds it — by default it searches across all runs of the variation (the right scope, since the checkpoint came from a prior run), filters to finalized (state == "ready") assets, and returns the newest:

from pathlib import Path

run = chronicle.run(exp_id, variation, run_idx)
run.start()

resume_dir = Path("./resume")
ckpt = run.latest_output("checkpoint")           # newest ready checkpoint, or None
if ckpt:
    run.download_asset(ckpt["id"], resume_dir)    # streams every component to disk

Then load it with your framework, using the variant that mirrors your save_state:

# Generic
if ckpt:
    load_state(resume_dir)                   # mirror of your save_state

# HuggingFace Accelerate
if ckpt:
    accelerator.load_state(str(resume_dir))
    # or, with the HF Trainer:
    # trainer.train(resume_from_checkpoint=str(resume_dir))

# Plain PyTorch
if ckpt:
    state = torch.load(resume_dir / "state.pt")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    start_step = state["step"] + 1
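
For a loop with no framework at all, the save/load pair can be as small as one JSON file. The sketch below is an illustration under stated assumptions: the function names, the state.json file name, and the weights/step fields are all hypothetical, and unlike save_state(ckpt) above these helpers take the state as explicit arguments instead of reading it from the enclosing scope.

```python
import json
from pathlib import Path


def save_json_state(ckpt: Path, weights: list, step: int) -> None:
    """Write everything needed to resume into a single file under `ckpt`."""
    ckpt.mkdir(parents=True, exist_ok=True)
    tmp = ckpt / "state.json.tmp"
    tmp.write_text(json.dumps({"weights": weights, "step": step}))
    # Rename into place so the directory never holds a half-written state file.
    tmp.replace(ckpt / "state.json")


def load_json_state(resume_dir: Path) -> tuple:
    """Mirror of save_json_state: return (weights, start_step)."""
    state = json.loads((resume_dir / "state.json").read_text())
    # The saved step already ran, so training continues at the next one.
    return state["weights"], state["step"] + 1
```

Saving with step=4 and loading the same directory returns the original weights and a start step of 5, which matches the start_step = state["step"] + 1 convention used in the PyTorch variant.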

Re-downloading and re-uploading are safe: checkpoint uploads finalize atomically, and finalize is idempotent (see resumable uploads).

Listing outputs directly

latest_output() is sugar over the output-listing endpoints. To inspect or pick a checkpoint yourself:

# Across all runs of the variation (resume scope), newest-first:
outputs = run.list_outputs(across_runs=True)
# Just this run's outputs:
outputs = run.list_outputs()
# From a Variation handle (researcher side):
outputs = chronicle.variation(exp_id, variation).list_outputs()

Each entry is an asset dict with id, asset_type, name, state, components, and created_at. Agents driving Chronicle through the bundled plugin MCP server can list the same way with chronicle.list_outputs(experiment_id, variation=…, run=…).
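
Given those fields, picking a checkpoint yourself comes down to filtering and sorting plain dicts. The sketch below selects the newest finalized asset of a given type. It assumes created_at is an ISO-8601 string with a consistent timezone style, which the text above does not specify, and newest_ready_output is a hypothetical helper name.

```python
from datetime import datetime
from typing import Optional


def newest_ready_output(outputs: list, asset_type: str = "checkpoint") -> Optional[dict]:
    """Return the most recent asset of `asset_type` whose state is "ready"."""
    candidates = [
        a for a in outputs
        if a["asset_type"] == asset_type and a["state"] == "ready"
    ]
    if not candidates:
        return None

    def created(asset: dict) -> datetime:
        # Assumption: ISO-8601 timestamps; a trailing "Z" is normalized so
        # datetime.fromisoformat also accepts it on older Python versions.
        return datetime.fromisoformat(asset["created_at"].replace("Z", "+00:00"))

    return max(candidates, key=created)
```

Calling newest_ready_output(run.list_outputs(across_runs=True)) would then give the same kind of result the resume flow relies on, while leaving room to add your own filters, such as matching on name.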

Resuming from an input-linked checkpoint

Alternatively, a researcher or agent can explicitly link a checkpoint as a variation input — e.g. to branch training from another variation's checkpoint. Read it from the variation's inputs instead of its outputs:

for asset in chronicle.variation(exp_id, variation).list_inputs():
    if asset["asset_type"] == "checkpoint":
        run.download_asset(asset["id"], resume_dir)
        break
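
A trainer that supports both resume paths needs a precedence rule, and this page does not prescribe one. The sketch below assumes an explicitly linked input checkpoint should win over the latest output, since linking is a deliberate act by a researcher or agent. choose_resume_asset is a hypothetical helper that operates on the asset dicts only.

```python
from typing import Optional


def choose_resume_asset(inputs: list, latest_output: Optional[dict]) -> Optional[dict]:
    """Prefer the first input-linked checkpoint; otherwise use the latest output."""
    for asset in inputs:
        if asset["asset_type"] == "checkpoint":
            return asset
    # No linked checkpoint: fall back to whatever the variation produced (may be None).
    return latest_output
```

You would pass it chronicle.variation(exp_id, variation).list_inputs() and run.latest_output("checkpoint"), then download the returned asset's id with run.download_asset() when the result is not None.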