Calibration against real data: the backtest loop
The one structural moat: predicted behavior is checked against what actually happened, the gap is measured as a first-class metric, and corrections feed back into the personas — so calibration compounds over time.
The loop
    PREDICT   predicted_behavior primitives (canonical likelihood) + hypothesis bets
       ↓      …stamped BEFORE reality answers
    OBSERVE   record_prediction_outcome(project, prediction_ref, observed, source)
       ↓      bool or rate 0..1; the source Ref must resolve; Brier is DERIVED
    MEASURE   calibration_report → mean Brier + hit rate + the reliability curve,
       ↓      persisted; calibration_trend reads the series (the Brier delta)
    CORRECT   brief_calibration → the misses with their evidence trails →
              the HOST authors corrections (update_persona / record_grounding) →
              record_calibration_round stamps the round
       ↺      …and the next reports show whether subsequent predictions improved
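The OBSERVE step accepts a bool or a rate and derives the score from it. A minimal Python sketch of that derivation follows; it is not the server's implementation. `ScoredOutcome` and `score_outcome` are hypothetical names, and the hit rule (prediction and outcome on the same side of 0.5) is an assumption, since the text here does not define how the hit flag is computed.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class ScoredOutcome:
    likelihood: float  # predicted probability, 0..1
    observed: float    # 1.0 / 0.0 for a bool, or an observed rate in 0..1
    brier: float       # derived: (likelihood - observed) ** 2
    hit: bool          # derived; the 0.5 threshold is an assumption of this sketch


def score_outcome(likelihood: float, observed: Union[bool, float]) -> ScoredOutcome:
    """Derive the Brier score and a hit flag from a prediction and its outcome."""
    value = float(observed)  # True -> 1.0, False -> 0.0, rates pass through
    if not 0.0 <= likelihood <= 1.0:
        raise ValueError("likelihood must be in 0..1")
    if not 0.0 <= value <= 1.0:
        raise ValueError("observed must be a bool or a rate in 0..1")
    brier = (likelihood - value) ** 2
    # Assumed hit rule: prediction and outcome fall on the same side of 0.5.
    hit = (likelihood >= 0.5) == (value >= 0.5)
    return ScoredOutcome(likelihood, value, brier, hit)


confident_miss = score_outcome(0.9, False)  # large squared error, not a hit
coin_flip = score_outcome(0.5, True)        # Brier of exactly 0.25
```

Because only the likelihood and the observed value are inputs, a caller cannot supply a score of its own; anything that reads the record can recompute it.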
- Scoring is derived, never asserted. `record_prediction_outcome` keeps the predicted likelihood and the observed value on the auditable record; the Brier score (likelihood − observed)² and the hit flag follow mechanically. Hypotheses keep their own verdict path (`record_hypothesis_result`); the report combines both sides.
- The reliability curve groups outcomes by likelihood level (rare … certain) and compares predicted mean vs observed frequency; a calibrated cohort tracks the diagonal.
- Trend = the metric. Every `calibration_report` persists into the `eval_reports` series; `calibration_trend` returns the points, the first→last Brier delta, and `improving`. `green` means decisive data at or under the coin-flip line (Brier ≤ 0.25).
- Corrections are host-authored (the server never writes text): patch the trait that drove the miss (`update_persona`, with the miss as the reason) or ground the persona in the observed reality (`ingest_corpus` → `record_grounding`). Correct patterns, not single data points.
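The reliability curve and the trend described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the report's actual code: `reliability_curve` and `brier_trend` are hypothetical names, outcomes are grouped by their exact predicted likelihood rather than by the named levels, `improving` is assumed to mean a negative first→last delta, and the "decisive data" part of the `green` check is not modeled.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple

COIN_FLIP_BRIER = 0.25  # Brier score of always predicting 0.5


def reliability_curve(
    outcomes: List[Tuple[float, float]],
) -> Dict[float, Tuple[float, int]]:
    """Group (likelihood, observed) pairs by predicted likelihood.

    Returns {likelihood: (observed_frequency, count)}. A calibrated cohort
    has observed_frequency close to the likelihood at every level.
    """
    buckets: Dict[float, List[float]] = defaultdict(list)
    for likelihood, observed in outcomes:
        buckets[likelihood].append(observed)
    return {
        level: (mean(values), len(values))
        for level, values in sorted(buckets.items())
    }


def brier_trend(series: List[float]) -> Tuple[float, bool, bool]:
    """Return (first-to-last delta, improving, at_or_under_coin_flip)."""
    if len(series) < 2:
        raise ValueError("a trend needs at least two reports")
    delta = series[-1] - series[0]
    # Assumption: lower Brier is better, so a negative delta counts as improving.
    return delta, delta < 0, series[-1] <= COIN_FLIP_BRIER


# Made-up outcomes purely for illustration: (predicted likelihood, observed).
outcomes = [
    (0.1, 0.0), (0.1, 0.0),
    (0.5, 1.0), (0.5, 0.0),
    (0.9, 1.0), (0.9, 1.0), (0.9, 0.0),
]
curve = reliability_curve(outcomes)

# Made-up mean Brier values from three successive reports.
delta, improving, under_line = brier_trend([0.31, 0.27, 0.22])
```

In this made-up data the 0.5 level sits on the diagonal, while the 0.9 level is overconfident (about two thirds observed), which is the kind of pattern a correction round would target.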
Events & surface
`prediction.scored` and `calibration.round_recorded` ride the lifecycle-hook
transport (docs/lifecycle-hooks.md); wire a recurring cloud job or a Slack
alert to either. MCP: `record_prediction_outcome`, `calibration_report`,
`calibration_trend`, `brief_calibration`, `record_calibration_round`.
CLI: `calibration-report`, `calibration-trend`.
Survey responses are a natural outcome source: import real responses
(`import_survey_responses`), then point `record_prediction_outcome.source` at
the survey record. The archetype packs (sonaloop-data) consume the calibrated
corpora this loop produces.
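As a rough sketch of the survey path, the snippet below turns a batch of responses into an observed rate and assembles the arguments for a `record_prediction_outcome` call. The parameter names come from the signature shown in the loop; everything else is an assumption: the helper names are hypothetical, Refs are treated as plain strings, the ref values are placeholders, and how the call is actually sent depends on your MCP client.

```python
from typing import Any, Dict, List


def observed_rate(responses: List[bool]) -> float:
    """Fraction of respondents who showed the predicted behavior."""
    if not responses:
        raise ValueError("no responses to score against")
    return sum(responses) / len(responses)


def outcome_arguments(
    project: str,
    prediction_ref: str,
    survey_ref: str,
    responses: List[bool],
) -> Dict[str, Any]:
    """Build the arguments for a record_prediction_outcome call."""
    return {
        "project": project,
        "prediction_ref": prediction_ref,
        "observed": observed_rate(responses),  # a rate in 0..1
        "source": survey_ref,  # must resolve to the imported survey record
    }


args = outcome_arguments(
    project="example-project",          # hypothetical project name
    prediction_ref="<prediction ref>",  # placeholder, not a real Ref format
    survey_ref="<survey record ref>",   # placeholder, not a real Ref format
    responses=[True, False, True, True],
)
```

Only the observed rate and the source are passed; the Brier score is left for the server to derive.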