Capacity and cost guardrails
Purpose
Define the single-source contract for ScrumAI runtime capacity and spend protections exposed by Portal control-plane routes.
Canonical architecture (no hidden second architecture)
- Runtime truth is `apps/portal/app.py`, backed by Portal/TzenBoard runtime signals.
- Contract exposure is `/api/scrumai/*`.
- This contract does not introduce a second scheduler, worker fleet, or hidden control-plane.

Contract surfaces
- `GET /api/scrumai/guardrails`
  - Contract: `scrumai_capacity_cost_guardrails_v1`
  - Purpose: explicit limits, behavior contract, environment expectations, and runtime mapping.
- `GET /api/scrumai/queue/state`
  - Contract: `scrumai_queue_state_v1`
  - Purpose: dispatch readiness + `summary.monthly_cost_guard`.
- `GET /api/scrumai/agents/<agent_key>/runtime`
  - Contract: `scrumai_role_runtime_v1`
  - Purpose: per-role dispatch readiness + `monthly_cost_guard` snapshot.
- `GET /api/scrumai/bootstrap`
  - Contract: `scrumai_control_plane_bootstrap_v1`
  - Purpose: minimum control-plane route map including guardrails surface.

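
As a rough illustration of how a client might address the four surfaces above, the sketch below builds absolute URLs from the documented route paths. The helper name `scrumai_url`, the `SCRUMAI_ROUTES` keys, and the base URL are hypothetical and not part of the contract; response shapes are deliberately not modeled.

```python
from typing import Optional
from urllib.parse import quote, urljoin

# Route paths copied from the contract surfaces listed above.
# The dictionary keys are hypothetical labels chosen for this sketch.
SCRUMAI_ROUTES = {
    "guardrails": "/api/scrumai/guardrails",
    "queue_state": "/api/scrumai/queue/state",
    "role_runtime": "/api/scrumai/agents/{agent_key}/runtime",
    "bootstrap": "/api/scrumai/bootstrap",
}


def scrumai_url(base_url: str, surface: str, agent_key: Optional[str] = None) -> str:
    """Build the absolute URL for one contract surface (hypothetical helper)."""
    template = SCRUMAI_ROUTES[surface]
    if "{agent_key}" in template:
        if not agent_key:
            raise ValueError(f"surface {surface!r} requires an agent_key")
        # Percent-encode the key so it cannot inject extra path segments.
        template = template.format(agent_key=quote(agent_key, safe=""))
    return urljoin(base_url, template)
```

For example, `scrumai_url("http://localhost:8000", "guardrails")` yields the guardrails URL on an assumed local Portal instance.
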
Hard limits (runtime-enforced contract)
Exposed via /api/scrumai/guardrails.hard_limits:
- `global_concurrency = 1`
- `single_fallback_max_attempts = 2` (one primary + one fallback)
- `failure_max_attempts = 1` for timeout/crash terminal path
- `circuit_breaker_threshold` from `PORTAL_SCRUMAI_CIRCUIT_BREAKER_THRESHOLD` (default 3)
- live dispatch timeout bounds: `15..900` seconds
- SCRUMAI_APPLY write scope allowlisted to `docs/evidence/generated`

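
The hard limits above could be expressed roughly as follows. This is a minimal sketch, not the code in `app.py`: the function names are hypothetical, and the fallback to the default on an unparsable or non-positive threshold, as well as the exact allowlist check, are assumptions.

```python
import os
from pathlib import PurePosixPath

# Values taken from the hard limits listed above.
GLOBAL_CONCURRENCY = 1
SINGLE_FALLBACK_MAX_ATTEMPTS = 2  # one primary + one fallback
FAILURE_MAX_ATTEMPTS = 1          # timeout/crash terminal path
TIMEOUT_MIN_SECONDS = 15
TIMEOUT_MAX_SECONDS = 900
DEFAULT_BREAKER_THRESHOLD = 3
ALLOWED_WRITE_ROOT = PurePosixPath("docs/evidence/generated")


def clamp_dispatch_timeout(requested_seconds: float) -> int:
    """Clamp a requested live dispatch timeout into the 15..900 second bounds."""
    return int(max(TIMEOUT_MIN_SECONDS, min(TIMEOUT_MAX_SECONDS, requested_seconds)))


def circuit_breaker_threshold(environ=os.environ) -> int:
    """Read the breaker threshold from the environment, defaulting to 3.

    Assumption: unparsable or non-positive values fall back to the default.
    """
    raw = environ.get("PORTAL_SCRUMAI_CIRCUIT_BREAKER_THRESHOLD", "")
    try:
        value = int(raw)
    except ValueError:
        return DEFAULT_BREAKER_THRESHOLD
    return value if value > 0 else DEFAULT_BREAKER_THRESHOLD


def is_apply_write_allowed(relative_path: str) -> bool:
    """Allow writes only to files under docs/evidence/generated (assumed check)."""
    path = PurePosixPath(relative_path)
    if path.is_absolute() or ".." in path.parts:
        return False
    return ALLOWED_WRITE_ROOT in path.parents
```
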
Alert / block / fallback behavior contract
The runtime must name these behaviors explicitly (not implicit prose only):

| Behavior | Current trigger class | Current observable surfaces | Expected outcome |
|---|---|---|---|
| alert | blocked dispatch, failed terminal dispatch, breached monthly ceiling | /api/scrumai/hub latest dispatch issue; /api/scrumai/queue/state blocked reason code; /api/scrumai/guardrails monthly cost state | machine-readable reason code is visible to operators |
| block | lane health gate failure, telemetry write failure, allowlist rejection, cost breach, breaker trip | /api/scrumai/queue/state, /api/scrumai/agents/<agent_key>/runtime, /api/scrumai/agents/<agent_key>/runtime/start | dispatch denied or terminalized with explicit reason |
| fallback | primary dispatch failure on fallback-enabled lane | /api/scrumai/agents/<agent_key>/runtime/start and .../sync dispatch audit | max one fallback attempt, then terminal/blocked outcome |

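
The fallback row can be illustrated with a small sketch: one primary attempt, at most one fallback attempt on a fallback-enabled lane, then a terminal blocked outcome with a machine-readable reason. The `DispatchOutcome` type, the audit strings, and the `DISPATCH_FAILED_TERMINAL` reason code are hypothetical names invented for this example; the real runtime may structure this differently.

```python
from dataclasses import dataclass, field
from typing import Callable, List

SINGLE_FALLBACK_MAX_ATTEMPTS = 2  # one primary + one fallback


@dataclass
class DispatchOutcome:
    status: str                      # "succeeded" or "blocked"
    attempts: int
    reason_code: str = ""            # hypothetical machine-readable reason
    audit: List[str] = field(default_factory=list)


def dispatch_with_single_fallback(
    primary: Callable[[], None],
    fallback: Callable[[], None],
    fallback_enabled: bool,
) -> DispatchOutcome:
    """Try the primary target, then at most one fallback, then terminalize."""
    targets = [("primary", primary)]
    if fallback_enabled:
        targets.append(("fallback", fallback))

    audit: List[str] = []
    attempts = 0
    # The slice makes the attempt cap explicit even if more targets were added.
    for name, run in targets[:SINGLE_FALLBACK_MAX_ATTEMPTS]:
        attempts += 1
        try:
            run()
        except Exception as exc:
            audit.append(f"{name}:failed:{type(exc).__name__}")
            continue
        audit.append(f"{name}:succeeded")
        return DispatchOutcome("succeeded", attempts, audit=audit)

    # All permitted attempts failed: report a terminal, blocked outcome.
    return DispatchOutcome("blocked", attempts, "DISPATCH_FAILED_TERMINAL", audit)
```
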
Environment expectations

| Profile | Intended use | Guardrail expectation |
|---|---|---|
| prod | live operator lane | enforce strict single concurrency and bounded retry/fallback; expect monthly ceiling to be configured; treat force_dispatch as emergency-only override with audit |
| dev_test | migration/test/proof lane | same hard safety caps (single concurrency + bounded retry/fallback), but ceiling may be disabled (0.0) and controlled force_dispatch can be used to exercise guardrails |

Environment resolution is exposed via /api/scrumai/guardrails.environment_expectations. Unknown APP_ENV values currently default to the dev_test profile and are marked explicitly in the payload.
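
A minimal sketch of the resolution rule described above, assuming `APP_ENV` is compared case-insensitively against the two profile names. The function name and the payload keys (`profile`, `app_env`, `defaulted`) are hypothetical; the contract only states that the default is marked explicitly.

```python
from typing import Optional

KNOWN_PROFILES = ("prod", "dev_test")
DEFAULT_PROFILE = "dev_test"


def resolve_guardrail_profile(app_env: Optional[str]) -> dict:
    """Map an APP_ENV value to a guardrail profile, flagging defaults explicitly."""
    normalized = (app_env or "").strip().lower()
    if normalized in KNOWN_PROFILES:
        return {"profile": normalized, "app_env": app_env, "defaulted": False}
    # Unknown or missing values fall back to dev_test and say so in the payload.
    return {"profile": DEFAULT_PROFILE, "app_env": app_env, "defaulted": True}
```

A caller would typically pass `os.environ.get("APP_ENV")`; keeping the flag in the payload lets operators notice a misconfigured environment instead of silently running with dev_test expectations.
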
Runtime knob mapping (contract -> live runtime)
Exposed via /api/scrumai/guardrails.runtime_knob_mapping:

| Contract knob | Runtime source today | Operator-tunable now |
|---|---|---|
| max_parallel_jobs | fixed in app.py (global_concurrency=1) | no |
| single_fallback_max_attempts | SCRUMAI_SINGLE_FALLBACK_MAX_ATTEMPTS constant | no |
| failure_max_attempts | SCRUMAI_FAILURE_MAX_ATTEMPTS constant | no |
| circuit_breaker_threshold | PORTAL_SCRUMAI_CIRCUIT_BREAKER_THRESHOLD | yes |
| monthly_cost_window_days | PORTAL_SCRUMAI_MONTHLY_COST_WINDOW_DAYS | yes |
| monthly_cost_ceiling_usd | PORTAL_SCRUMAI_MONTHLY_COST_CEILING_USD | yes |
| monthly_cost_review_day_utc | PORTAL_SCRUMAI_MONTHLY_COST_REVIEW_DAY_UTC | yes |
| live_dispatch_timeout_seconds | live dispatch timeout clamp in app.py (15..900) | yes (request/env path) |

Known gaps (explicitly tracked)
Exposed via /api/scrumai/guardrails.known_gaps:
- `alert_delivery_sink_missing`: alert semantics are visible in API contracts, but no pager/webhook sink is wired yet.
- `prod_force_dispatch_policy_not_hard_enforced`: prod policy is documented as emergency-only, but runtime currently trusts caller `force_dispatch` + audit trail.
- `global_concurrency_not_operator_tunable`: max parallel jobs remains fixed at 1 in code.

Monthly cost guard and review contract
Runtime ceiling controls
- `PORTAL_SCRUMAI_MONTHLY_COST_WINDOW_DAYS` (default 30)
- `PORTAL_SCRUMAI_MONTHLY_COST_CEILING_USD` (default 0.0 = disabled)
- `PORTAL_SCRUMAI_MONTHLY_COST_REVIEW_DAY_UTC` (1..28, default 1)

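
One way these three knobs might be read is sketched below. The variable names and defaults come from the list above; the helper names, the returned keys, and the handling of bad input (falling back to defaults and clamping the review day into 1..28) are assumptions, since the contract does not say whether the runtime clamps or rejects out-of-range values.

```python
import os
from typing import Callable, Mapping


def _read_knob(environ: Mapping[str, str], name: str, cast: Callable, default):
    """Return the parsed value of an env knob, or the default if unset/invalid."""
    raw = environ.get(name)
    if raw is None or not raw.strip():
        return default
    try:
        return cast(raw)
    except ValueError:
        return default


def load_monthly_cost_knobs(environ: Mapping[str, str] = os.environ) -> dict:
    """Load the monthly cost guard knobs with their documented defaults."""
    window_days = max(1, _read_knob(environ, "PORTAL_SCRUMAI_MONTHLY_COST_WINDOW_DAYS", int, 30))
    ceiling_usd = max(0.0, _read_knob(environ, "PORTAL_SCRUMAI_MONTHLY_COST_CEILING_USD", float, 0.0))
    # Assumption: out-of-range review days are clamped into 1..28.
    review_day = min(28, max(1, _read_knob(environ, "PORTAL_SCRUMAI_MONTHLY_COST_REVIEW_DAY_UTC", int, 1)))
    return {
        "window_days": window_days,
        "ceiling_usd": ceiling_usd,
        "review_day_utc": review_day,
        "ceiling_enabled": ceiling_usd > 0,  # 0.0 means the ceiling is disabled
    }
```
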
Block behavior
- Cost snapshot is computed from `inference_events` over configured UTC rolling window.
- When `ceiling_usd > 0` and `spend_usd >= ceiling_usd`, dispatch is marked blocked on queue/runtime surfaces.
- Block reason includes `MONTHLY_COST_CEILING_REACHED`.
- Forced dispatch remains explicit operator override via runtime request payload (`telemetry.force_dispatch`).

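
The block rule above can be sketched as a pure function over a list of events. The event fields `created_at` and `cost_usd`, the function name, and the shape of the returned snapshot are assumptions made for illustration; only the comparison (`ceiling_usd > 0` and `spend_usd >= ceiling_usd`), the reason code, and the force-dispatch override come from this document.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, Optional

REASON_CEILING_REACHED = "MONTHLY_COST_CEILING_REACHED"


def evaluate_monthly_cost_guard(
    events: Iterable[dict],
    ceiling_usd: float,
    window_days: int = 30,
    now: Optional[datetime] = None,
    force_dispatch: bool = False,
) -> dict:
    """Sum spend over the rolling UTC window and decide whether dispatch is blocked."""
    now = now or datetime.now(timezone.utc)
    window_start = now - timedelta(days=window_days)

    # Only events inside the rolling window count toward spend.
    spend_usd = sum(
        e["cost_usd"] for e in events if window_start <= e["created_at"] <= now
    )

    breached = ceiling_usd > 0 and spend_usd >= ceiling_usd
    # An explicit operator override lets dispatch proceed; the breach stays visible.
    blocked = breached and not force_dispatch
    return {
        "spend_usd": spend_usd,
        "ceiling_usd": ceiling_usd,
        "breached": breached,
        "dispatch_blocked": blocked,
        "reason_code": REASON_CEILING_REACHED if breached else "",
        "force_dispatch_override": breached and force_dispatch,
    }
```

Keeping `breached` separate from `dispatch_blocked` means a forced dispatch still leaves an auditable record that the ceiling was exceeded.
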
Monthly review process
1. Confirm spend and remote spend for current window.
2. Decide if ceiling/env knobs need adjustment for next month.
3. Record decision, owner, and timestamp in the evidence artifact.

- Cadence: monthly on configured UTC review day.
- Owners: Product Owner, Scrum Master, Dan.
- Required inputs:
  - `GET /api/scrumai/guardrails` (monthly_cost_guard, review due fields)
  - `inference_events` spend totals for local/remote split
- Required evidence artifact pattern: `docs/evidence/task-245-monthly-cost-review-YYYY-MM.md`
- Required checks: