Route: /knowledge/portal/shared/capacity-and-cost-guardrails
Source: knowledge/portal/shared/capacity-and-cost-guardrails.md
Covered files: 5
Last generated: 2026-04-13T15:08:52.627606+00:00

Capacity and cost guardrails

State: deterministic sync completed

Purpose

Define the single-source contract for ScrumAI runtime capacity and spend protections exposed by Portal control-plane routes.

Canonical architecture (no hidden second architecture)

Runtime truth is apps/portal/app.py backed by Portal/TzenBoard runtime signals.
Contract exposure is /api/scrumai/*.
This contract does not introduce a second scheduler, worker fleet, or hidden control-plane.

Contract surfaces

GET /api/scrumai/guardrails
  Contract: scrumai_capacity_cost_guardrails_v1
  Purpose: explicit limits, behavior contract, environment expectations, and runtime mapping.
GET /api/scrumai/queue/state
  Contract: scrumai_queue_state_v1
  Purpose: dispatch readiness + summary.monthly_cost_guard.
GET /api/scrumai/agents/<agent_key>/runtime
  Contract: scrumai_role_runtime_v1
  Purpose: per-role dispatch readiness + monthly_cost_guard snapshot.
GET /api/scrumai/bootstrap
  Contract: scrumai_control_plane_bootstrap_v1
  Purpose: minimum control-plane route map including guardrails surface.
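A minimal sketch of how a client might pair each route above with its contract identifier. The route strings and contract names come from the list above; the helper names, the `scrum_master` agent key in the usage note, and the idea that a payload names its contract under a `"contract"` key are assumptions for illustration, not confirmed behavior of apps/portal/app.py.

```python
# Route template -> contract identifier, as listed in "Contract surfaces".
CONTRACTS = {
    "/api/scrumai/guardrails": "scrumai_capacity_cost_guardrails_v1",
    "/api/scrumai/queue/state": "scrumai_queue_state_v1",
    "/api/scrumai/agents/<agent_key>/runtime": "scrumai_role_runtime_v1",
    "/api/scrumai/bootstrap": "scrumai_control_plane_bootstrap_v1",
}

AGENT_RUNTIME_TEMPLATE = "/api/scrumai/agents/<agent_key>/runtime"


def route_for_agent(agent_key: str) -> str:
    """Fill the <agent_key> placeholder of the per-role runtime route."""
    return AGENT_RUNTIME_TEMPLATE.replace("<agent_key>", agent_key)


def matches_contract(route_template: str, payload: dict) -> bool:
    """Check a decoded JSON payload against the expected contract.

    Assumption: the payload carries its contract name under a "contract" key.
    """
    return payload.get("contract") == CONTRACTS[route_template]
```

For example, `route_for_agent("scrum_master")` yields `/api/scrumai/agents/scrum_master/runtime`, and a client could call `matches_contract` on each response before trusting its fields.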

Hard limits (runtime-enforced contract)

Exposed via /api/scrumai/guardrails.hard_limits:

global_concurrency = 1
single_fallback_max_attempts = 2 (one primary + one fallback)
failure_max_attempts = 1 for timeout/crash terminal path
circuit_breaker_threshold from PORTAL_SCRUMAI_CIRCUIT_BREAKER_THRESHOLD (default 3)
live dispatch timeout bounds: 15..900 seconds
SCRUMAI_APPLY write scope allowlisted to docs/evidence/generated
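Two of the limits above are numeric bounds that are easy to illustrate. The sketch below clamps a requested live dispatch timeout into 15..900 seconds and reads the circuit breaker threshold from PORTAL_SCRUMAI_CIRCUIT_BREAKER_THRESHOLD with a default of 3. The function names are hypothetical, and falling back to the default on unparsable or non-positive values is an assumption; the source only states the bounds, the variable name, and the default.

```python
import os

LIVE_DISPATCH_TIMEOUT_MIN_SECONDS = 15
LIVE_DISPATCH_TIMEOUT_MAX_SECONDS = 900
DEFAULT_CIRCUIT_BREAKER_THRESHOLD = 3


def clamp_dispatch_timeout(requested_seconds: float) -> float:
    """Force a requested timeout into the documented 15..900 second range."""
    return max(
        LIVE_DISPATCH_TIMEOUT_MIN_SECONDS,
        min(LIVE_DISPATCH_TIMEOUT_MAX_SECONDS, requested_seconds),
    )


def circuit_breaker_threshold(environ=None) -> int:
    """Read the breaker threshold from the environment, defaulting to 3."""
    env = os.environ if environ is None else environ
    raw = env.get("PORTAL_SCRUMAI_CIRCUIT_BREAKER_THRESHOLD", "")
    try:
        value = int(raw)
    except ValueError:
        # Assumption: unparsable values fall back to the default.
        return DEFAULT_CIRCUIT_BREAKER_THRESHOLD
    # Assumption: zero or negative thresholds are treated as unset.
    return value if value >= 1 else DEFAULT_CIRCUIT_BREAKER_THRESHOLD
```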

Alert / block / fallback behavior contract

The runtime must name these behaviors explicitly rather than leaving them implied by prose:

| Behavior | Current trigger class | Current observable surfaces | Expected outcome |
|---|---|---|---|
| alert | blocked dispatch, failed terminal dispatch, breached monthly ceiling | /api/scrumai/hub latest dispatch issue; /api/scrumai/queue/state blocked reason code; /api/scrumai/guardrails monthly cost state | machine-readable reason code is visible to operators |
| block | lane health gate failure, telemetry write failure, allowlist rejection, cost breach, breaker trip | /api/scrumai/queue/state, /api/scrumai/agents/<agent_key>/runtime, /api/scrumai/agents/<agent_key>/runtime/start | dispatch denied or terminalized with explicit reason |
| fallback | primary dispatch failure on fallback-enabled lane | /api/scrumai/agents/<agent_key>/runtime/start and .../sync dispatch audit | max one fallback attempt, then terminal/blocked outcome |
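The fallback row can be read as a bounded attempt loop: one primary attempt, at most one fallback attempt on a fallback-enabled lane, then a terminal outcome. The sketch below illustrates that shape only. The function name, the boolean-returning callables, and the `"completed"` / `"terminal"` outcome labels are hypothetical and do not mirror the runtime's actual dispatch code or reason codes.

```python
from typing import Callable, List, Tuple

SINGLE_FALLBACK_MAX_ATTEMPTS = 2  # one primary + one fallback, per the hard limits


def dispatch_with_single_fallback(
    primary: Callable[[], bool],
    fallback: Callable[[], bool],
    fallback_enabled: bool,
) -> Tuple[str, List[str]]:
    """Run the primary attempt, then at most one fallback attempt.

    Each callable returns True on success. Returns (outcome, audit_trail).
    """
    audit: List[str] = []
    attempts = [("primary", primary)]
    if fallback_enabled:
        attempts.append(("fallback", fallback))

    # The slice keeps the loop bounded even if more attempts were ever queued.
    for name, attempt in attempts[:SINGLE_FALLBACK_MAX_ATTEMPTS]:
        ok = attempt()
        audit.append(f"{name}:{'ok' if ok else 'failed'}")
        if ok:
            return "completed", audit

    # No attempts left: the dispatch ends in a terminal outcome.
    return "terminal", audit
```

The audit list stands in for the dispatch audit surface named in the table; a real implementation would record reason codes rather than plain strings.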

Environment expectations

| Profile | Intended use | Guardrail expectation |
|---|---|---|
| prod | live operator lane | enforce strict single concurrency and bounded retry/fallback; expect monthly ceiling to be configured; treat force_dispatch as emergency-only override with audit |
| dev_test | migration/test/proof lane | same hard safety caps (single concurrency + bounded retry/fallback), but ceiling may be disabled (0.0) and controlled force_dispatch can be used to exercise guardrails |

Environment resolution is exposed via /api/scrumai/guardrails.environment_expectations. Unknown APP_ENV values currently default to the dev_test profile and are marked explicitly in the payload.
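A minimal sketch of the resolution rule described above, in which an unknown APP_ENV value falls back to dev_test and the fallback is flagged. It assumes APP_ENV values match the profile names exactly; the function name and the `defaulted` / `raw_app_env` keys are hypothetical and are not the actual payload fields.

```python
from typing import Optional

KNOWN_PROFILES = ("prod", "dev_test")
DEFAULT_PROFILE = "dev_test"


def resolve_profile(app_env: Optional[str]) -> dict:
    """Map an APP_ENV value to a guardrail profile.

    Unknown or missing values resolve to dev_test and are marked explicitly.
    """
    if app_env in KNOWN_PROFILES:
        return {"profile": app_env, "defaulted": False}
    return {"profile": DEFAULT_PROFILE, "defaulted": True, "raw_app_env": app_env}
```

Keeping the raw value alongside the resolved profile lets an operator spot a misspelled APP_ENV instead of silently running with dev_test expectations.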

Runtime knob mapping (contract -> live runtime)

Exposed via /api/scrumai/guardrails.runtime_knob_mapping:

| Contract knob | Runtime source today | Operator-tunable now |
|---|---|---|
| max_parallel_jobs | fixed in app.py (global_concurrency=1) | no |
| single_fallback_max_attempts | SCRUMAI_SINGLE_FALLBACK_MAX_ATTEMPTS constant | no |
| failure_max_attempts | SCRUMAI_FAILURE_MAX_ATTEMPTS constant | no |
| circuit_breaker_threshold | PORTAL_SCRUMAI_CIRCUIT_BREAKER_THRESHOLD | yes |
| monthly_cost_window_days | PORTAL_SCRUMAI_MONTHLY_COST_WINDOW_DAYS | yes |
| monthly_cost_ceiling_usd | PORTAL_SCRUMAI_MONTHLY_COST_CEILING_USD | yes |
| monthly_cost_review_day_utc | PORTAL_SCRUMAI_MONTHLY_COST_REVIEW_DAY_UTC | yes |
| live_dispatch_timeout_seconds | live dispatch timeout clamp in app.py (15..900) | yes (request/env path) |

Known gaps (explicitly tracked)

Exposed via /api/scrumai/guardrails.known_gaps:

alert_delivery_sink_missing: alert semantics are visible in API contracts, but no pager/webhook sink is wired yet.
prod_force_dispatch_policy_not_hard_enforced: prod policy is documented as emergency-only, but runtime currently trusts caller force_dispatch + audit trail.
global_concurrency_not_operator_tunable: max parallel jobs remains fixed at 1 in code.

Monthly cost guard and review contract

Runtime ceiling controls

PORTAL_SCRUMAI_MONTHLY_COST_WINDOW_DAYS (default 30)
PORTAL_SCRUMAI_MONTHLY_COST_CEILING_USD (default 0.0 = disabled)
PORTAL_SCRUMAI_MONTHLY_COST_REVIEW_DAY_UTC (1..28, default 1)
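The three variables above could be loaded roughly as follows. The variable names, defaults, and the 1..28 range come from the list above; the `MonthlyCostKnobs` dataclass, the helper names, falling back to defaults on unparsable values, and clamping (rather than rejecting) out-of-range review days are assumptions.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class MonthlyCostKnobs:
    window_days: int
    ceiling_usd: float  # 0.0 means the ceiling is disabled
    review_day_utc: int


def _read(env, name, cast, default):
    """Return cast(env[name]), or the default when unset or unparsable."""
    try:
        return cast(env[name])
    except (KeyError, ValueError):
        return default


def load_monthly_cost_knobs(environ=None) -> MonthlyCostKnobs:
    env = os.environ if environ is None else environ
    window_days = _read(env, "PORTAL_SCRUMAI_MONTHLY_COST_WINDOW_DAYS", int, 30)
    ceiling_usd = _read(env, "PORTAL_SCRUMAI_MONTHLY_COST_CEILING_USD", float, 0.0)
    review_day = _read(env, "PORTAL_SCRUMAI_MONTHLY_COST_REVIEW_DAY_UTC", int, 1)
    # Assumption: out-of-range review days are clamped into 1..28.
    review_day = max(1, min(28, review_day))
    return MonthlyCostKnobs(window_days, ceiling_usd, review_day)
```

Capping the review day at 28 keeps the cadence valid in every month, including February.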

Block behavior

The cost snapshot is computed from inference_events over the configured rolling UTC window.
When ceiling_usd > 0 and spend_usd >= ceiling_usd, dispatch is marked blocked on the queue and runtime surfaces.
The block reason includes MONTHLY_COST_CEILING_REACHED.
Forced dispatch remains an explicit operator override via the runtime request payload (telemetry.force_dispatch).
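The block rule above reduces to a small predicate over the window spend. In this sketch, inference events are modeled as hypothetical `(timestamp, cost_usd)` tuples, since the real inference_events schema is not described here. Treating force_dispatch as bypassing the block at dispatch time is an assumption drawn from the override wording; the queue and runtime surfaces may still report the breach.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, Optional, Tuple

MONTHLY_COST_CEILING_REACHED = "MONTHLY_COST_CEILING_REACHED"


def window_spend_usd(
    events: Iterable[Tuple[datetime, float]],
    now: datetime,
    window_days: int = 30,
) -> float:
    """Sum event costs whose UTC timestamps fall inside the rolling window."""
    cutoff = now - timedelta(days=window_days)
    return sum(cost for ts, cost in events if cutoff <= ts <= now)


def monthly_cost_block_reason(
    spend_usd: float,
    ceiling_usd: float,
    force_dispatch: bool = False,
) -> Optional[str]:
    """Return the block reason code, or None when dispatch may proceed."""
    ceiling_enabled = ceiling_usd > 0  # 0.0 disables the guard
    breached = ceiling_enabled and spend_usd >= ceiling_usd
    if breached and not force_dispatch:
        return MONTHLY_COST_CEILING_REACHED
    return None
```

Note that the comparison is `>=`, so spend exactly equal to the ceiling already blocks dispatch.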

Monthly review process

1. Confirm spend and remote spend for current window.
2. Decide if ceiling/env knobs need adjustment for next month.
3. Record decision, owner, and timestamp in the evidence artifact.

Cadence: monthly on configured UTC review day.
Owners: Product Owner, Scrum Master, Dan.
Required inputs:
  GET /api/scrumai/guardrails (monthly_cost_guard, review due fields)
  inference_events spend totals for local/remote split
Required evidence artifact pattern:
  docs/evidence/task-245-monthly-cost-review-YYYY-MM.md
Required checks:

Metadata

shared state: clean mode: manual-assisted ScrumAI public-safe

Runtime

Status: clean
Reason: deterministic sync completed
Doc ID: portal-capacity-cost-guardrails

Covered files

apps/portal/app.py
apps/portal/tests/test_routes_positive.py
apps/portal/tests/test_scrumai_control_plane_api.py
knowledge/obsidian/TzenAI/ScrumAI/docs-v2/pre-build-missing-areas.md
knowledge/portal/shared/capacity-and-cost-guardrails.md